AI Safety Research

Discovering and
mitigating LLM
failure modes

We systematically uncover and understand the hidden vulnerabilities of large language models to build safer, more reliable AI systems.

Research Focus

/ Core methodologies

01

Adaptive Stress Testing

Using reinforcement learning agents to systematically discover vulnerabilities and failure modes in LLMs

02

Deterministic Methods

Ensuring reproducibility through batch-invariant inference to reliably trigger and analyze genuine model weaknesses

03

Failure Mode Analysis

Analyzing hidden state representations and decoding adversarial triggers to identify patterns, create taxonomies, and reveal linguistic structures that cause failures

Founding Team

/ Researchers from Norwegian University of Science and Technology (NTNU) and UC Berkeley