Comparing LLM attacks (jailbreaks) & defenses (post-training routines + probes), focusing on OAT.

mgm52/oat-2025

Obfuscated Adversarial Training

This repo trains and evaluates the OAT jailbreak defense (Obfuscated Adversarial Training) proposed by Bailey et al. OAT is a post-training routine that improves the adversarial robustness of latent-space probes.

The broader aim of this repo is to provide an evaluation framework for comparing LLM jailbreaks, finetuning defenses, and latent-space probes against one another, measured by attacker FLOP cost (à la Boreiko et al.) and StrongREJECT score (per Souly et al.).
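As a rough illustration of that comparison, the sketch below groups evaluation results by defense and by order of magnitude of attacker FLOPs, then averages the StrongREJECT score in each group. It is not taken from oat_evaluation/: the `EvalResult` record, its field names, and `mean_sr_by_flop_decade` are hypothetical, and the assumption that SR scores lie in [0, 1] with higher meaning a more successful jailbreak should be checked against the StrongREJECT implementation in use.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalResult:
    """One attack-vs-defense test (hypothetical schema, not the repo's)."""
    defense: str           # e.g. "base", "oat"
    attack: str            # e.g. "soft-suffix", "pair"
    attacker_flops: float  # estimated compute spent by the attacker
    sr_score: float        # StrongREJECT score, assumed to lie in [0, 1]


def mean_sr_by_flop_decade(results):
    """Average SR score per (defense, floor(log10(attacker FLOPs))) bucket."""
    buckets = defaultdict(list)
    for r in results:
        decade = math.floor(math.log10(r.attacker_flops))
        buckets[(r.defense, decade)].append(r.sr_score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Bucketing by FLOP decade is only one possible choice; it makes defenses comparable at similar attacker budgets without requiring every test to use exactly the same compute.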

Repo overview:

  • oat_evaluation/ contains the evaluation framework.
  • oat_training/ adapts Bailey et al's original OAT training code.

Run installation.sh to get started, then run oat_evaluation/scripts/eval_scripts/full_eval_test_multi_gpu.py to kick off evals.


Early results

More compute-expensive OAT runs (more OAT steps or more mid-training attack steps) produce predictably lower StrongREJECT scores against soft-suffix attacks.

Details: The graph below aggregates 1077 tests. Each test learns a universal soft-suffix against 1000 harmful samples, then evaluates the resulting jailbreak on 100 held-out samples and scores the model outputs with StrongREJECT (SR). Attacker FLOPs vary with suffix token length and number of attacker training steps.
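To make the dependence on suffix length and training steps concrete, here is a minimal back-of-the-envelope estimate. It is a sketch under stated assumptions, not the accounting used in this repo or by Boreiko et al.: it uses the common heuristic of roughly 2 FLOPs per parameter per token for a forward pass, and costs the backward pass at a configurable multiple of the forward pass (2 is the usual figure for full training; an attack that only needs gradients with respect to the suffix embeddings may be cheaper). All function and parameter names are hypothetical.

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Approximate forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens


def soft_suffix_attack_flops(
    n_params: int,
    prompt_tokens: int,
    suffix_tokens: int,
    target_tokens: int,
    train_steps: int,
    batch_size: int,
    backward_to_forward_ratio: int = 2,
) -> int:
    """Rough FLOPs spent training one soft suffix.

    Every training step runs one forward and one backward pass over
    prompt + suffix + target tokens for each sample in the batch.
    """
    tokens_per_step = (prompt_tokens + suffix_tokens + target_tokens) * batch_size
    fwd = forward_flops(n_params, tokens_per_step)
    per_step = fwd * (1 + backward_to_forward_ratio)
    return train_steps * per_step


# Example with a hypothetical 1B-parameter model.
estimate = soft_suffix_attack_flops(
    n_params=1_000_000_000,
    prompt_tokens=50,
    suffix_tokens=10,
    target_tokens=20,
    train_steps=100,
    batch_size=4,
)
print(f"{estimate:.3e} FLOPs")
```

Under this estimate, cost scales linearly with the number of training steps and grows with suffix length only through the total sequence length, so doubling the steps doubles the budget while lengthening a short suffix changes it far less.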

However, results so far against hard-token attacks are disappointing, with OAT performing no better against 20 PAIR attacks than the base model, despite Circuit Breakers overcoming them easily:

image


Status

Currently implemented attacks:

  • Soft-suffixes
  • Soft-perturbations
  • PAIR
  • FLRT
  • GCG

Currently acquired defenses:
