This repo supports training and evaluation of the OAT (Obfuscated Adversarial Training) jailbreak defense proposed by Bailey et al. OAT is a post-training routine that improves the adversarial robustness of latent-space probes.
The broader aim of this repo is to provide an evaluation framework for comparing LLM jailbreaks against both finetuning defenses and latent-space probes, plotting FLOP cost (à la Boreiko et al.) against StrongREJECT score (Souly et al.).
Repo overview:
- `oat_evaluation/` contains the evaluation framework.
- `oat_training/` adapts Bailey et al.'s original OAT training code.
Run `installation.sh` to get started, then run `oat_evaluation/scripts/eval_scripts/full_eval_test_multi_gpu.py` to kick off evaluations.
More compute-expensive OAT runs (more OAT steps, or more mid-training attack steps) yield predictably lower StrongREJECT scores against soft-suffix attacks.
Details: the graph below aggregates 1077 tests. Each test learns a universal soft suffix on 1000 harmful samples, evaluates the resulting jailbreak on 100 held-out samples, and records the StrongREJECT (SR) score of the outputs. Attacker FLOPs vary with the suffix token length and the number of attacker training steps.
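For intuition on the attacker-FLOP axis, the per-test cost can be estimated from the suffix length and number of optimization steps. The sketch below is an assumption about how such accounting might work (the function name, the ~2 × params × tokens forward-pass heuristic, and the 3x forward+backward multiplier are illustrative, not the repo's actual accounting):

```python
def estimate_attacker_flops(model_params: int,
                            prompt_tokens: int,
                            suffix_tokens: int,
                            attack_steps: int,
                            samples_per_step: int) -> int:
    """Rough attacker-FLOP estimate for a soft-suffix attack.

    Assumes ~2 * params * tokens FLOPs per forward pass, with the
    backward pass costing roughly 2x the forward (3x total per sample).
    """
    tokens = prompt_tokens + suffix_tokens
    flops_per_sample = 2 * model_params * tokens * 3
    return flops_per_sample * samples_per_step * attack_steps
```

Under this heuristic, attacker cost scales linearly in both suffix length and training steps, which is why those two knobs set the x-axis position of each test.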
However, results so far against hard-token attacks are disappointing: OAT performs no better than the base model against 20 PAIR attacks, while Circuit Breakers resists them easily:
Currently implemented attacks:
- Soft-suffixes
- Soft-perturbations
- PAIR
- FLRT
- GCG
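To illustrate the soft-suffix family (this is a toy sketch, not the repo's implementation): the attacker appends freely optimizable continuous embeddings to the prompt and updates them by gradient descent on a differentiable objective. Here the "model" is a single random linear map and the objective pulls the pooled activation toward a target vector; all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, suffix_len = 8, 4

W = rng.normal(size=(d_embed, d_embed))   # toy stand-in for the model: one linear map
target = rng.normal(size=d_embed)         # activation pattern the attacker wants to induce

# Soft suffix: continuous embeddings, unconstrained by the token vocabulary.
suffix = np.zeros((suffix_len, d_embed))
lr = 0.05
for _ in range(500):
    mean_act = (suffix @ W).mean(axis=0)  # pooled "activation" of the suffix
    # Gradient of ||mean_act - target||^2 with respect to each suffix row.
    grad_row = (2.0 / suffix_len) * (mean_act - target) @ W.T
    suffix -= lr * grad_row               # same update broadcast to every row

loss = float(np.sum(((suffix @ W).mean(axis=0) - target) ** 2))
```

The key property this captures is that soft suffixes bypass the discreteness of tokens entirely, which is why they are far cheaper to optimize than hard-token attacks like GCG.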
Currently available defenses:
- OAT
- LAT (checkpoints)
- Circuit Breakers (checkpoints)
- Refat (implementation)
