This repo supports training and evaluation of the OAT (Obfuscated Adversarial Training) jailbreak defense proposed by Bailey et al. OAT is a post-training routine that improves the adversarial robustness of latent-space probes.
The broader aim of this repo is to provide an evaluation framework for comparing LLM jailbreaks against both finetuning defenses and latent-space probes, plotting FLOP cost (à la Boreiko et al.) against StrongREJECT score (Souly et al.).
Repo overview:
- `oat_evaluation/` contains the evaluation framework.
- `oat_training/` adapts Bailey et al.'s original OAT training code.
Run `installation.sh` to get started, then run `oat_evaluation/scripts/eval_scripts/full_eval_test_multi_gpu.py` to kick off evaluations.
More compute-expensive OAT runs (more OAT steps, or more mid-training attack steps) yield predictably lower StrongREJECT scores against soft-suffix attacks.
Details: the graph below aggregates 1077 tests. Each test learns a universal soft suffix on 1000 harmful samples, evaluates the resulting jailbreak on 100 held-out samples, and records the StrongREJECT (SR) score of the outputs. Attacker FLOPs vary with the suffix token length and the number of attacker training steps.
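For intuition on the attacker-FLOP axis, the per-test cost can be estimated from the suffix length and number of optimization steps. The sketch below is an assumption about how such accounting might work (the function name, the ~2 × params × tokens forward-pass heuristic, and the 3x forward+backward multiplier are illustrative, not the repo's actual accounting):

```python
def estimate_attacker_flops(model_params: int,
                            prompt_tokens: int,
                            suffix_tokens: int,
                            attack_steps: int,
                            samples_per_step: int) -> int:
    """Rough attacker-FLOP estimate for a soft-suffix attack.

    Assumes ~2 * params * tokens FLOPs per forward pass, with the
    backward pass costing roughly 2x the forward (3x total per sample).
    """
    tokens = prompt_tokens + suffix_tokens
    flops_per_sample = 2 * model_params * tokens * 3
    return flops_per_sample * samples_per_step * attack_steps
```

Under this heuristic, attacker cost scales linearly in both suffix length and training steps, which is why those two knobs set the x-axis position of each test.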
However, results so far against hard-token attacks are disappointing: OAT performs no better than the base model against 20 PAIR attacks, while Circuit Breakers resists them easily:
Currently implemented attacks:
- Soft-suffixes
- Soft-perturbations
- PAIR
- FLRT
- GCG
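To illustrate the soft-suffix family (this is a toy sketch, not the repo's implementation): the attacker appends freely optimizable continuous embeddings to the prompt and updates them by gradient descent on a differentiable objective. Here the "model" is a single random linear map and the objective pulls the pooled activation toward a target vector; all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, suffix_len = 8, 4

W = rng.normal(size=(d_embed, d_embed))   # toy stand-in for the model: one linear map
target = rng.normal(size=d_embed)         # activation pattern the attacker wants to induce

# Soft suffix: continuous embeddings, unconstrained by the token vocabulary.
suffix = np.zeros((suffix_len, d_embed))
lr = 0.05
for _ in range(500):
    mean_act = (suffix @ W).mean(axis=0)  # pooled "activation" of the suffix
    # Gradient of ||mean_act - target||^2 with respect to each suffix row.
    grad_row = (2.0 / suffix_len) * (mean_act - target) @ W.T
    suffix -= lr * grad_row               # same update broadcast to every row

loss = float(np.sum(((suffix @ W).mean(axis=0) - target) ** 2))
```

The key property this captures is that soft suffixes bypass the discreteness of tokens entirely, which is why they are far cheaper to optimize than hard-token attacks like GCG.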
Currently available defenses:
- OAT
- LAT (checkpoints)
- Circuit Breakers (checkpoints)
- Refat (implementation)
