A research toolkit providing a bag of techniques to jailbreak OpenAI’s open-source models.
```shell
# Note: we tested on an A100 GPU with CUDA 12.1 (cu121).
uv venv
source .venv/bin/activate
uv pip install -e .
```
- reproduce the findings

  ```shell
  python reproduce_finding_issues.py --json_path demo/breakoss-findings-1-cotbypass.json
  python reproduce_finding_issues.py --json_path demo/breakoss-findings-2-fake-over-refusal-cotbypass.json
  python reproduce_finding_issues.py --json_path demo/breakoss-findings-3-coercive-optimization.json
  python reproduce_finding_issues.py --json_path demo/breakoss-findings-4-intent-hijack.json
  python reproduce_finding_issues.py --json_path demo/breakoss-findings-5-plan-injection.json
  ```
- run the first method: Structural CoT Bypass

  ```shell
  python examples/1_cot_bypass.py one_example
  ```

- run the second method: Fake Over-Refusal

  ```shell
  python examples/2_fake_overrefusal.py one_example
  ```

- run the third method: Coercive Optimization

  ```shell
  python examples/3_gcg.py
  python examples/3-1_gcg_transfer.py
  ```

- run the fourth method: Intent Hijack

  ```shell
  python examples/4_intent_hijack.py one_example
  ```

- run the fifth method: Plan Injection

  ```shell
  python examples/5_plan_injection.py one_example
  ```

- evaluate Structural CoT Bypass on StrongReject or HarmfulBehaviors

  ```shell
  python examples/1_cot_bypass.py main --dataset_name=StrongReject
  python examples/1_cot_bypass.py main --dataset_name=HarmfulBehaviors
  ```

- evaluate Intent Hijack on StrongReject

  ```shell
  python examples/4_intent_hijack.py eval_on_strongreject
  ```

- evaluate Plan Injection on StrongReject

  ```shell
  python examples/5_plan_injection.py eval_on_strongreject
  ```

The code in this package is licensed under the MIT License.
If you use this framework in your research, please cite:
```bibtex
@article{chen2025bag,
  title={Bag of Tricks for Subverting Reasoning-based Safety Guardrails},
  author={Chen, Shuo and Han, Zhen and Chen, Haokun and He, Bailan and Si, Shengyun and Wu, Jingpei and Torr, Philip and Tresp, Volker and Gu, Jindong},
  journal={arXiv preprint arXiv:2510.11570},
  year={2025}
}
```