This website showcases our work on π-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents. The website is adapted from the Nerfies website.
π-Teaming is a scalable framework that systematically explores how seemingly harmless interactions with language models can escalate into harmful outcomes. Our framework achieves state-of-the-art multi-turn jailbreak effectiveness, with attack success rates of up to 98.1% across leading models.
- Multi-turn jailbreak framework with adaptive multi-agents
- State-of-the-art attack success rates on various models
- XGuard-Train: A comprehensive safety training dataset
- Systematic approach to understanding and mitigating conversational attacks
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
