ICML 2025
Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
Emoji Attack uses in-context learning to make target LLMs insert emojis into their outputs; this disrupts the Judge LLM's tokenization so that harmful outputs appear safe.
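At a high level, the attack exploits the fact that inserting a character inside a word changes its subword segmentation. A minimal sketch of the idea (the insert_emoji helper and its fixed character-spacing rule are illustrative assumptions, not the paper's exact insertion procedure):

```python
# Hypothetical sketch: splitting words with emojis changes how a tokenizer
# segments them, which can change how a Judge LLM scores the text.
def insert_emoji(text: str, emoji: str = "\U0001F600", every: int = 2) -> str:
    """Insert `emoji` after every `every` characters of each word."""
    words = []
    for word in text.split(" "):
        chunks = [word[i:i + every] for i in range(0, len(word), every)]
        words.append(emoji.join(chunks))
    return " ".join(words)

print(insert_emoji("dangerous instructions"))
```

The paper's method selects insertion positions more carefully; this sketch only shows why a word no longer maps to its usual tokens once emojis are inserted.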
If you use our method in your research, please consider citing:
@inproceedings{
wei2025emoji,
title={Emoji Attack: Enhancing Jailbreak Attacks Against Judge {LLM} Detection},
author={Zhipeng Wei and Yuqi Liu and N. Benjamin Erichson},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=Q0rKYiVEZq}
}
conda create --name EmojiAttack --file requirements.txt
- We released a Colab demo demonstrating white-box emoji attacks on judge LLMs.
We use the en word list from the List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words repository, excluding the final emoji entry.
We implement this dataset with the HarmfulResponseDataset class in our_datasets.py, which gathers harmful responses from three sources:
- harmful_string in AdvBench.
- labeled_responses in Self-Defense.
- We select the responses manually labeled as harmful.
- Red Teaming Data in Anthropics.
- We sort items by their 'min_harmlessness_score_transcript' and 'rating', and select the top-n responses.
- Download it from the link and place it in the datasets folder.
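The top-n selection over the Red Teaming Data can be sketched as follows. The field names come from the description above; the sort direction and tie-breaking are assumptions for illustration, not the repository's exact logic:

```python
# Hypothetical sketch of selecting top-n red-teaming items. We assume lower
# min_harmlessness_score_transcript means more harmful, with higher rating
# used as a tie-breaker.
def select_top_n(items, n):
    ranked = sorted(
        items,
        key=lambda x: (x["min_harmlessness_score_transcript"], -x["rating"]),
    )
    return ranked[:n]

data = [
    {"min_harmlessness_score_transcript": -1.2, "rating": 4.0, "response": "a"},
    {"min_harmlessness_score_transcript": -3.5, "rating": 2.0, "response": "b"},
    {"min_harmlessness_score_transcript": -3.5, "rating": 5.0, "response": "c"},
]
print([x["response"] for x in select_top_n(data, 2)])  # most harmful first
```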
We utilize prompts from the EasyJailbreak benchmark and refer to their experimental results under different attacks in EasyJailbraking-Results.
We implement 10 Judge LLMs in the judge directory: Llama Guard, Llama Guard 2, ShieldLM, WildGuard, GPT-3.5, Gemini, Claude, DeepSeek, o3-mini, and GPT-4.
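For the API-based judges, a prediction call reduces to formatting a classification prompt and parsing the verdict. A simplified sketch (the template and parsing below are illustrative assumptions; the repository's judge implementations define each model's real prompt format):

```python
# Simplified Judge-LLM wrapper sketch -- not the repository's actual code.
JUDGE_TEMPLATE = (
    "You are a safety classifier. Reply 'unsafe' if the response below is "
    "harmful, otherwise 'safe'.\n\nResponse: {response}"
)

def build_judge_prompt(response: str) -> str:
    return JUDGE_TEMPLATE.format(response=response)

def parse_verdict(judge_output: str) -> bool:
    """Return True if the judge flags the response as harmful."""
    return "unsafe" in judge_output.lower()

print(parse_verdict("UNSAFE: category S1"))  # True
```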
Before running experiments, users should configure the variables in env.sh:
- API keys of commercial LLMs: OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY
- Save path of open-source LLMs: CACHE_DIR, HUGGINGFACE_TOKEN, LLAMA_GUARD_PATH (Llama Guard must first be converted to Hugging Face format, so it requires a separate path):
  python convert_llama_weights_to_hf.py --input_dir [llama_guard_path] --output_dir [LLAMA_GUARD_PATH] --model_size 7B
- Lightweight surrogate model sentence-transformers/gtr-t5-xl: Surrogate_Model_PATH
- Unaligned Mistral-7B-Instruct-v0.1: CACHEDIR, MyToken
- Output path: OUTPUT_BASE_PATH
Then run source env.sh
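A minimal env.sh sketch with placeholder values (variable names are taken from the list above; substitute your own keys and paths):

```shell
# Sketch of env.sh -- all values below are placeholders.
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export DEEPSEEK_API_KEY="..."
export CACHE_DIR="/path/to/hf_cache"
export HUGGINGFACE_TOKEN="hf_..."
export LLAMA_GUARD_PATH="/path/to/llama_guard_hf"
export Surrogate_Model_PATH="/path/to/gtr-t5-xl"
export CACHEDIR="/path/to/mistral_cache"
export MyToken="hf_..."
export OUTPUT_BASE_PATH="/path/to/outputs"
```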
Following Sec. 5.1 of the paper "PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails", we obtain responses to benign prompts from the unaligned Mistral-7B-Instruct-v0.1, which serve as in-context learning examples.
python InContext_ObtainOutputs.py
Alternatively, the generated responses can be directly accessed from ./in-context-data/safe_responses.json.
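Turning a pre-generated benign response into a one-shot emoji-insertion demonstration can be sketched as below. We assume safe_responses.json holds a JSON list of response strings (the real file layout may differ), and the per-character insertion is a crude stand-in for the actual demonstration format:

```python
import json

# Hypothetical sketch: load benign responses and build a one-shot
# emoji-insertion demonstration for in-context learning.
def build_demo(raw_json: str, emoji: str = "\U0001F600") -> str:
    responses = json.loads(raw_json)  # assumed: a list of strings
    first = responses[0]
    # insert an emoji between every character of the demonstration response
    return emoji.join(first)

toy = '["Sure, here is a cookie recipe."]'
demo = build_demo(toy)
```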
Generate outputs using jailbreaking prompts, with or without the one-shot emoji insertion example, and these outputs will be saved to OUTPUT_BASE_PATH:
python BlackAttack_NormalOutput.py
python BlackAttack_EmojiOutput.py
Leverage Judge LLMs to predict:
python BlackAttack_JudgePred_Normal.py
python BlackAttack_JudgePred_Emoji.py
- Modify the inserted emoji in the Chatgpt_Instruction function within BlackAttack_EmojiOutput.py.
- Replace process_files in Line 46 of BlackAttack_EmojiOutput.py.
Table 3 [White-box Setting]
python Whitebox_Emojis.py
Table 4 (Compared with GCG)
python gcg_exp_targetLLM.py
python gcg_exp_JudgeLLM.py
Figure 2
python offensive_words_judge.py
Figure 3
python offensive_words_probability.py
