ICML 2025
Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
Emoji Attack uses in-context learning to make target LLMs insert emojis into their outputs; this disrupts the Judge LLM's tokenization so that harmful outputs appear safe.
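At a high level, the attack exploits the fact that inserting a character inside a word changes its subword segmentation. A minimal sketch of the idea (the insert_emoji helper and its fixed character-spacing rule are illustrative assumptions, not the paper's exact insertion procedure):

```python
# Hypothetical sketch: splitting words with emojis changes how a tokenizer
# segments them, which can change how a Judge LLM scores the text.
def insert_emoji(text: str, emoji: str = "\U0001F600", every: int = 2) -> str:
    """Insert `emoji` after every `every` characters of each word."""
    words = []
    for word in text.split(" "):
        chunks = [word[i:i + every] for i in range(0, len(word), every)]
        words.append(emoji.join(chunks))
    return " ".join(words)

print(insert_emoji("dangerous instructions"))
```

The paper's method selects insertion positions more carefully; this sketch only shows why a word no longer maps to its usual tokens once emojis are inserted.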
If you use our method in your research, please consider citing:
@inproceedings{
wei2025emoji,
title={Emoji Attack: Enhancing Jailbreak Attacks Against Judge {LLM} Detection},
author={Zhipeng Wei and Yuqi Liu and N. Benjamin Erichson},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=Q0rKYiVEZq}
}
conda create --name EmojiAttack --file requirements.txt
- We released a Colab demo demonstrating white-box emoji attacks on judge LLMs.
We use the en word list from the List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words repository, excluding the final emoji entry.
We implement this dataset with the HarmfulResponseDataset class in our_datasets.py, which gathers harmful responses from three sources:
- harmful_string in AdvBench.
- labeled_responses in Self-Defense.
- We select the responses manually labeled as harmful.
- Red Teaming Data in Anthropics.
- We sort items by their 'min_harmlessness_score_transcript' and 'rating', and select the top-n responses.
- Download it from the link and place it in the datasets folder.
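The top-n selection over the Red Teaming Data can be sketched as follows. The field names come from the description above; the sort direction and tie-breaking are assumptions for illustration, not the repository's exact logic:

```python
# Hypothetical sketch of selecting top-n red-teaming items. We assume lower
# min_harmlessness_score_transcript means more harmful, with higher rating
# used as a tie-breaker.
def select_top_n(items, n):
    ranked = sorted(
        items,
        key=lambda x: (x["min_harmlessness_score_transcript"], -x["rating"]),
    )
    return ranked[:n]

data = [
    {"min_harmlessness_score_transcript": -1.2, "rating": 4.0, "response": "a"},
    {"min_harmlessness_score_transcript": -3.5, "rating": 2.0, "response": "b"},
    {"min_harmlessness_score_transcript": -3.5, "rating": 5.0, "response": "c"},
]
print([x["response"] for x in select_top_n(data, 2)])  # most harmful first
```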
We utilize prompts from the EasyJailbreak benchmark and refer to their experimental results under different attacks in EasyJailbraking-Results.
We implement 10 Judge LLMs in the judge directory: Llama Guard, Llama Guard 2, ShieldLM, WildGuard, GPT-3.5, Gemini, Claude, DeepSeek, o3-mini, and GPT-4.
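For the API-based judges, a prediction call reduces to formatting a classification prompt and parsing the verdict. A simplified sketch (the template and parsing below are illustrative assumptions; the repository's judge implementations define each model's real prompt format):

```python
# Simplified Judge-LLM wrapper sketch -- not the repository's actual code.
JUDGE_TEMPLATE = (
    "You are a safety classifier. Reply 'unsafe' if the response below is "
    "harmful, otherwise 'safe'.\n\nResponse: {response}"
)

def build_judge_prompt(response: str) -> str:
    return JUDGE_TEMPLATE.format(response=response)

def parse_verdict(judge_output: str) -> bool:
    """Return True if the judge flags the response as harmful."""
    return "unsafe" in judge_output.lower()

print(parse_verdict("UNSAFE: category S1"))  # True
```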
Before running experiments, users should configure the variables in env.sh:
- API keys of commercial LLMs: OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY
- Save path of open-source LLMs: CACHE_DIR, HUGGINGFACE_TOKEN, LLAMA_GUARD_PATH (Llama Guard must first be converted to Hugging Face format, so it requires a separate path):
  python convert_llama_weights_to_hf.py --input_dir [llama_guard_path] --output_dir [LLAMA_GUARD_PATH] --model_size 7B
- Lightweight surrogate model sentence-transformers/gtr-t5-xl: Surrogate_Model_PATH
- Unaligned Mistral-7B-Instruct-v0.1: CACHEDIR, MyToken
- Output path: OUTPUT_BASE_PATH
Then run source env.sh
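A minimal env.sh sketch with placeholder values (variable names are taken from the list above; substitute your own keys and paths):

```shell
# Sketch of env.sh -- all values below are placeholders.
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export DEEPSEEK_API_KEY="..."
export CACHE_DIR="/path/to/hf_cache"
export HUGGINGFACE_TOKEN="hf_..."
export LLAMA_GUARD_PATH="/path/to/llama_guard_hf"
export Surrogate_Model_PATH="/path/to/gtr-t5-xl"
export CACHEDIR="/path/to/mistral_cache"
export MyToken="hf_..."
export OUTPUT_BASE_PATH="/path/to/outputs"
```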
Following Sec. 5.1 of the paper "PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails", we obtain responses to benign prompts from the unaligned Mistral-7B-Instruct-v0.1, which serve as in-context learning examples.
python InContext_ObtainOutputs.py
Alternatively, the generated responses can be directly accessed from ./in-context-data/safe_responses.json.
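Turning a pre-generated benign response into a one-shot emoji-insertion demonstration can be sketched as below. We assume safe_responses.json holds a JSON list of response strings (the real file layout may differ), and the per-character insertion is a crude stand-in for the actual demonstration format:

```python
import json

# Hypothetical sketch: load benign responses and build a one-shot
# emoji-insertion demonstration for in-context learning.
def build_demo(raw_json: str, emoji: str = "\U0001F600") -> str:
    responses = json.loads(raw_json)  # assumed: a list of strings
    first = responses[0]
    # insert an emoji between every character of the demonstration response
    return emoji.join(first)

toy = '["Sure, here is a cookie recipe."]'
demo = build_demo(toy)
```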
Generate outputs using jailbreaking prompts, with or without the one-shot emoji insertion example, and these outputs will be saved to OUTPUT_BASE_PATH:
python BlackAttack_NormalOutput.py
python BlackAttack_EmojiOutput.py
Leverage Judge LLMs to predict:
python BlackAttack_JudgePred_Normal.py
python BlackAttack_JudgePred_Emoji.py
- Modify the inserted emoji in the Chatgpt_Instruction function within BlackAttack_EmojiOutput.py.
- Replace process_files in Line 46 of BlackAttack_EmojiOutput.py.
Table 3 [White-box Setting]
python Whitebox_Emojis.py
Table 4 (Compared with GCG)
python gcg_exp_targetLLM.py
python gcg_exp_JudgeLLM.py
Figure 2
python offensive_words_judge.py
Figure 3
python offensive_words_probability.py
