zhipeng-wei/EmojiAttack

ICML 2025

Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson

Emoji Attack uses in-context learning to insert emojis into a target LLM's outputs, disrupting the Judge LLM's tokenization so that harmful outputs appear safe.
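As a toy illustration of the mechanism (not the paper's exact insertion strategy), the sketch below places an emoji between every few characters of each word, which splits the subword units a tokenizer would otherwise produce:

```python
def insert_emoji(text: str, emoji: str = "\U0001F600", every: int = 2) -> str:
    """Insert `emoji` after every `every` characters within each word,
    breaking the contiguous character runs that tokenizers rely on."""
    words = []
    for word in text.split():
        chunks = [word[i:i + every] for i in range(0, len(word), every)]
        words.append(emoji.join(chunks))
    return " ".join(words)

print(insert_emoji("harmful"))  # -> "ha😀rm😀fu😀l"
```

A word like "harmful", normally one or two subword tokens, becomes several short fragments interleaved with emoji tokens, which is what degrades the Judge LLM's recognition of offensive content.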

Overview

If you use our method in your research, please consider citing:

@inproceedings{
wei2025emoji,
title={Emoji Attack: Enhancing Jailbreak Attacks Against Judge {LLM} Detection},
author={Zhipeng Wei and Yuqi Liu and N. Benjamin Erichson},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=Q0rKYiVEZq}
}

Environment

conda create --name EmojiAttack --file requirements.txt

Update

  • We released a Colab demo demonstrating white-box emoji attacks on judge LLMs.

Dataset

402 Offensive Phrases

We use the en word list from the List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words repository, excluding its final emoji entry.

1,432 Harmful Responses

We build this dataset with the HarmfulResponseDataset class in our_datasets.py, which aggregates harmful responses from three sources.

Jailbreaking Prompts

We use prompts from the EasyJailbreak benchmark and draw on their experimental results under different attacks, collected in EasyJailbraking-Results.

Judge LLMs

We implement 10 Judge LLMs in the judge directory: Llama Guard, Llama Guard 2, ShieldLLM, WildGuard, GPT-3.5, Gemini, Claude, DeepSeek, o3-mini, GPT-4.
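A Judge LLM receives a model response and returns a safety verdict. The sketch below is a generic wrapper of this pattern (the template and function names are illustrative, not the repository's actual prompts):

```python
# Illustrative sketch, NOT the repo's exact judge templates: wrap a
# response in a safety-classification prompt and parse the verdict.
JUDGE_TEMPLATE = (
    "You are a content-safety judge. Classify the following model "
    "response as 'safe' or 'unsafe'.\n\n"
    "Response:\n{response}\n\nVerdict:"
)

def build_judge_prompt(response: str) -> str:
    """Format a response for submission to a Judge LLM."""
    return JUDGE_TEMPLATE.format(response=response)

def is_flagged_unsafe(judge_output: str) -> bool:
    """Return True if the judge's reply labels the response unsafe."""
    return "unsafe" in judge_output.lower()

print(is_flagged_unsafe("Verdict: unsafe"))  # -> True
```

Emoji Attack targets exactly this classification step: the emoji-laden response fills the `{response}` slot, and tokenization disruption biases the verdict toward "safe".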

Experiments

Setting Environment Variable

Before running experiments, users should configure the variables in env.sh:

  • API Key of Commercial LLMs: OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY
  • Save Path of Open-source LLMs: CACHE_DIR, HUGGINGFACE_TOKEN, LLAMA_GUARD_PATH (Llama Guard is converted to Hugging Face format, so it requires a separate path).
    • python convert_llama_weights_to_hf.py --input_dir [llama_guard_path] --output_dir [LLAMA_GUARD_PATH] --model_size 7B
  • Lightweight Surrogate Model sentence-transformers/gtr-t5-xl: Surrogate_Model_PATH
  • Unaligned Mistral-7B-Instruct-v0.1: CACHEDIR, MyToken
  • Output Path: OUTPUT_BASE_PATH

Then run source env.sh.
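A minimal env.sh sketch follows; every value below is a placeholder to be replaced with your own keys and paths (the variable names come from the list above):

```shell
# env.sh -- placeholder values only; substitute your own credentials/paths
export OPENAI_API_KEY="your-openai-key"
export GEMINI_API_KEY="your-gemini-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export DEEPSEEK_API_KEY="your-deepseek-key"

export CACHE_DIR="/path/to/hf/cache"
export HUGGINGFACE_TOKEN="your-hf-token"
export LLAMA_GUARD_PATH="/path/to/llama-guard-hf"   # converted HF checkpoint

export Surrogate_Model_PATH="/path/to/gtr-t5-xl"
export CACHEDIR="/path/to/mistral/cache"            # unaligned Mistral-7B
export MyToken="your-hf-token"
export OUTPUT_BASE_PATH="/path/to/outputs"
```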

Preliminary

Following Section 5.1 of the paper "PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails", we obtain responses to benign prompts from the unaligned Mistral-7B-Instruct-v0.1; these serve as the in-context learning examples.

python InContext_ObtainOutputs.py

Alternatively, the generated responses can be directly accessed from ./in-context-data/safe_responses.json.
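The one-shot in-context setup can be sketched as follows (a simplified illustration with hypothetical function and field names, not the repository's exact prompt construction): a benign response and its emoji-rewritten version form the demonstration, which is prepended to the jailbreaking prompt so the target LLM imitates the emoji style:

```python
# Simplified sketch of one-shot in-context emoji insertion; the wording
# of the demonstration is illustrative, not the repo's actual template.
def build_incontext_prompt(benign_response: str,
                           emoji_response: str,
                           jailbreak_prompt: str) -> str:
    """Prepend a one-shot emoji-rewriting demo to a jailbreaking prompt."""
    demo = (
        "Answer in the style of this example.\n"
        f"Original: {benign_response}\n"
        f"Rewritten: {emoji_response}\n\n"
    )
    return demo + jailbreak_prompt
```

Because the demonstration pairs a normal response with its emoji-interleaved rewrite, the target LLM learns to emit emoji-laden text for the jailbreaking prompt as well.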

Table 1 [Existing Jailbreaking Prompts]

Generate outputs from the jailbreaking prompts, with and without the one-shot emoji-insertion example; the outputs are saved to OUTPUT_BASE_PATH:

python BlackAttack_NormalOutput.py 
python BlackAttack_EmojiOutput.py 

Then use the Judge LLMs to classify the generated outputs:

python BlackAttack_JudgePred_Normal.py
python BlackAttack_JudgePred_Emoji.py

Table 2 [Different Emojis]

Others

Table 3 [White-box Setting]

python Whitebox_Emojis.py

Table 4 [Compared with GCG]

python gcg_exp_targetLLM.py
python gcg_exp_JudgeLLM.py

Figure 2

python offensive_words_judge.py

Figure 3

python offensive_words_probability.py
