Refusal Direction is Universal Across Safety-Aligned Languages

This repository contains the code and dataset for the paper "Refusal Direction is Universal Across Safety-Aligned Languages".

PolyRefuse Dataset

The PolyRefuse dataset is a multilingual safety evaluation dataset covering 14 languages, identified by their ISO 639-1 codes: ar, de, en, es, fr, it, ja, ko, nl, pl, ru, th, zh, yo.

You can find the dataset in the PolyRefuse/ directory, which contains:

Harmful prompts (train/val/test splits) translated to all languages
Harmless prompts (train/val/test splits) translated to all languages
Back-translated versions for analysis

Setup

Installation

source setup.sh

Install the evaluation harness from source

cd lm-evaluation-harness
pip install -e .

Usage

Running Experiments

Refusal Vector Ablation

# Configure your experiment settings in configs/cfg.yaml
python -m pipeline.run_pipeline --config configs/cfg.yaml

For example, to run the experiment on the Qwen2.5-7B-Instruct model in Japanese, use the following settings:

python -m pipeline.run_pipeline --config runs/Qwen2.5-7B-Instruct/ja/ja.yaml
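As background for what "refusal vector ablation" means, the following is a minimal NumPy sketch on toy data. It assumes the refusal direction is computed as a difference of mean activations between harmful and harmless prompts and then projected out of the activations, which is a common formulation. The function names `refusal_direction` and `ablate_direction` and the random arrays are hypothetical illustrations, not the code in pipeline/, which may differ in detail.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-length difference of the mean harmful and mean harmless activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Subtract from each activation its projection onto `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)


# Toy stand-ins for residual-stream activations: 8 prompts, 16 dimensions.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 16)) + 2.0
harmless = rng.normal(size=(8, 16))

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)

# After ablation, no activation has a component along the refusal direction.
print(np.allclose(ablated @ r_hat, 0.0))
```

In a real model the same projection would be applied to hidden states (or folded into the weights) rather than to random arrays; the sketch only shows the linear algebra.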

Evaluating the model on multiple languages

# For running multiple language evaluation configurations
python -m scripts.multi_test --config configs/cfg.yaml

For example, to evaluate the Qwen2.5-7B-Instruct model in Korean after ablating the refusal direction extracted from Japanese, use the following settings:

python -m scripts.multi_test --config output/ja_vector_sweep/Qwen/Qwen2.5-7B-Instruct/ko/20250519-232436/1/ko.yaml
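To repeat this evaluation for every PolyRefuse language, one option is a small driver that builds one command per language. The sketch below only constructs and prints the commands. The helper `build_eval_commands` and the path template `my_configs/{lang}/{lang}.yaml` are hypothetical, so substitute the locations of your own config files.

```python
import shlex

# The 14 language codes listed for the PolyRefuse dataset.
LANGUAGES = ("ar", "de", "en", "es", "fr", "it", "ja",
             "ko", "nl", "pl", "ru", "th", "zh", "yo")


def build_eval_commands(config_template, languages=LANGUAGES):
    """Return one scripts.multi_test invocation per language.

    `config_template` is a hypothetical path pattern containing `{lang}`.
    """
    commands = []
    for lang in languages:
        config_path = config_template.format(lang=lang)
        commands.append(
            ["python", "-m", "scripts.multi_test", "--config", config_path]
        )
    return commands


commands = build_eval_commands("my_configs/{lang}/{lang}.yaml")
for cmd in commands:
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell-quoting problems if a config path contains spaces.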

Repository Structure

.
├── PolyRefuse/              # Multilingual safety dataset
├── configs/                 # Configuration files
├── dataset/                 # Dataset loading and processing
├── evaluators/              # Safety evaluators
├── pipeline/                # Main experimental pipeline
│   ├── model_utils/        # Model implementations
│   ├── submodules/         # Pipeline components
│   └── run_pipeline.py     # Main pipeline runner
├── scripts/                 # Utility scripts and experiments
├── utils/                   # Helper utilities
└── requirements.txt        # Python dependencies

Citation

If you use this code or dataset, please cite our paper:

@inproceedings{
wang2025refusal,
title={Refusal Direction is Universal Across Safety-Aligned Languages},
author={Xinpeng Wang and Mingyang Wang and Yihong Liu and Hinrich Schuetze and Barbara Plank},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=eWxKpdAdXH}
}

License

See LICENSE for details.

Baseline vs English Refusal Vector Ablation

[Figure not recovered. The repository includes the file baseline_vs_harm_ablation.pdf.]
