EnchTable is a unified framework for transferring safety alignment to fine-tuned large language models without extensive retraining. It combines NTK-based safety vector distillation to extract safety knowledge, and an interference-aware merging strategy to preserve both safety and utility. Evaluated across diverse models and tasks, EnchTable effectively mitigates safety degradation during fine-tuning, maintains high task performance, and shows strong robustness against jailbreak attacks.
We provide 5 model checkpoints aligned via EnchTable across Bio-Medical and Code domains.
| Domain | Method | Checkpoint |
|---|---|---|
| Bio-Medical | Attention | |
| Bio-Medical | FFN | |
| Bio-Medical | Hybrid (Attn+FFN) | |
| Code | FFN | |
| Code | Hybrid (Attn+FFN) |
You can build the required environment by running:
pip install -r requirements.txtThe entire workflow consists of two main stages:
- Safety Distillation
- Merge
For safety distillation, we modify the codebase of LLaMA Factory to implement NTK-constrained fine-tuning. This step aims to extract safety vector from surrogate LLM.
β οΈ Make sure you have properly configured LLaMA Factory before proceeding.
To run the default distillation configuration, simply execute:
bash ./safety_distillation/LLaMA-Factory/train.shAfter training, the harmful surrogate model will be saved into ./safety_distillation/LLaMA-Factory/saves/llama3-8b-beavertail_harmful/attention/sft_ntk_linear_e4
We implement both baseline merging strategies and our proposed interference-aware merging method in ./merge/merge.py.
To run the default merging configuration, simply execute:
bash ./run.shThis script runs the merging process with default hyperparameters (e.g., ./merge/merged_models/Code-Llama-3-8B_aligned.
We evaluate the merged model using multiple benchmarks across different domains:
- Code generation: Evaluated using EvalPlus
- Math reasoning: Evaluated using math-eval-harness
- Medical/general safety: Evaluated using lm-harness
β οΈ Before running the evaluations, please install the required evaluation packages for each benchmark. Below are the installation instructions:
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
git clone https://github.com/ZubinGou/math-evaluation-harness.git
cd math-evaluation-harness
pip install -r requirements.txt
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .To evaluate the safety of the merged model, follow these steps:
bash eval/scripts/generate_other.sh # Generates responses for general safety evaluation
bash eval/scripts/generate_salad.sh # Generates responses for SALADBench safety evaluationbash eval/scripts/judge_other.sh # Judges general safety
bash eval/scripts/judge_salad.sh # Judges SALADBench safetyThese scripts will output metrics such as Unsafe Rate, which reflect how well the merged model adheres to safety guidelines.
To evaluate the utility (task performance) of the merged model:
bash eval/scripts/evalplus.sh # Evaluates code generation capabilitybash eval/math-evaluation-harness/eval.sh # Evaluates math reasoning capabilitybash eval/scripts/evalharness.sh # Evaluates reasoning capability on medical tasksThese scripts report standard metrics: accuracy.
To comprehensively assess the safety robustness of models under adversarial prompting, we employ two widely used benchmarks: SorryBench and AISafetyLab. These tools simulate real-world "jailbreak" attacks, allowing us to evaluate the effectiveness of our safety alignment framework in extreme scenarios.
cd robustness/sorry-bench
bash generate_answer.sh # Generate model responses
bash judge.sh # Use built-in judge model to detect jailbreak success
β οΈ Note: You need to download the judge model provided by SorryBench and ensure its path is correctly configured.
git clone https://github.com/thu-coai/AISafetyLab.git
cd AISafetyLab
pip install -e .attack_data_path: 'thu-coai/AISafetyLab_Datasets/harmbench_standard'
target_model_path: "YOUR_PATH"
target_model_name: "llama3"
demo_num: 1
lang: en
evaluator_type: "pattern"
evaluator_model_path: "meta-llama/Llama-Guard-3-8B"
detoxify_model_path: "./detoxify_model/toxic_original-c1212f89.ckpt"
detoxify_config_path: "./model_bert"
res_save_path: './results/dra_Code-Llama-3-8B_harmbench_50.jsonl'
device: "cuda:2"
iters: 20
em_t: 0.7attack_data_path: "thu-coai/AISafetyLab_Datasets/harmbench_standard"
target_model_path: "ajibawa-2023/Code-Llama-3-8B"
target_model_name: "llama3"
demo_num: 1
lang: en
evaluator_type: "pattern"
evaluator_path: "meta-llama/Llama-Guard-3-8B"
res_save_path: "./results/ica_Code-Llama-3-8B_harmbench_1shot_PatternScore.jsonl"
device: "cuda:0"python run_attack.py --config configs/dra.yamlIf you find EnchTable useful in your research, please cite our paper:
@article{wu2025enchtable,
title={EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models},
author={Wu, Jialin and Li, Kecen and Huang, Zhicong and Li, Xinfeng and Wang, Xiaofeng and Hong, Cheng},
journal={arXiv preprint arXiv:2511.09880},
year={2025}
}For questions, collaboration, or feedback, feel free to reach out:
π§ jinlin.wjl@antgroup.com or wjlinzju@gmail.com
We welcome contributions and discussions!