Xueyang Zhou, Weidong Wang, Lin Lu, Jiawen Shi, Guiyao Tie, Yongtian Xu, Lixing Chen, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as digital assistants, autonomous customer service, and decision-support systems, where their ability to interact in multi-turn, tool-augmented environments makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions, enabling precise modeling of safety risks across diverse scenarios; and 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset, eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment.
- [2025/05/26] 🤖 We open-sourced AutoSafe! Check out the full codebase and docs in this repository. We’ll continue improving the project with new ideas and updates, so feel free to follow and ⭐️ Star us to stay in the loop!
- [2025/05/26] 🌐 The official AutoSafe website is released: website.
- [2025/05/26] 🎉 Our paper on AutoSafe has been released on arXiv. Read the preprint here: paper.
Before getting started, make sure you have Python 3.8+ installed.
Steps to install:

- Clone the repository:

```bash
git clone xxx
cd xxx
```

- Install the project in editable mode:

```bash
pip install -e .
```

After installation, you need to configure your API key and base URL for the large language model provider you plan to use. To do this, add the following environment variables:

```bash
LLM_API_KEY=[YOUR_API_KEY]
LLM_API_BASE_URL=[YOUR_API_BASE_URL]
```

Replace [YOUR_API_KEY] and [YOUR_API_BASE_URL] with the credentials and endpoint provided by your LLM service provider (e.g., OpenAI, Claude, Azure OpenAI, etc.).
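For reference, the scripts presumably read these variables from the process environment. Below is a minimal sketch assuming an OpenAI-compatible client; the client class and the way AutoSafe actually loads the credentials may differ.

```python
import os

from openai import OpenAI  # assumption: any OpenAI-compatible client would work similarly

# Read the credentials configured above; fail early if they are missing.
api_key = os.environ["LLM_API_KEY"]
base_url = os.environ["LLM_API_BASE_URL"]

# Placeholder client setup; AutoSafe's scripts may construct their client differently.
client = OpenAI(api_key=api_key, base_url=base_url)
```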
This section describes how to generate synthetic user cases and risky trajectories for training or evaluation.
```bash
cd ./gen_cases
```

Open and configure run_cases.py:
- Set `model_name` to specify the LLM used for instruction generation.
- Modify `available_toolkits` to define the set of tools and risk combinations.
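As an illustration only, the relevant settings in run_cases.py might look roughly like the sketch below; the model name and the toolkit/risk values are placeholders rather than values taken from the repository.

```python
# Hypothetical sketch of the settings in run_cases.py; adapt to the script's actual structure.
model_name = "gpt-4o"  # LLM used to generate user instructions (placeholder)

# Toolkits paired with the risk types to probe; the exact format the script expects may differ.
available_toolkits = [
    {"toolkit": "Terminal", "risks": ["data_loss", "privacy_leak"]},
    {"toolkit": "EmailSender", "risks": ["phishing", "privacy_leak"]},
]
```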
Run the script to generate user cases:
```bash
python run_cases.py
```

The generated cases will be saved under the ./cases directory.
(Optional) Clean meaningless or low-quality cases using:
```bash
python clean_cases.py
```

- Configure `agent_model_name` and `env_model_name` in run_risky_snapshot.py.
- Make sure to specify the correct path to the LLaMA-3.1-70B-Instruct model.
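For orientation, a hypothetical version of these settings in run_risky_snapshot.py could look like this; the variable holding the model path and all values are placeholders.

```python
# Hypothetical settings for run_risky_snapshot.py; names follow the README, values are placeholders.
agent_model_name = "Llama-3.1-70B-Instruct"           # model that acts as the agent
env_model_name = "Llama-3.1-70B-Instruct"             # model that simulates the environment
model_path = "/path/to/Meta-Llama-3.1-70B-Instruct"   # local LLaMA-3.1-70B-Instruct weights (placeholder variable name)
```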
Run the script:
```bash
python run_risky_snapshot.py
```

The generated risky interaction snapshots will be saved under the ./snapshots directory.
This step extracts safe actions by allowing the agent to interact with the environment and reflect on past risky trajectories.
Open and configure ./reflection.py:
- `dataset_path`: Path to the previously generated snapshots.json.
- `model_names`: Specify the name(s) of the LLMs to be used as the agent's engine for environment interaction and safety reflection.
- `model_paths`: Set the corresponding file paths for the selected models.
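A hypothetical sketch of these settings in reflection.py (all paths and model names below are placeholders):

```python
# Hypothetical configuration for reflection.py; adjust the paths and model names to your setup.
dataset_path = "./snapshots/snapshots.json"            # risky snapshots from the previous step (placeholder path)
model_names = ["Llama-3.1-8B-Instruct"]                # LLM(s) driving interaction and safety reflection
model_paths = ["/path/to/Meta-Llama-3.1-8B-Instruct"]  # local weight paths, one per entry in model_names
```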
Run the script:
```bash
python reflection.py
```

The sampled safe actions will be saved under ./data/train/reflection/.
This stage involves preprocessing safety-related data and training the agent using LLaMA-Factory to enhance its safe decision-making capabilities.
Open ./processing.py and configure the call to `get_agent_safety_train_data()`:
- `safety_data_path`: Set this to the path where the safety actions from reflection.py were saved.
- `train_data_path`: Specify the path where the processed safety training data will be stored.
Run the script:
```bash
python processing.py get_agent_safety_train_data()
```

Then configure the call to `get_agent_train_data()`:
- `pretrain_data_path`: Path to a pre-sampled dataset for balancing effectiveness and safety (we provide this under a specific folder).
- `safety_data_path`: Path from the previous step.
- `train_data_path`: Output path for the combined training data.
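If you prefer to call the functions directly from Python rather than through the command line, a hypothetical invocation could look like the sketch below; all paths except ./train_agent_data.json (used in the training step later) are placeholders, and the real signatures in processing.py may differ.

```python
# Hypothetical direct calls into processing.py; argument names mirror the README,
# but the actual function signatures may differ.
from processing import get_agent_safety_train_data, get_agent_train_data

get_agent_safety_train_data(
    safety_data_path="./data/train/reflection/",        # safe actions produced by reflection.py
    train_data_path="./data/train/safety_train.json",   # processed safety training data (placeholder)
)

get_agent_train_data(
    pretrain_data_path="./data/pretrain_sampled.json",  # pre-sampled data for effectiveness/safety balance (placeholder)
    safety_data_path="./data/train/safety_train.json",  # output of the previous call
    train_data_path="./train_agent_data.json",          # combined dataset consumed by LLaMA-Factory below
)
```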
Run the script:
```bash
python processing.py get_agent_train_data()
```

We use the LLaMA-Factory project to train the model.
- Move the generated training dataset to the LLaMA-Factory data directory:
```bash
mv ./train_agent_data.json ../LLaMA-Factory-main/data/
cd ../LLaMA-Factory-main/data/
```

- Edit the dataset_info.json file to include:
"agent_model": {
"file_name": "train_agent_data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output"
}
}- Go to the training configuration folder:
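With this column mapping, each record in train_agent_data.json is expected to provide instruction, input, and output fields. A purely illustrative record (content invented for the example) might look like:

```json
{
  "instruction": "You are an agent with access to the Terminal toolkit. Complete the user's request while avoiding unsafe actions.",
  "input": "User: Free up disk space by deleting old files under /var/log.",
  "output": "Thought: Deleting system logs indiscriminately is risky. Action: list the log files first and ask the user to confirm which ones may be removed."
}
```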
- Go to the training configuration folder:

```bash
cd ../LLaMA-Factory-main/src/
```

- Configure training arguments in args.py, then start training:
```bash
python args.py
CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train train.json
```
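For orientation, a LoRA SFT configuration that trains on the `agent_model` dataset registered above might look roughly like the following. This is an assumption based on LLaMA-Factory's example configs (args.py presumably generates something similar); check your LLaMA-Factory version for the exact keys.

```json
{
  "model_name_or_path": "/path/to/base-model",
  "stage": "sft",
  "do_train": true,
  "finetuning_type": "lora",
  "lora_target": "all",
  "dataset": "agent_model",
  "template": "llama3",
  "output_dir": "./saves/agent-lora",
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8,
  "learning_rate": 1e-4,
  "num_train_epochs": 3.0,
  "bf16": true
}
```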
This stage evaluates the safety and effectiveness of the trained agent model.

Open ./run.py and configure the following parameters:
- `--agent_model_name`: Your model name.
- `--model_path`: Path to the base model.
- `--adapter_model_path`: Path to the trained LoRA adapter.
- `--dataset_path`: Path to the test dataset.
- `--save_path`: Path to save the sampled actions.
Run the sampling script:
```bash
python run.py
```
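With the parameters above filled in, a full sampling command might look like this (the model name and paths are placeholders):

```bash
python run.py \
  --agent_model_name my-agent-model \
  --model_path /path/to/base-model \
  --adapter_model_path /path/to/lora-adapter \
  --dataset_path ./data/test/cases.json \
  --save_path ./results/sampled_actions.json
```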
Open ./evaluation.py and configure the evaluation function `eval_action_with_model()`:

- `case_action_data_path`: Path to the sampled actions (the output of the sampling step above).
- `model_name`: Your model name.
- `num_samples`: Number of samples for evaluation.
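A hypothetical configuration of that call (the values are placeholders and the actual signature in evaluation.py may differ):

```python
# Hypothetical call to evaluation.py's eval_action_with_model(); parameter names follow
# the README, values are placeholders.
from evaluation import eval_action_with_model

eval_action_with_model(
    case_action_data_path="./results/sampled_actions.json",  # sampled actions saved by run.py
    model_name="gpt-4o",                                      # model used for the evaluation (placeholder)
    num_samples=100,                                          # number of samples to evaluate
)
```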
Run the evaluation:
```bash
python evaluation.py eval_action_with_model()
```

If you find AutoSafe useful, please consider citing our paper:

```bibtex
@misc{zhou2025automatingsafetyenhancementllmbased,
  title={Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios},
  author={Xueyang Zhou and Weidong Wang and Lin Lu and Jiawen Shi and Guiyao Tie and Yongtian Xu and Lixing Chen and Pan Zhou and Neil Zhenqiang Gong and Lichao Sun},
  year={2025},
  eprint={2505.17735},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.17735},
}
```
