Xueyang Zhou, Weidong Wang, Lin Lu, Jiawen Shi, Guiyao Tie, Yongtian Xu, Lixing Chen, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as digital assistants, autonomous customer service, and decision-support systems, where their ability to interact in multi-turn, tool-augmented environments makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions, enabling precise modeling of safety risks across diverse scenarios; and 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset, eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment.
- [2025/05/26] 🤖 We open-sourced AutoSafe! Check out the full codebase and docs in this repository. We’ll continue improving the project with new ideas and updates, so feel free to follow and ⭐️ Star us to stay in the loop!
- [2025/05/26] 🌐 The official AutoSafe website is released: website.
- [2025/05/26] 🎉 Our paper on AutoSafe has been released on arXiv. Read the preprint here: paper.
Before getting started, make sure you have Python 3.8+ installed.
Steps to install:

- Clone the repository:

```bash
git clone xxx
cd xxx
```

- Install the project in editable mode:

```bash
pip install -e .
```

After installation, you need to configure your API key and base URL for the large language model provider you plan to use. To do this, add the following environment variables:

```bash
LLM_API_KEY=[YOUR_API_KEY]
LLM_API_BASE_URL=[YOUR_API_BASE_URL]
```

Replace [YOUR_API_KEY] and [YOUR_API_BASE_URL] with the credentials and endpoint provided by your LLM service provider (e.g., OpenAI, Claude, Azure OpenAI, etc.).
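For reference, the scripts presumably read these variables from the process environment. Below is a minimal sketch assuming an OpenAI-compatible client; the client class and the way AutoSafe actually loads the credentials may differ.

```python
import os

from openai import OpenAI  # assumption: any OpenAI-compatible client would work similarly

# Read the credentials configured above; fail early if they are missing.
api_key = os.environ["LLM_API_KEY"]
base_url = os.environ["LLM_API_BASE_URL"]

# Placeholder client setup; AutoSafe's scripts may construct their client differently.
client = OpenAI(api_key=api_key, base_url=base_url)
```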
This section describes how to generate synthetic user cases and risky trajectories for training or evaluation.
```bash
cd ./gen_cases
```

Open and configure run_cases.py:
- Set `model_name` to specify the LLM used for instruction generation.
- Modify `available_toolkits` to define the set of tools and risk combinations.
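As an illustration only, the relevant settings in run_cases.py might look roughly like the sketch below; the model name and the toolkit/risk values are placeholders rather than values taken from the repository.

```python
# Hypothetical sketch of the settings in run_cases.py; adapt to the script's actual structure.
model_name = "gpt-4o"  # LLM used to generate user instructions (placeholder)

# Toolkits paired with the risk types to probe; the exact format the script expects may differ.
available_toolkits = [
    {"toolkit": "Terminal", "risks": ["data_loss", "privacy_leak"]},
    {"toolkit": "EmailSender", "risks": ["phishing", "privacy_leak"]},
]
```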
Run the script to generate user cases:
```bash
python run_cases.py
```

The generated cases will be saved under the ./cases directory.
(Optional) Clean meaningless or low-quality cases using:
```bash
python clean_cases.py
```

- Configure `agent_model_name` and `env_model_name` in run_risky_snapshot.py.
- Make sure to specify the correct path to the LLaMA-3.1-70B-Instruct model.
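For orientation, a hypothetical version of these settings in run_risky_snapshot.py could look like this; the variable holding the model path and all values are placeholders.

```python
# Hypothetical settings for run_risky_snapshot.py; names follow the README, values are placeholders.
agent_model_name = "Llama-3.1-70B-Instruct"           # model that acts as the agent
env_model_name = "Llama-3.1-70B-Instruct"             # model that simulates the environment
model_path = "/path/to/Meta-Llama-3.1-70B-Instruct"   # local LLaMA-3.1-70B-Instruct weights (placeholder variable name)
```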
Run the script:
```bash
python run_risky_snapshot.py
```

The generated risky interaction snapshots will be saved under the ./snapshots directory.
This step extracts safe actions by allowing the agent to interact with the environment and reflect on past risky trajectories.
Open and configure ./reflection.py:
- `dataset_path`: Path to the previously generated snapshots.json.
- `model_names`: Specify the name(s) of the LLMs to be used as the agent's engine for environment interaction and safety reflection.
- `model_paths`: Set the corresponding file paths for the selected models.
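A hypothetical sketch of these settings in reflection.py (all paths and model names below are placeholders):

```python
# Hypothetical configuration for reflection.py; adjust the paths and model names to your setup.
dataset_path = "./snapshots/snapshots.json"            # risky snapshots from the previous step (placeholder path)
model_names = ["Llama-3.1-8B-Instruct"]                # LLM(s) driving interaction and safety reflection
model_paths = ["/path/to/Meta-Llama-3.1-8B-Instruct"]  # local weight paths, one per entry in model_names
```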
Run the script:
```bash
python reflection.py
```

The sampled safe actions will be saved under ./data/train/reflection/.
This stage involves preprocessing safety-related data and training the agent using LLaMA-Factory to enhance its safe decision-making capabilities.
Open ./processing.py and configure the call to `get_agent_safety_train_data()`:
- `safety_data_path`: Set this to the path where the safety actions from reflection.py were saved.
- `train_data_path`: Specify the path where the processed safety training data will be stored.
Run the script:
```bash
python processing.py get_agent_safety_train_data()
```

Then configure the call to `get_agent_train_data()`:
- `pretrain_data_path`: Path to a pre-sampled dataset for balancing effectiveness and safety (we provide this under a specific folder).
- `safety_data_path`: Path from the previous step.
- `train_data_path`: Output path for the combined training data.
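If you prefer to call the functions directly from Python rather than through the command line, a hypothetical invocation could look like the sketch below; all paths except ./train_agent_data.json (used in the training step later) are placeholders, and the real signatures in processing.py may differ.

```python
# Hypothetical direct calls into processing.py; argument names mirror the README,
# but the actual function signatures may differ.
from processing import get_agent_safety_train_data, get_agent_train_data

get_agent_safety_train_data(
    safety_data_path="./data/train/reflection/",        # safe actions produced by reflection.py
    train_data_path="./data/train/safety_train.json",   # processed safety training data (placeholder)
)

get_agent_train_data(
    pretrain_data_path="./data/pretrain_sampled.json",  # pre-sampled data for effectiveness/safety balance (placeholder)
    safety_data_path="./data/train/safety_train.json",  # output of the previous call
    train_data_path="./train_agent_data.json",          # combined dataset consumed by LLaMA-Factory below
)
```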
Run the script:
```bash
python processing.py get_agent_train_data()
```

We use the LLaMA-Factory project to train the model.
- Move the generated training dataset to the LLaMA-Factory data directory:
```bash
mv ./train_agent_data.json ../LLaMA-Factory-main/data/
cd ../LLaMA-Factory-main/data/
```

- Edit the dataset_info.json file to include:
"agent_model": {
"file_name": "train_agent_data.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output"
}
}- Go to the training configuration folder:
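With this column mapping, each record in train_agent_data.json is expected to provide instruction, input, and output fields. A purely illustrative record (content invented for the example) might look like:

```json
{
  "instruction": "You are an agent with access to the Terminal toolkit. Complete the user's request while avoiding unsafe actions.",
  "input": "User: Free up disk space by deleting old files under /var/log.",
  "output": "Thought: Deleting system logs indiscriminately is risky. Action: list the log files first and ask the user to confirm which ones may be removed."
}
```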
- Go to the training configuration folder:

```bash
cd ../LLaMA-Factory-main/src/
```

- Configure training arguments in args.py, then start training:
```bash
python args.py
CUDA_VISIBLE_DEVICES=0,1,2 llamafactory-cli train train.json
```
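For orientation, a LoRA SFT configuration that trains on the `agent_model` dataset registered above might look roughly like the following. This is an assumption based on LLaMA-Factory's example configs (args.py presumably generates something similar); check your LLaMA-Factory version for the exact keys.

```json
{
  "model_name_or_path": "/path/to/base-model",
  "stage": "sft",
  "do_train": true,
  "finetuning_type": "lora",
  "lora_target": "all",
  "dataset": "agent_model",
  "template": "llama3",
  "output_dir": "./saves/agent-lora",
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8,
  "learning_rate": 1e-4,
  "num_train_epochs": 3.0,
  "bf16": true
}
```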
This stage evaluates the safety and effectiveness of the trained agent model.

Open ./run.py and configure the following parameters:
- `--agent_model_name`: Your model name.
- `--model_path`: Path to the base model.
- `--adapter_model_path`: Path to the trained LoRA adapter.
- `--dataset_path`: Path to the test dataset.
- `--save_path`: Path to save the sampled actions.
Run the sampling script:
```bash
python run.py
```
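With the parameters above filled in, a full sampling command might look like this (the model name and paths are placeholders):

```bash
python run.py \
  --agent_model_name my-agent-model \
  --model_path /path/to/base-model \
  --adapter_model_path /path/to/lora-adapter \
  --dataset_path ./data/test/cases.json \
  --save_path ./results/sampled_actions.json
```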
Open ./evaluation.py and configure the evaluation function `eval_action_with_model()`:

- `case_action_data_path`: Path to the sampled actions (the output of the sampling step above).
- `model_name`: Your model name.
- `num_samples`: Number of samples for evaluation.
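A hypothetical configuration of that call (the values are placeholders and the actual signature in evaluation.py may differ):

```python
# Hypothetical call to evaluation.py's eval_action_with_model(); parameter names follow
# the README, values are placeholders.
from evaluation import eval_action_with_model

eval_action_with_model(
    case_action_data_path="./results/sampled_actions.json",  # sampled actions saved by run.py
    model_name="gpt-4o",                                      # model used for the evaluation (placeholder)
    num_samples=100,                                          # number of samples to evaluate
)
```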
Run the evaluation:
```bash
python evaluation.py eval_action_with_model()
```

If you find AutoSafe useful, please consider citing our paper:

```bibtex
@misc{zhou2025automatingsafetyenhancementllmbased,
  title={Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios},
  author={Xueyang Zhou and Weidong Wang and Lin Lu and Jiawen Shi and Guiyao Tie and Yongtian Xu and Lixing Chen and Pan Zhou and Neil Zhenqiang Gong and Lichao Sun},
  year={2025},
  eprint={2505.17735},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.17735},
}
```
