# UBSec/AI-Security-Papers

Reading list for my research interest in Generative AI Security.

## Generative AI Security Papers

### Large Language/Reasoning Models Safety

#### Survey | Measurements | Benchmarks

| Paper | Venue | PDF | Code |
| --- | --- | --- | --- |
| The Digital Cybersecurity Expert: How Far Have We Come? | Venue | PDF | - |
| Safety in Large Reasoning Models: A Survey | Venue | PDF | - |
| Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models | Venue | PDF | - |
| From System 1 to System 2: A Survey of Reasoning Large Language Models | Venue | PDF | - |
| Safety at Scale: A Comprehensive Survey of Large Model Safety | Venue | PDF | - |
| Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies | Venue | PDF | - |
| Reasoning Models Don't Always Say What They Think | - | PDF | - |
| The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | Venue | PDF | - |
| SAFECHAIN: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | Venue | PDF | - |
| DeepSeek-R1 Thoughtology: Let's `<think>` about LLM Reasoning | Venue | PDF | - |
| o3-mini vs DeepSeek-R1: Which One is Safer? | Venue | PDF | - |
| Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings | Venue | PDF | - |
| Safety Evaluation of DeepSeek Models in Chinese Contexts | Venue | PDF | - |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Venue | PDF | - |
| Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning | Venue | PDF | - |

#### Attacks

| Paper | Venue | PDF | Code |
| --- | --- | --- | --- |
| OverThink: Slowdown Attacks on Reasoning LLMs | Venue | PDF | - |
| Trading Inference-Time Compute for Adversarial Robustness | Venue | PDF | - |
| A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos | Venue | PDF | - |
| H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models | Venue | PDF | - |
| Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps | Venue | PDF | - |
| ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs | Venue | PDF | - |
| BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack | Venue | PDF | - |
| DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs | Venue | PDF | - |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Venue | PDF | - |
| Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Venue | PDF | - |
| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Venue | PDF | - |
| LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Venue | PDF | - |

#### Defenses

| Paper | Venue | PDF | Code |
| --- | --- | --- | --- |
| STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | Venue | PDF | - |
| RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Venue | PDF | - |
| Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment | Venue | PDF | - |
| Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | Venue | PDF | - |

### Code Generation Security

#### Survey | Measurements | Benchmarks

#### Attacks

#### Defenses

#### Media | Reports | Tools


### AI Agent Security

#### Survey | Measurements | Benchmarks

| Paper | Venue | PDF | Code |
| --- | --- | --- | --- |
| AGENT-SAFETYBENCH: Evaluating the Safety of LLM Agents | Venue | PDF | - |
| A Survey on Trustworthy LLM Agents: Threats and Countermeasures | Venue | PDF | - |
| R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Venue | PDF | - |
| AI Agents Under Threat: A Survey of Key Security Challenges | Venue | PDF | - |
| Emerging Cyber Attack Risks of Medical AI Agents | Venue | PDF | - |
| AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents | Venue | PDF | - |
| Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | Venue | PDF | - |
| Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents | Venue | PDF | - |
| RedCode: Risky Code Execution and Generation Benchmark for Code Agents | - | PDF | - |
| CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities | Venue | PDF | - |
| Security of AI Agents | Venue | PDF | - |
| Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents | Venue | PDF | - |

#### Attacks

| Paper | Venue | PDF | Code |
| --- | --- | --- | --- |
| UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning | Venue | PDF | - |
| Towards Action Hijacking of Large Language Model-based Agent | Venue | PDF | - |

#### Defenses

| Paper | Venue | PDF | Code |
| --- | --- | --- | --- |
| SHIELDAGENT: Shielding Agents via Verifiable Safety Policy Reasoning | Venue | PDF | - |
| Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents | Venue | PDF | - |
| PentestAgent: Incorporating LLM Agents to Automated Penetration Testing | Venue | PDF | - |

#### Media | Reports | Tools

