The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models (arxiv link)
- 2024-11-30 🎉 This work has been accepted by COLING 2025.
- 2024-8-17 ⚠️ We have confirmed that this attack is effective against GPTs (by configuring an Action [1], a feature built on function calling). A jailbroken GPT removes the barrier to jailbreaking for non-technical users, significantly increasing the risk of AI misuse. We have therefore decided not to release the jailbroken GPT publicly.
- 2024-8-14 🎉 Variants [1] and [2] of our work were selected as the ChatGPTJailbreak community's featured jailbreak of the month for August.
- 2024-8-8 🚀🚀🚀 Our work is featured in MarkTechPost. Thanks for the attention!
- Zihui Wu (zihui@stu.xidian.edu.cn)
- Haichang Gao (hchgao@xidian.edu.cn)
- Jianping He (jianpinghe37@gmail.com)
- Ping Wang (pingwangyy@foxmail.com)
- School of Computer Science and Technology, Xidian University
Large language models (LLMs) have demonstrated remarkable capabilities, but their power comes with significant security considerations. While extensive research has been conducted on the safety of LLMs in chat mode, the security implications of their function calling feature have been largely overlooked. This paper uncovers a critical vulnerability in the function calling process of LLMs, introducing a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our empirical study, conducted on six state-of-the-art LLMs including GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-pro, reveals an alarming average success rate of over 90% for this attack. We provide a comprehensive analysis of why function calls are susceptible to such attacks and propose defensive strategies, including the use of defensive prompts. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs, contributing to the field of AI safety by identifying a previously unexplored risk, designing an effective attack method, and suggesting practical defensive measures.
Large language models (LLMs) have exhibited remarkable capabilities in generating coherent and contextually relevant responses across various domains. However, as these models grow in capability, so do the associated security risks. This research focuses on the overlooked security implications of the function calling feature in LLMs, revealing a novel "jailbreak function" attack that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters.
We introduce a novel jailbreak function, WriteNovel, designed to induce harmful content generation during function calls. This attack method exploits specific vulnerabilities in function calling, such as alignment discrepancies and user coercion.
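For orientation, below is a minimal, benign sketch (using the OpenAI Python SDK) of the mechanism the attack relies on: declaring a custom function and forcing the model to call it via `tool_choice`. The schema, description, and prompt here are placeholder assumptions, not the actual WriteNovel template from the paper.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical schema: a single free-form "content" argument. The paper's actual
# WriteNovel template is more elaborate; this only illustrates the mechanism.
tools = [{
    "type": "function",
    "function": {
        "name": "WriteNovel",
        "description": "Write a short piece of fiction.",
        "parameters": {
            "type": "object",
            "properties": {
                "content": {"type": "string", "description": "Full text of the fiction."}
            },
            "required": ["content"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Please call WriteNovel."}],
    tools=tools,
    # Forcing this specific tool is the "user coercion" lever: the model must emit
    # arguments for WriteNovel instead of answering (or refusing) in chat mode.
    tool_choice={"type": "function", "function": {"name": "WriteNovel"}},
)

print(response.choices[0].message.tool_calls[0].function.arguments)
```

Because the tool choice is fixed, any refusal would have to happen inside the generated arguments, which is where the paper reports alignment to be weaker than in chat mode.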
We evaluated the jailbreak function attack on six state-of-the-art LLMs, revealing an alarming average attack success rate of over 90%. Our analysis identified three primary reasons for the success of jailbreak function attacks:
- Alignment discrepancies where function arguments are less aligned with safety standards compared to chat mode responses.
- User coercion where users can compel the LLM to execute functions with potentially harmful arguments.
- Oversight in safety measures where function calling often lacks the rigorous safety filters typically applied to chat mode interactions.
We propose several defensive measures against jailbreak function attacks:
- Restricting User Permissions: Limiting the ability of users to mandate function execution.
- Aligning Function Calls: Conducting security alignment training for function calls to the same extent as in chat mode.
- Configuring Safety Filters: Implementing robust safety filters during the function call process.
- Incorporating Defensive Prompts: Enhancing security by adding defensive prompts to ensure function calls are safely executed (a minimal sketch follows this list).
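As a rough illustration of the last point, here is a minimal sketch of attaching a defensive prompt as a system message before issuing a function-calling request; the wording is a placeholder assumption, not the exact prompt evaluated in the paper.

```python
# Placeholder defensive prompt -- the wording evaluated in the paper may differ.
DEFENSIVE_PROMPT = (
    "Before filling in any function arguments, check whether the requested content "
    "would be harmful or violate safety policy. If so, refuse inside the argument "
    "instead of producing the harmful content."
)

def build_messages(user_prompt: str) -> list[dict]:
    """Prepend the defensive prompt as a system message ahead of the user turn."""
    return [
        {"role": "system", "content": DEFENSIVE_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
```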
This paper identifies a critical yet previously overlooked security vulnerability in LLMs: the potential for jailbreaking through function calling. Our findings emphasize the need for a comprehensive approach to AI safety, considering all modes of interaction with LLMs. By addressing these vulnerabilities and proposing practical solutions, this research aims to enhance the security and reliability of LLMs, ensuring their safe deployment across various applications.
- Clone the repository: `git clone https://github.com/wooozihui/jailbreakfunction`
- Replace the default API key with yours in `attack.py`.
- Run the attack: `python attack.py --target_model model_name` (see the example below).
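For example, to target GPT-4o (one of the models evaluated in the paper), you might run `python attack.py --target_model gpt-4o`; the exact model-name strings accepted are defined in `attack.py`, so check the script if a name is rejected.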
If you find this project helpful, please give it a star!

