fix: prevent conversation history corruption when multiple tool calls require approval #5294
gyliu513 wants to merge 1 commit into llamastack:main
Conversation
leseb
left a comment
please add a repro script and a test plan
… require approval

Fix a bug in `_separate_tool_calls()` where `next_turn_messages.pop()` was called inside the inner `for tool_call` loop, causing it to execute once per tool call needing approval instead of once per choice. When multiple tool calls in a single choice required approval, this deleted unrelated messages from the conversation history. The fix replaces the inline `pop()` calls with a `should_remove_assistant_msg` flag that is evaluated once after the inner loop completes, ensuring only the assistant message is removed.
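The faulty pattern can be shown in isolation. This is a minimal sketch, not the real `_separate_tool_calls()` (the actual function's signature, message types, and surrounding logic differ; the names and data shapes here are illustrative only):

```python
def separate_tool_calls_buggy(next_turn_messages, choices):
    """Simplified model of the bug: pop() sits inside the inner
    tool-call loop, so it fires once per approval-pending tool call
    instead of once per choice."""
    approvals = []
    for choice in choices:
        for tool_call in choice["tool_calls"]:
            if tool_call["needs_approval"]:
                approvals.append(tool_call)
                # BUG: runs once per tool call, deleting extra messages
                next_turn_messages.pop()
    return approvals


history = [
    {"role": "user", "content": "Hi, I need help looking up some users."},
    {"role": "assistant", "content": "Sure! Which users?"},
    {"role": "user", "content": "Look up alice and bob in parallel."},
    {"role": "assistant", "tool_calls": "..."},  # the message being separated
]
choices = [{"tool_calls": [
    {"name": "get_user_id", "needs_approval": True},  # alice
    {"name": "get_user_id", "needs_approval": True},  # bob
]}]

separate_tool_calls_buggy(history, choices)
# Two pops: the assistant tool-call message AND the user's request are gone.
print(len(history))  # 2 -- the "Look up alice and bob" message was lost
```

With two approval-pending tool calls, `pop()` runs twice and the user's actual request is deleted along with the assistant message.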
@leseb here are detailed steps to reproduce against the main branch, and how the fix works.

Test program:

```python
#!/usr/bin/env python3
# scripts/repro_5294.py
"""Reproduce issue #5293: conversation history corruption when multiple
MCP tool calls require approval in a single model response.

This script connects to a running llama-stack server, starts a local MCP
server with 5 tools (require_approval="always"), and sends a multi-turn
conversation that triggers 2 parallel tool calls needing approval.

Prerequisites:
1. Start a llama-stack server:
   OPENAI_API_KEY=sk-... llama stack run starter --port 8321 \\
       --providers "inference=openai,responses=builtin"
2. Run this script:
   uv run python scripts/repro_5294.py --model openai/gpt-4o-mini

Usage:
    uv run python scripts/repro_5294.py --url http://localhost:8321 --model openai/gpt-4o-mini
"""
import argparse
import sys

sys.path.insert(0, ".")

from tests.common.mcp import dependency_tools, make_mcp_server


def print_output_items(response):
    """Pretty-print response output items."""
    for i, item in enumerate(response.output):
        item_type = item.type
        if item_type == "mcp_list_tools":
            tool_names = [t.name for t in item.tools]
            print(f"  [{i}] mcp_list_tools: {tool_names}")
        elif item_type == "mcp_approval_request":
            print(f"  [{i}] mcp_approval_request: {item.name}({item.arguments})")
        elif item_type == "mcp_call":
            print(f"  [{i}] mcp_call: {item.name}({item.arguments}) -> {item.output}")
        elif item_type == "message":
            text = item.content[0].text if item.content else ""
            preview = text[:200].replace("\n", " ")
            print(f"  [{i}] message: {preview}")
        else:
            print(f"  [{i}] {item_type}")


def run_demo(client, model, mcp_server_url):
    tools = [
        {
            "type": "mcp",
            "server_label": "localmcp",
            "server_url": mcp_server_url,
            "require_approval": "always",
        }
    ]
    # Multi-turn conversation with 3 messages.
    # The bug deletes messages from this history when processing multiple
    # approval-pending tool calls, causing the model to lose context.
    input_messages = [
        {"role": "user", "content": "Hi, I need help looking up some users."},
        {"role": "assistant", "content": "Sure! Which users do you need me to look up?"},
        {
            "role": "user",
            "content": (
                'Look up the user IDs for both "alice" and "bob" at the same time. '
                "Call get_user_id for both users in parallel - do NOT call one and wait, "
                "call both at once."
            ),
        },
    ]
    print()
    print("=" * 70)
    print("Step 1: Send multi-turn conversation to trigger multiple tool calls")
    print("=" * 70)
    print(f"  Model: {model}")
    print("  require_approval: always")
    print(f"  Input messages ({len(input_messages)}):")
    for i, m in enumerate(input_messages):
        print(f"    [{i}] {m['role']}: {m['content']}")
    print()
    response = client.responses.create(
        model=model,
        input=input_messages,
        tools=tools,
        stream=False,
    )
    print(f"Response ID: {response.id}")
    print(f"Output items ({len(response.output)}):")
    print_output_items(response)

    approval_requests = [o for o in response.output if o.type == "mcp_approval_request"]
    if len(approval_requests) >= 2:
        print()
        print(f"Got {len(approval_requests)} approval requests.")
        print()
        print("=" * 70)
        print("Step 2: Approve all tool calls and continue")
        print("=" * 70)
        print()
        approval_inputs = []
        for req in approval_requests:
            print(f"  Approving: {req.name}({req.arguments})")
            approval_inputs.append(
                {
                    "type": "mcp_approval_response",
                    "approval_request_id": req.id,
                    "approve": True,
                }
            )
        print()
        response2 = client.responses.create(
            previous_response_id=response.id,
            model=model,
            input=approval_inputs,
            tools=tools,
            stream=False,
        )
        print(f"Response ID: {response2.id}")
        print(f"Output items ({len(response2.output)}):")
        print_output_items(response2)

        # Analyze result
        mcp_calls = [o for o in response2.output if o.type == "mcp_call"]
        messages = [o for o in response2.output if o.type == "message"]
        print()
        print("=" * 70)
        print("Result")
        print("=" * 70)
        if mcp_calls:
            print(f"  Tools executed: {len(mcp_calls)}")
            for call in mcp_calls:
                print(f"    {call.name}({call.arguments}) -> {call.output}")
            if messages:
                print(f"  Final answer: {messages[-1].content[0].text[:200]}")
            print()
            print("  PASS: Tools executed and correct answer returned.")
        else:
            if messages:
                print(f"  Model response: {messages[-1].content[0].text[:200]}")
            print()
            print("  FAIL: No tools were executed after approval.")
            print("  The model lost conversation context due to history corruption.")
    elif len(approval_requests) == 1:
        print()
        print("NOTE: Model only generated 1 tool call (need >= 2 to trigger the bug).")
        print("Try re-running the script.")
    else:
        print()
        print("NOTE: No approval requests found. Try re-running the script.")


def main():
    parser = argparse.ArgumentParser(
        description="Reproduce issue #5293: conversation history corruption"
    )
    parser.add_argument(
        "--url",
        default="http://localhost:8321",
        help="URL of the llama-stack server (default: http://localhost:8321)",
    )
    parser.add_argument(
        "--model",
        default="openai/gpt-4o-mini",
        help="Model ID to use (default: openai/gpt-4o-mini)",
    )
    args = parser.parse_args()

    from llama_stack_client import LlamaStackClient

    print(f"Connecting to {args.url}")
    client = LlamaStackClient(base_url=args.url)

    print("Starting MCP test server with dependency_tools (5 tools)...")
    with make_mcp_server(tools=dependency_tools()) as mcp_server_info:
        server_url = mcp_server_info["server_url"]
        print(f"MCP server ready at {server_url}")
        run_demo(client, args.model, server_url)


if __name__ == "__main__":
    main()
```

Test Overview

Part 1: Running WITHOUT the fix (main) vs. WITH the fix (assist):
| | main (without fix) | assist (with fix) |
|---|---|---|
| Step 2 output items | 2 (mcp_list_tools + message) | 4 (mcp_list_tools + 2 mcp_call + message) |
| Tools executed? | No | Yes — both get_user_id calls |
| Model response | "Please provide the usernames you'd like to look up" | "Alice: user_12345, Bob: user_67890" |
| Result | FAIL — model lost context, asked again | PASS — correct answer |
Why it fails on main: When _separate_tool_calls() processes the 2 approval-pending tool calls in Step 1, pop() runs twice. The first pop removes the assistant message (correct). The second pop removes msg[2] — the user's request "Look up alice and bob" (bug). When Step 2 reconstructs the conversation, the user's request is gone, so the model only sees "Hi, I need help" + "Sure! Which users?" and asks again.
Why it passes on assist: The fix replaces the inline pop() calls with a should_remove_assistant_msg flag. After the loop, pop() runs exactly once, removing only the assistant message. All 3 original messages are preserved.
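The fixed control flow, again as a simplified sketch (names mirror the PR description's `should_remove_assistant_msg` flag; the real `_separate_tool_calls()` differs in signature and message types, and the data shapes here are illustrative only):

```python
def separate_tool_calls_fixed(next_turn_messages, choices):
    """Simplified model of the fix: a flag is set inside the inner loop
    and evaluated once after it completes, so pop() runs at most once
    per choice, removing only the assistant message."""
    approvals = []
    for choice in choices:
        should_remove_assistant_msg = False
        for tool_call in choice["tool_calls"]:
            if tool_call["needs_approval"]:
                approvals.append(tool_call)
                should_remove_assistant_msg = True
        if should_remove_assistant_msg:
            # FIX: exactly one pop per choice
            next_turn_messages.pop()
    return approvals


history = [
    {"role": "user", "content": "Hi, I need help looking up some users."},
    {"role": "assistant", "content": "Sure! Which users?"},
    {"role": "user", "content": "Look up alice and bob in parallel."},
    {"role": "assistant", "tool_calls": "..."},  # the message being separated
]
choices = [{"tool_calls": [
    {"name": "get_user_id", "needs_approval": True},  # alice
    {"name": "get_user_id", "needs_approval": True},  # bob
]}]

separate_tool_calls_fixed(history, choices)
print(len(history))  # 3 -- all three original conversation messages survive
```

Even with two approval-pending tool calls, only the trailing assistant message is removed and the user's request stays in the history.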
Hi @gyliu513, sorry I missed this PR from you and also wound up raising #5303, which solves the same bug. #5303 tracks which tool calls were executed and recreates a new message containing only those calls, so that the tool call history is preserved, which I think might be a slightly better approach. For that reason I think one of us should close our PR. Since you got to the bug first, if you agree with the approach in #5303, I'm happy to close my PR and we can have this one updated.
Good point, thanks @jaideepr97. I think your fix is good; let me close this PR and we can follow up on yours.
What does this PR do?

Fixes #5293.

Fix a bug in `_separate_tool_calls()` where `next_turn_messages.pop()` was called inside the inner `for tool_call` loop, causing it to execute once per tool call needing approval instead of once per choice. When multiple tool calls in a single choice required approval, this deleted unrelated messages from the conversation history.

The fix replaces the inline `pop()` calls with a `should_remove_assistant_msg` flag that is evaluated once after the inner loop completes, ensuring only the assistant message is removed.

Test Plan

See the repro script and the before/after comparison table in the conversation above.