fix: prevent conversation history corruption when multiple tool calls require approval#5294

Closed
gyliu513 wants to merge 1 commit into llamastack:main from gyliu513:assist

Conversation

@gyliu513
Contributor

Fixed #5293

Fix a bug in _separate_tool_calls() where next_turn_messages.pop() was called inside the inner for tool_call loop, causing it to execute once per tool call needing approval instead of once per choice. When multiple tool calls in a single choice required approval, this deleted unrelated messages from the conversation history.

The fix replaces the inline pop() calls with a should_remove_assistant_msg flag that is evaluated once after the inner loop completes, ensuring only the assistant message is removed.
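A minimal sketch of the before/after logic (simplified; the real code lives in _separate_tool_calls() in streaming.py and its names and data structures differ):

```python
# Simplified illustration of the bug and the fix; names are illustrative.

def separate_buggy(next_turn_messages, tool_calls, needs_approval):
    # BUG: pop() runs once per approval-pending tool call, so a second
    # pending call deletes an unrelated, earlier history message.
    for tool_call in tool_calls:
        if needs_approval(tool_call):
            next_turn_messages.pop()

def separate_fixed(next_turn_messages, tool_calls, needs_approval):
    # FIX: set a flag inside the loop, pop at most once afterwards, so
    # only the trailing assistant message is removed.
    should_remove_assistant_msg = False
    for tool_call in tool_calls:
        if needs_approval(tool_call):
            should_remove_assistant_msg = True
    if should_remove_assistant_msg:
        next_turn_messages.pop()

history = [
    "user: hi",
    "assistant: which users?",
    "user: alice and bob",
    "assistant: <2 tool calls>",
]
buggy, fixed = list(history), list(history)
separate_buggy(buggy, ["call_alice", "call_bob"], lambda c: True)
separate_fixed(fixed, ["call_alice", "call_bob"], lambda c: True)
print(buggy)  # the "alice and bob" user turn is gone
print(fixed)  # only the trailing assistant message was removed
```

With two approval-pending calls, the buggy version pops twice and deletes the user's request; the fixed version pops exactly once.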


@meta-cla bot added the CLA Signed label on Mar 25, 2026
@leseb
Collaborator

please add a repro script and a test plan

@gyliu513
Contributor Author

@leseb here are detailed steps to reproduce this against the main branch and to show how the fix works.

Test program

#!/usr/bin/env python3
# scripts/repro_5294.py

"""Reproduce issue #5293: conversation history corruption when multiple
MCP tool calls require approval in a single model response.

This script connects to a running llama-stack server, starts a local MCP
server with 5 tools (require_approval="always"), and sends a multi-turn
conversation that triggers 2 parallel tool calls needing approval.

Prerequisites:
    1. Start a llama-stack server:
       OPENAI_API_KEY=sk-... llama stack run starter --port 8321 \\
           --providers "inference=openai,responses=builtin"

    2. Run this script:
       uv run python scripts/repro_5294.py --model openai/gpt-4o-mini

Usage:
    uv run python scripts/repro_5294.py --url http://localhost:8321 --model openai/gpt-4o-mini
"""

import argparse
import sys

sys.path.insert(0, ".")
from tests.common.mcp import dependency_tools, make_mcp_server


def print_output_items(response):
    """Pretty-print response output items."""
    for i, item in enumerate(response.output):
        item_type = item.type
        if item_type == "mcp_list_tools":
            tool_names = [t.name for t in item.tools]
            print(f"  [{i}] mcp_list_tools: {tool_names}")
        elif item_type == "mcp_approval_request":
            print(f"  [{i}] mcp_approval_request: {item.name}({item.arguments})")
        elif item_type == "mcp_call":
            print(f"  [{i}] mcp_call: {item.name}({item.arguments}) -> {item.output}")
        elif item_type == "message":
            text = item.content[0].text if item.content else ""
            preview = text[:200].replace("\n", " ")
            print(f"  [{i}] message: {preview}")
        else:
            print(f"  [{i}] {item_type}")


def run_demo(client, model, mcp_server_url):
    tools = [
        {
            "type": "mcp",
            "server_label": "localmcp",
            "server_url": mcp_server_url,
            "require_approval": "always",
        }
    ]

    # Multi-turn conversation with 3 messages.
    # The bug deletes messages from this history when processing multiple
    # approval-pending tool calls, causing the model to lose context.
    input_messages = [
        {"role": "user", "content": "Hi, I need help looking up some users."},
        {"role": "assistant", "content": "Sure! Which users do you need me to look up?"},
        {
            "role": "user",
            "content": (
                'Look up the user IDs for both "alice" and "bob" at the same time. '
                "Call get_user_id for both users in parallel - do NOT call one and wait, "
                "call both at once."
            ),
        },
    ]

    print()
    print("=" * 70)
    print("Step 1: Send multi-turn conversation to trigger multiple tool calls")
    print("=" * 70)
    print(f"  Model: {model}")
    print("  require_approval: always")
    print(f"  Input messages ({len(input_messages)}):")
    for i, m in enumerate(input_messages):
        print(f"    [{i}] {m['role']}: {m['content']}")
    print()

    response = client.responses.create(
        model=model,
        input=input_messages,
        tools=tools,
        stream=False,
    )

    print(f"Response ID: {response.id}")
    print(f"Output items ({len(response.output)}):")
    print_output_items(response)

    approval_requests = [o for o in response.output if o.type == "mcp_approval_request"]

    if len(approval_requests) >= 2:
        print()
        print(f"Got {len(approval_requests)} approval requests.")
        print()
        print("=" * 70)
        print("Step 2: Approve all tool calls and continue")
        print("=" * 70)
        print()

        approval_inputs = []
        for req in approval_requests:
            print(f"  Approving: {req.name}({req.arguments})")
            approval_inputs.append(
                {
                    "type": "mcp_approval_response",
                    "approval_request_id": req.id,
                    "approve": True,
                }
            )

        print()
        response2 = client.responses.create(
            previous_response_id=response.id,
            model=model,
            input=approval_inputs,
            tools=tools,
            stream=False,
        )

        print(f"Response ID: {response2.id}")
        print(f"Output items ({len(response2.output)}):")
        print_output_items(response2)

        # Analyze result
        mcp_calls = [o for o in response2.output if o.type == "mcp_call"]
        messages = [o for o in response2.output if o.type == "message"]

        print()
        print("=" * 70)
        print("Result")
        print("=" * 70)
        if mcp_calls:
            print(f"  Tools executed: {len(mcp_calls)}")
            for call in mcp_calls:
                print(f"    {call.name}({call.arguments}) -> {call.output}")
            if messages:
                print(f"  Final answer: {messages[-1].content[0].text[:200]}")
            print()
            print("  PASS: Tools executed and correct answer returned.")
        else:
            if messages:
                print(f"  Model response: {messages[-1].content[0].text[:200]}")
            print()
            print("  FAIL: No tools were executed after approval.")
            print("  The model lost conversation context due to history corruption.")

    elif len(approval_requests) == 1:
        print()
        print("NOTE: Model only generated 1 tool call (need >= 2 to trigger the bug).")
        print("Try re-running the script.")
    else:
        print()
        print("NOTE: No approval requests found. Try re-running the script.")


def main():
    parser = argparse.ArgumentParser(
        description="Reproduce issue #5293: conversation history corruption"
    )
    parser.add_argument(
        "--url",
        default="http://localhost:8321",
        help="URL of the llama-stack server (default: http://localhost:8321)",
    )
    parser.add_argument(
        "--model",
        default="openai/gpt-4o-mini",
        help="Model ID to use (default: openai/gpt-4o-mini)",
    )
    args = parser.parse_args()

    from llama_stack_client import LlamaStackClient

    print(f"Connecting to {args.url}")
    client = LlamaStackClient(base_url=args.url)

    print("Starting MCP test server with dependency_tools (5 tools)...")
    with make_mcp_server(tools=dependency_tools()) as mcp_server_info:
        server_url = mcp_server_info["server_url"]
        print(f"MCP server ready at {server_url}")
        run_demo(client, args.model, server_url)


if __name__ == "__main__":
    main()

Test Overview

Bug: _separate_tool_calls() in streaming.py calls next_turn_messages.pop() inside the inner for tool_call loop. When a model response contains 2+ MCP tool calls needing approval, pop() runs once per tool call instead of once total, deleting unrelated messages from conversation history.

Test program: scripts/repro_5294.py: connects to a running llama-stack server, starts a local MCP server with 5 tools (require_approval="always"), sends a 3-message multi-turn conversation designed to trigger 2 parallel get_user_id calls.


Part 1: Running WITHOUT the fix (main branch)

Step 1.1 — Check out main branch and start the server:

git checkout main
OPENAI_API_KEY=sk-... llama stack run starter --port 8321 \
    --providers "inference=openai,responses=builtin"

Step 1.2 — Run the test program (from a second terminal):

uv run python scripts/repro_5294.py --url http://localhost:8321 --model openai/gpt-4o-mini

Step 1.3 — Output:

======================================================================
Step 1: Send multi-turn conversation to trigger multiple tool calls
======================================================================
  Model: openai/gpt-4o-mini
  require_approval: always
  Input messages (3):
    [0] user: Hi, I need help looking up some users.
    [1] assistant: Sure! Which users do you need me to look up?
    [2] user: Look up the user IDs for both "alice" and "bob" at the same time. ...

Response ID: resp_b9ae37e2-9f7e-4d23-9904-19457b29d698
Output items (3):
  [0] mcp_list_tools: ['get_user_id', 'get_user_permissions', ...]
  [1] mcp_approval_request: get_user_id({"username": "alice"})
  [2] mcp_approval_request: get_user_id({"username": "bob"})

Got 2 approval requests.

======================================================================
Step 2: Approve all tool calls and continue
======================================================================

  Approving: get_user_id({"username": "alice"})
  Approving: get_user_id({"username": "bob"})

Response ID: resp_fbddc0c3-8c52-40c4-bf49-9fa790674fce
Output items (2):
  [0] mcp_list_tools: ['get_user_id', 'get_user_permissions', ...]
  [1] message: Please provide the usernames you'd like to look up, and
               I'll get their information for you.

======================================================================
Result
======================================================================
  Model response: Please provide the usernames you'd like to look up,
                  and I'll get their information for you.

  FAIL: No tools were executed after approval.
  The model lost conversation context due to history corruption.

Part 2: Running WITH the fix (assist branch)

Step 2.1 — Stop the server, check out assist branch, restart:

# Stop the previous server (Ctrl+C), then:
git checkout assist
OPENAI_API_KEY=sk-... llama stack run starter --port 8321 \
    --providers "inference=openai,responses=builtin"

Step 2.2 — Run the same test program:

uv run python scripts/repro_5294.py --url http://localhost:8321 --model openai/gpt-4o-mini

Step 2.3 — Output:

======================================================================
Step 1: Send multi-turn conversation to trigger multiple tool calls
======================================================================
  Model: openai/gpt-4o-mini
  require_approval: always
  Input messages (3):
    [0] user: Hi, I need help looking up some users.
    [1] assistant: Sure! Which users do you need me to look up?
    [2] user: Look up the user IDs for both "alice" and "bob" at the same time. ...

Response ID: resp_365c923b-ef2f-4a7f-b23d-52ea60c242a8
Output items (3):
  [0] mcp_list_tools: ['get_user_id', 'get_user_permissions', ...]
  [1] mcp_approval_request: get_user_id({"username": "alice"})
  [2] mcp_approval_request: get_user_id({"username": "bob"})

Got 2 approval requests.

======================================================================
Step 2: Approve all tool calls and continue
======================================================================

  Approving: get_user_id({"username": "alice"})
  Approving: get_user_id({"username": "bob"})

Response ID: resp_e43d79d7-eaf1-45b9-9242-07fcaee45a19
Output items (4):
  [0] mcp_list_tools: ['get_user_id', 'get_user_permissions', ...]
  [1] mcp_call: get_user_id({"username": "alice"}) -> user_12345
  [2] mcp_call: get_user_id({"username": "bob"}) -> user_67890
  [3] message: The user IDs are as follows: - Alice: user_12345 - Bob: user_67890

======================================================================
Result
======================================================================
  Tools executed: 2
    get_user_id({"username": "alice"}) -> user_12345
    get_user_id({"username": "bob"}) -> user_67890
  Final answer: The user IDs are as follows:
    - Alice: user_12345
    - Bob: user_67890

  PASS: Tools executed and correct answer returned.

Part 3: Comparison

Both runs use identical inputs: same 3-message conversation, same model (openai/gpt-4o-mini), same MCP server, same require_approval="always". Both Step 1 responses are identical — 2 mcp_approval_request items. The difference is in Step 2 after approving both tool calls:

|                     | main (without fix)                                    | assist (with fix)                          |
|---------------------|-------------------------------------------------------|--------------------------------------------|
| Step 2 output items | 2 (mcp_list_tools + message)                          | 4 (mcp_list_tools + 2 mcp_call + message)  |
| Tools executed?     | No                                                    | Yes, both get_user_id calls                |
| Model response      | "Please provide the usernames you'd like to look up"  | "Alice: user_12345, Bob: user_67890"       |
| Result              | FAIL (model lost context, asked again)                | PASS (correct answer)                      |

Why it fails on main: When _separate_tool_calls() processes the 2 approval-pending tool calls in Step 1, pop() runs twice. The first pop removes the assistant message (correct). The second pop removes msg[2] — the user's request "Look up alice and bob" (bug). When Step 2 reconstructs the conversation, the user's request is gone, so the model only sees "Hi, I need help" + "Sure! Which users?" and asks again.

Why it passes on assist: The fix replaces the inline pop() calls with a should_remove_assistant_msg flag. After the loop, pop() runs exactly once, removing only the assistant message. All 3 original messages are preserved.

@jaideepr97
Contributor

Hi @gyliu513, sorry I missed this PR from you and wound up raising #5303 to solve the same bug.
I initially saw the same fix from Claude as the one included in your PR, but I think this approach does not handle the mixed case as well: when some tool calls are approved or don't need approval and do end up getting executed, this approach still pops the whole message, leaving orphaned tool call results in the subsequent messages.

#5303 tracks which tool calls were executed and recreates a new message containing only those calls, so the tool call history is preserved, which I think might be a slightly better approach.

For that reason I think one of us should close our PR. Since you got to the bug first, if you agree with the approach in #5303 I'm happy to close my PR and we can have this one updated; otherwise we could proceed with #5303. Let me know if you have thoughts either way.
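To make the mixed case concrete, here is a hedged sketch using a toy dict-based message model (not the real types): one tool call executes immediately while another awaits approval. Popping the whole assistant message orphans the executed call's result, whereas rebuilding the assistant message with only the executed calls, as #5303 is described as doing, keeps the history consistent.

```python
# Toy message model; the real implementation uses different structures.
history = [
    {"role": "user", "content": "look up alice and bob"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "needs_approval": False},  # executed immediately
        {"id": "call_2", "needs_approval": True},   # awaiting approval
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "user_12345"},
]

def orphaned_tool_results(msgs):
    # A tool result is orphaned if no remaining assistant message
    # declares a tool call with its id.
    known = {tc["id"] for m in msgs for tc in m.get("tool_calls", [])}
    return [m for m in msgs
            if m["role"] == "tool" and m["tool_call_id"] not in known]

# Flag-based fix: any pending call drops the whole assistant message,
# leaving call_1's result with no matching tool_calls entry.
popped = [m for m in history
          if not (m["role"] == "assistant"
                  and any(tc["needs_approval"] for tc in m["tool_calls"]))]

# #5303-style fix (as described above): keep the assistant message but
# rebuild it with only the tool calls that actually executed.
rebuilt = []
for m in history:
    if m["role"] == "assistant" and any(tc["needs_approval"] for tc in m["tool_calls"]):
        executed = [tc for tc in m["tool_calls"] if not tc["needs_approval"]]
        if executed:
            rebuilt.append({"role": "assistant", "tool_calls": executed})
    else:
        rebuilt.append(m)

print(len(orphaned_tool_results(popped)))   # → 1: call_1's result is orphaned
print(len(orphaned_tool_results(rebuilt)))  # → 0: history stays consistent
```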

@gyliu513
Contributor Author

Good point, thanks @jaideepr97. I think your fix is good; let me close this PR and we can follow up on yours.


Development

Successfully merging this pull request may close these issues.

_separate_tool_calls corrupts conversation history when multiple MCP tool calls require approval