fix: prevent conversation history corruption when multiple tool calls require approval #5294
gyliu513 wants to merge 1 commit into llamastack:main
Conversation
leseb
left a comment
please add a repro script and a test plan
… require approval

Fix a bug in `_separate_tool_calls()` where `next_turn_messages.pop()` was called inside the inner `for tool_call` loop, causing it to execute once per tool call needing approval instead of once per choice. When multiple tool calls in a single choice required approval, this deleted unrelated messages from the conversation history. The fix replaces the inline `pop()` calls with a `should_remove_assistant_msg` flag that is evaluated once after the inner loop completes, ensuring only the assistant message is removed.
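The faulty pattern can be shown in isolation. This is a minimal sketch, not the real `_separate_tool_calls()` (the actual function's signature, message types, and surrounding logic differ; the names and data shapes here are illustrative only):

```python
def separate_tool_calls_buggy(next_turn_messages, choices):
    """Simplified model of the bug: pop() sits inside the inner
    tool-call loop, so it fires once per approval-pending tool call
    instead of once per choice."""
    approvals = []
    for choice in choices:
        for tool_call in choice["tool_calls"]:
            if tool_call["needs_approval"]:
                approvals.append(tool_call)
                # BUG: runs once per tool call, deleting extra messages
                next_turn_messages.pop()
    return approvals


history = [
    {"role": "user", "content": "Hi, I need help looking up some users."},
    {"role": "assistant", "content": "Sure! Which users?"},
    {"role": "user", "content": "Look up alice and bob in parallel."},
    {"role": "assistant", "tool_calls": "..."},  # the message being separated
]
choices = [{"tool_calls": [
    {"name": "get_user_id", "needs_approval": True},  # alice
    {"name": "get_user_id", "needs_approval": True},  # bob
]}]

separate_tool_calls_buggy(history, choices)
# Two pops: the assistant tool-call message AND the user's request are gone.
print(len(history))  # 2 -- the "Look up alice and bob" message was lost
```

With two approval-pending tool calls, `pop()` runs twice and the user's actual request is deleted along with the assistant message.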
@leseb here are detailed steps to reproduce against the main branch, and how the fix works.

Test program:

```python
#!/usr/bin/env python3
# scripts/repro_5294.py
"""Reproduce issue #5293: conversation history corruption when multiple
MCP tool calls require approval in a single model response.

This script connects to a running llama-stack server, starts a local MCP
server with 5 tools (require_approval="always"), and sends a multi-turn
conversation that triggers 2 parallel tool calls needing approval.

Prerequisites:
1. Start a llama-stack server:
   OPENAI_API_KEY=sk-... llama stack run starter --port 8321 \\
       --providers "inference=openai,responses=builtin"
2. Run this script:
   uv run python scripts/repro_5294.py --model openai/gpt-4o-mini

Usage:
    uv run python scripts/repro_5294.py --url http://localhost:8321 --model openai/gpt-4o-mini
"""
import argparse
import sys

sys.path.insert(0, ".")

from tests.common.mcp import dependency_tools, make_mcp_server


def print_output_items(response):
    """Pretty-print response output items."""
    for i, item in enumerate(response.output):
        item_type = item.type
        if item_type == "mcp_list_tools":
            tool_names = [t.name for t in item.tools]
            print(f"  [{i}] mcp_list_tools: {tool_names}")
        elif item_type == "mcp_approval_request":
            print(f"  [{i}] mcp_approval_request: {item.name}({item.arguments})")
        elif item_type == "mcp_call":
            print(f"  [{i}] mcp_call: {item.name}({item.arguments}) -> {item.output}")
        elif item_type == "message":
            text = item.content[0].text if item.content else ""
            preview = text[:200].replace("\n", " ")
            print(f"  [{i}] message: {preview}")
        else:
            print(f"  [{i}] {item_type}")


def run_demo(client, model, mcp_server_url):
    tools = [
        {
            "type": "mcp",
            "server_label": "localmcp",
            "server_url": mcp_server_url,
            "require_approval": "always",
        }
    ]
    # Multi-turn conversation with 3 messages.
    # The bug deletes messages from this history when processing multiple
    # approval-pending tool calls, causing the model to lose context.
    input_messages = [
        {"role": "user", "content": "Hi, I need help looking up some users."},
        {"role": "assistant", "content": "Sure! Which users do you need me to look up?"},
        {
            "role": "user",
            "content": (
                'Look up the user IDs for both "alice" and "bob" at the same time. '
                "Call get_user_id for both users in parallel - do NOT call one and wait, "
                "call both at once."
            ),
        },
    ]
    print()
    print("=" * 70)
    print("Step 1: Send multi-turn conversation to trigger multiple tool calls")
    print("=" * 70)
    print(f"  Model: {model}")
    print("  require_approval: always")
    print(f"  Input messages ({len(input_messages)}):")
    for i, m in enumerate(input_messages):
        print(f"    [{i}] {m['role']}: {m['content']}")
    print()
    response = client.responses.create(
        model=model,
        input=input_messages,
        tools=tools,
        stream=False,
    )
    print(f"Response ID: {response.id}")
    print(f"Output items ({len(response.output)}):")
    print_output_items(response)

    approval_requests = [o for o in response.output if o.type == "mcp_approval_request"]
    if len(approval_requests) >= 2:
        print()
        print(f"Got {len(approval_requests)} approval requests.")
        print()
        print("=" * 70)
        print("Step 2: Approve all tool calls and continue")
        print("=" * 70)
        print()
        approval_inputs = []
        for req in approval_requests:
            print(f"  Approving: {req.name}({req.arguments})")
            approval_inputs.append(
                {
                    "type": "mcp_approval_response",
                    "approval_request_id": req.id,
                    "approve": True,
                }
            )
        print()
        response2 = client.responses.create(
            previous_response_id=response.id,
            model=model,
            input=approval_inputs,
            tools=tools,
            stream=False,
        )
        print(f"Response ID: {response2.id}")
        print(f"Output items ({len(response2.output)}):")
        print_output_items(response2)

        # Analyze result
        mcp_calls = [o for o in response2.output if o.type == "mcp_call"]
        messages = [o for o in response2.output if o.type == "message"]
        print()
        print("=" * 70)
        print("Result")
        print("=" * 70)
        if mcp_calls:
            print(f"  Tools executed: {len(mcp_calls)}")
            for call in mcp_calls:
                print(f"    {call.name}({call.arguments}) -> {call.output}")
            if messages:
                print(f"  Final answer: {messages[-1].content[0].text[:200]}")
            print()
            print("  PASS: Tools executed and correct answer returned.")
        else:
            if messages:
                print(f"  Model response: {messages[-1].content[0].text[:200]}")
            print()
            print("  FAIL: No tools were executed after approval.")
            print("  The model lost conversation context due to history corruption.")
    elif len(approval_requests) == 1:
        print()
        print("NOTE: Model only generated 1 tool call (need >= 2 to trigger the bug).")
        print("Try re-running the script.")
    else:
        print()
        print("NOTE: No approval requests found. Try re-running the script.")


def main():
    parser = argparse.ArgumentParser(
        description="Reproduce issue #5293: conversation history corruption"
    )
    parser.add_argument(
        "--url",
        default="http://localhost:8321",
        help="URL of the llama-stack server (default: http://localhost:8321)",
    )
    parser.add_argument(
        "--model",
        default="openai/gpt-4o-mini",
        help="Model ID to use (default: openai/gpt-4o-mini)",
    )
    args = parser.parse_args()

    from llama_stack_client import LlamaStackClient

    print(f"Connecting to {args.url}")
    client = LlamaStackClient(base_url=args.url)

    print("Starting MCP test server with dependency_tools (5 tools)...")
    with make_mcp_server(tools=dependency_tools()) as mcp_server_info:
        server_url = mcp_server_info["server_url"]
        print(f"MCP server ready at {server_url}")
        run_demo(client, args.model, server_url)


if __name__ == "__main__":
    main()
```

Test Overview

Part 1: Running WITHOUT the fix (main) vs. WITH the fix (assist):
| | main (without fix) | assist (with fix) |
|---|---|---|
| Step 2 output items | 2 (mcp_list_tools + message) | 4 (mcp_list_tools + 2 mcp_call + message) |
| Tools executed? | No | Yes — both get_user_id calls |
| Model response | "Please provide the usernames you'd like to look up" | "Alice: user_12345, Bob: user_67890" |
| Result | FAIL — model lost context, asked again | PASS — correct answer |
Why it fails on main: When _separate_tool_calls() processes the 2 approval-pending tool calls in Step 1, pop() runs twice. The first pop removes the assistant message (correct). The second pop removes msg[2] — the user's request "Look up alice and bob" (bug). When Step 2 reconstructs the conversation, the user's request is gone, so the model only sees "Hi, I need help" + "Sure! Which users?" and asks again.
Why it passes on assist: The fix replaces the inline pop() calls with a should_remove_assistant_msg flag. After the loop, pop() runs exactly once, removing only the assistant message. All 3 original messages are preserved.
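The fixed control flow, again as a simplified sketch (names mirror the PR description's `should_remove_assistant_msg` flag; the real `_separate_tool_calls()` differs in signature and message types, and the data shapes here are illustrative only):

```python
def separate_tool_calls_fixed(next_turn_messages, choices):
    """Simplified model of the fix: a flag is set inside the inner loop
    and evaluated once after it completes, so pop() runs at most once
    per choice, removing only the assistant message."""
    approvals = []
    for choice in choices:
        should_remove_assistant_msg = False
        for tool_call in choice["tool_calls"]:
            if tool_call["needs_approval"]:
                approvals.append(tool_call)
                should_remove_assistant_msg = True
        if should_remove_assistant_msg:
            # FIX: exactly one pop per choice
            next_turn_messages.pop()
    return approvals


history = [
    {"role": "user", "content": "Hi, I need help looking up some users."},
    {"role": "assistant", "content": "Sure! Which users?"},
    {"role": "user", "content": "Look up alice and bob in parallel."},
    {"role": "assistant", "tool_calls": "..."},  # the message being separated
]
choices = [{"tool_calls": [
    {"name": "get_user_id", "needs_approval": True},  # alice
    {"name": "get_user_id", "needs_approval": True},  # bob
]}]

separate_tool_calls_fixed(history, choices)
print(len(history))  # 3 -- all three original conversation messages survive
```

Even with two approval-pending tool calls, only the trailing assistant message is removed and the user's request stays in the history.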
Hi @gyliu513, sorry I missed this PR from you and also wound up raising #5303, which solves the same bug. #5303 tracks which tool calls were executed and recreates a new message containing only those calls, so that the tool call history is preserved, which I think might be a slightly better approach. For that reason I think one of us should close our PR. Since you got to the bug first, if you agree with the approach in #5303, I'm happy to close my PR and we can have this one updated.
Good point, thanks @jaideepr97. I think your fix is good; let me close this PR and we can follow up on yours.
What does this PR do?

Fixes #5293.

Fix a bug in `_separate_tool_calls()` where `next_turn_messages.pop()` was called inside the inner `for tool_call` loop, causing it to execute once per tool call needing approval instead of once per choice. When multiple tool calls in a single choice required approval, this deleted unrelated messages from the conversation history.

The fix replaces the inline `pop()` calls with a `should_remove_assistant_msg` flag that is evaluated once after the inner loop completes, ensuring only the assistant message is removed.

Test Plan

See the repro script and the before/after comparison table in the conversation above.