fix(file_manager): output-token truncation guards + append_file tool (clean cherry-pick from #55)#64

Open
Starlitnightly wants to merge 13 commits into main from fix/file-manager-guards

Conversation

@Starlitnightly (Collaborator) commented Apr 4, 2026

Summary

Clean cherry-pick of the feature commits from PR #55, without the problematic revert history in the dev branch.

Problem with PR #55

PR #55 (dev → main) carried revert history from dev that would erroneously delete files restored by PR #63:

  • pantheon/utils/token_optimization.py (-1,776 lines)
  • tests/test_token_optimization.py (-1,498 lines)
  • scripts/benchmark_prompt_cache.py (-385 lines)
  • scripts/benchmark_token_optimizations.py (-535 lines)
  • Plus rollback of token optimization integration in agent.py, pantheon.py, truncate.py

This happened because dev merged main's revert commit (3f1343a) via PR #61, so its diff against current main shows deletions of files that PR #63 just restored.

Solution

Cherry-picked only the 4 actual feature commits from dev, skipping all merge/revert commits:

  1. c6e57e4 — fix(prompts): add failure recovery protocol and scientific writing gate
  2. e3e073f — fix(file_manager): add output-token truncation guards and append_file tool
  3. c37a3c7 — test: add comprehensive tests for output-token truncation guards
  4. 7920a72 — fix(llm): set max_tokens to model's max output + raise tool guard thresholds

Changes

  • pantheon/toolsets/file/file_manager.py — output-token truncation guards, append_file tool, size formatting
  • pantheon/utils/llm.py — set max_tokens from model's max_output_tokens
  • pantheon/factory/templates/prompts/delegation.md — failure recovery protocol
  • pantheon/factory/templates/teams/default.md — scientific writing gate
  • tests/test_file_manager.py — comprehensive truncation guard tests

What's NOT included (intentionally)

All token optimization deletions from the dev branch revert history — those files are preserved intact from PR #63.

Test plan

hazelian0619 and others added 10 commits April 3, 2026 21:57
Problem A (partial): Add MANDATORY scientific writing gate to default.md —
Leader must delegate to Researcher before writing any domain paper. Clarify
Scientific Illustrator scope (schematic/pathway diagrams only, not data plots).

Problem C: Add Failure Recovery section to delegation.md — three-tier ladder
for file write failures (Two-Phase Write Protocol → format downgrade → inline)
and sub-agent failures (narrow retry → self-execute → partial output). Hard
rule: never terminate without producing at least one artifact.

Validated by experiment (2026-03-30):
- Case 3 (SSR1/GWAS): Leader called 3x parallel Researcher before any content;
  Researchers produced 978 lines across 3 reports using Two-Phase Write Protocol
- Case 0 (EC paper): Leader called 2x parallel Researcher; BibTeX built to 397
  lines via append_file batches (vs. previous silent truncation at char 88);
  PDF artifact (117KB) delivered despite E2BIG and relay-API update_file errors

New bugs discovered (tracked separately):
- Relay API truncates update_file tool call args mid-generation (high severity)
- think tool infinite loop at ~90K token context (medium severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(file_manager): add output-token truncation guards and append_file tool

P0 bug: when LLM generates large files (LaTeX papers, BibTeX) in a single
write_file/update_file call, the relay API truncates the output stream mid-JSON,
causing 'Unterminated string' parse errors and silent data loss.

Root cause: LLM output token limit is separate from context window. File content
in tool call parameters must be generated as LLM output, hitting max_tokens before
the JSON closes. LaTeX/BibTeX content with escape chars inflates token count ~1.5x.

Changes:
- write_file: hard reject content > 12,000 chars; docstring teaches Two-Phase
  Write Protocol (scaffold first, fill by section, append for lists/bib)
- append_file: new tool for chunked appending; 6,000 char limit; requires file
  to exist first; primary use case is BibTeX batches (<=10 entries per call)
- update_file: hard reject new_string > 8,000 chars with guidance to split
  section into smaller semantic units

Validated against 20-case baseline (15% success rate before fix):
- Case 1 (LaTeX review paper, previously FAIL): now generates full PDF with
  44 references via append_file batches — confirmed in controlled re-run
- Agent proactively adopted Two-Phase protocol after reading docstring (0
  content_too_large rejections; protocol was followed before guard triggered)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests for PR #52 file manager changes:
- write_file: reject >12K, accept at limit, file not created on reject
- append_file: basic append, multi-batch (BibTeX pattern), reject
  nonexistent file, reject >6K, accept at limit
- update_file: reject new_string >8K, accept at limit, original unchanged
- Two-Phase Write Protocol end-to-end: scaffold → section fill → append

14/14 file manager tests passing.
fix(llm): set max_tokens to model's max output + raise tool guard thresholds

Root cause fix: acompletion_litellm() never passed max_tokens (output)
to litellm. Anthropic models default to 4096 output tokens, causing
tool_use JSON to be truncated mid-generation when the model writes
large file content.

Fix: auto-detect model's max_output_tokens via litellm.get_model_info()
and set it as kwargs["max_tokens"] when not already specified by
model_params.
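The merge logic of the fix can be sketched in isolation. In the real code the capacity lookup goes through `litellm.get_model_info()` (later replaced by `provider_registry.get_model_info`); here it is stubbed as a plain dict so the inject-only-when-unset behavior is visible. Model names and token values are illustrative.

```python
# Sketch: fill max_tokens from the model's known output capacity,
# but never override a value the caller set explicitly.
MODEL_INFO = {  # stand-in for litellm/provider_registry model info
    "claude-sonnet-4.6": {"max_output_tokens": 64000},
    "gpt-4o-mini": {"max_output_tokens": 16384},
}

def resolve_max_tokens(model: str, model_params: dict) -> dict:
    params = dict(model_params)
    info = MODEL_INFO.get(model, {})
    if "max_tokens" not in params and info.get("max_output_tokens"):
        params["max_tokens"] = info["max_output_tokens"]
    return params
```

Without this, providers fall back to their own defaults (e.g. 4096 output tokens for Anthropic models), which is what truncated the tool_use JSON mid-generation.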

With the root cause fixed, the tool-level size guards from PR #52 are
now defense-in-depth (not the primary fix). Raised thresholds to match
actual output capacity:
- write_file: 12K → 40K chars
- update_file: 8K → 30K chars
- append_file: 6K → 20K chars

Thresholds moved to class-level constants (WRITE_FILE_MAX_CHARS, etc.)
for easy per-deployment tuning. Tests updated to reference constants
instead of hardcoded values.

14/14 file manager tests passing.
Replace litellm.utils.get_model_info with provider_registry.get_model_info
and fix kwargs → model_params for the new acompletion signature.

Add tests:
- test_max_tokens_auto_set: verify catalog has correct max_output_tokens
- test_max_tokens_live_openai: live API test (skipped without OPENAI_API_KEY)
…ite_file

Root cause of output truncation was missing max_tokens (fixed in 7920a72),
not tool-level size limits. The guards caused worse problems for code
generation (repeated rejections, lost content, wasted tokens).

Changes:
- Remove WRITE_FILE_MAX_CHARS/APPEND_FILE_MAX_CHARS/UPDATE_FILE_MAX_CHARS
- Remove content_too_large guards from write_file/update_file
- Merge append_file into write_file(append=True)
- Update tests: verify large content writes succeed (100K+ chars)
- Add live integration test (paper + code scenarios)

Live test results (gpt-4.1-mini, no guards):
- LaTeX paper: 9,208 chars, 1 write_file call, 0 rejections
- Python code: 13,754 chars (314 lines, 4 classes), 1 call, 0 rejections

Previously with guards (2K limit test):
- Python code: 24 calls, 8 rejections, 3/4 classes MISSING
Eliminated the detailed failure-recovery guidelines for tool and sub-agent errors from delegation.md to streamline the document. This simplifies the prompts and keeps the focus on essential delegation instructions.
@Starlitnightly Starlitnightly requested a review from zqbake April 4, 2026 05:59
- Added new models to the LLM catalog, including "anthropic" and "google-auth" with their respective dependencies and configurations.
- Implemented dynamic handling of output token parameters across different providers to ensure compatibility and prevent truncation issues.
- Updated the LLM response handling to utilize model-specific output token limits, improving the robustness of API calls.
- Enhanced tests to verify the correct application of output token parameters and model information retrieval.

This update aims to streamline interactions with various LLM providers and improve overall system reliability.
- Updated the logic in acompletion_responses and acompletion functions to ensure that the output token parameter is validated alongside the model's max output tokens.
- Modified the get_output_token_param function to return None when no valid parameter is found, enhancing type safety and clarity in handling token parameters.

These changes aim to improve the robustness of token management across different LLM providers.
- Eliminated the output token recovery functions and related logic from the OpenAIAdapter class, streamlining the codebase.
- Updated the acompletion method to remove unnecessary recovery attempts, enhancing clarity and maintainability.

These changes focus on simplifying the adapter's implementation while ensuring it remains effective for handling OpenAI API interactions.
@Starlitnightly (Collaborator, Author) commented:

This change fixes failures caused by provider / endpoint-specific output-token parameter differences, and makes the catalog the single source of truth for those parameter names.

What changed:

  • Added explicit output-token parameter metadata to pantheon/utils/llm_catalog.json:
    • chat_output_token_param
    • responses_output_token_param
  • Added catalog metadata for openai/gpt-4o-mini, including max_output_tokens=16384, to avoid falling back to the old default of 32000 and triggering completion-token limit errors.
  • Added get_output_token_param() in pantheon/utils/provider_registry.py. This helper now reads only from the catalog and returns None if the catalog does not define a parameter name.
  • Updated pantheon/utils/llm.py so output-token parameters are injected only when the catalog explicitly defines them:
    • Chat path uses the catalog-defined chat parameter name
    • Responses path uses the catalog-defined responses parameter name
    • No implicit fallback guessing is performed when the catalog does not specify a field
  • Removed provider-level fallback hardcoding for output-token parameter names from the runtime path, so catalog metadata is now the only source used for automatic parameter injection.

Tests updated / added:

  • Updated tests/test_agent.py test_tool_timeout
    • The old version depended on a live LLM call and no longer matched the current timeout-to-background behavior
    • The new version tests the tool-dispatch layer directly and verifies timeout adoption into background execution
  • Added coverage in tests/test_provider_adapters.py for:
    • catalog output-token parameter mapping
    • gpt-4o-mini catalog metadata
    • llm.py injecting max_completion_tokens from catalog metadata

Verification:

  • Relevant regression subset passed: 16 passed
  • py_compile passed

Design outcome:

  • Output-token parameter naming now comes from the catalog as the single source of truth
  • Runtime no longer carries provider-level fallback hardcoded parameter-name mappings
  • If the catalog does not define an output-token parameter for a provider / endpoint, Pantheon no longer guesses one during automatic injection
