fix(file_manager): output-token truncation guards + append_file tool (clean cherry-pick from #55) #64
Starlitnightly wants to merge 13 commits into main from
Conversation
Problem A (partial): Add a MANDATORY scientific writing gate to default.md — the Leader must delegate to a Researcher before writing any domain paper. Clarify the Scientific Illustrator's scope (schematic/pathway diagrams only, not data plots).

Problem C: Add a Failure Recovery section to delegation.md — a three-tier ladder for file write failures (Two-Phase Write Protocol → format downgrade → inline) and for sub-agent failures (narrow retry → self-execute → partial output). Hard rule: never terminate without producing at least one artifact.

Validated by experiment (2026-03-30):
- Case 3 (SSR1/GWAS): Leader called 3x parallel Researcher before writing any content; Researchers produced 978 lines across 3 reports using the Two-Phase Write Protocol
- Case 0 (EC paper): Leader called 2x parallel Researcher; BibTeX built to 397 lines via append_file batches (vs. previous silent truncation at char 88); PDF artifact (117 KB) delivered despite E2BIG and relay-API update_file errors

New bugs discovered (tracked separately):
- Relay API truncates update_file tool-call args mid-generation (high severity)
- think tool infinite loop at ~90K-token context (medium severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… tool

P0 bug: when the LLM generates large files (LaTeX papers, BibTeX) in a single write_file/update_file call, the relay API truncates the output stream mid-JSON, causing "Unterminated string" parse errors and silent data loss.

Root cause: the LLM's output-token limit is separate from its context window. File content in tool-call parameters must be generated as LLM output, so it hits max_tokens before the JSON closes. LaTeX/BibTeX content with escape characters inflates the token count ~1.5x.

Changes:
- write_file: hard-reject content > 12,000 chars; docstring teaches the Two-Phase Write Protocol (scaffold first, fill by section, append for lists/bibliographies)
- append_file: new tool for chunked appending; 6,000-char limit; requires the file to exist first; primary use case is BibTeX batches (<=10 entries per call)
- update_file: hard-reject new_string > 8,000 chars, with guidance to split the section into smaller semantic units

Validated against the 20-case baseline (15% success rate before the fix):
- Case 1 (LaTeX review paper, previously FAIL): now generates the full PDF with 44 references via append_file batches — confirmed in a controlled re-run
- The agent proactively adopted the Two-Phase protocol after reading the docstring (0 content_too_large rejections; the protocol was followed before the guard triggered)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
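A minimal sketch of the char-count guard described above — the function name, error message, and rejection mechanism are illustrative, not the actual pantheon implementation:

```python
WRITE_FILE_MAX_CHARS = 12_000  # limit stated in the commit message

def guard_write_file(content: str) -> None:
    """Reject oversized content before any bytes reach the filesystem,
    so a truncated tool call cannot leave a half-written file behind."""
    if len(content) > WRITE_FILE_MAX_CHARS:
        raise ValueError(
            f"content_too_large: {len(content):,} chars exceeds the "
            f"{WRITE_FILE_MAX_CHARS:,}-char limit. Use the Two-Phase Write "
            "Protocol: write a scaffold first, fill sections with "
            "update_file, and append lists/bibliographies with append_file."
        )
```

The guidance in the error message matters as much as the rejection itself: it is what steers the model toward the Two-Phase protocol on the next attempt.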
Tests for PR #52 file manager changes:
- write_file: reject >12K, accept at the limit, file not created on reject
- append_file: basic append, multi-batch (BibTeX pattern), reject nonexistent file, reject >6K, accept at the limit
- update_file: reject new_string >8K, accept at the limit, original unchanged
- Two-Phase Write Protocol end-to-end: scaffold → section fill → append

14/14 file manager tests passing.
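The multi-batch BibTeX pattern those tests exercise can be sketched end-to-end; `append_file` here is a self-contained stand-in with the behavior the tests assert, not the real tool:

```python
import tempfile
from pathlib import Path

APPEND_FILE_MAX_CHARS = 6_000  # limit from the PR description

def append_file(path: str, content: str) -> None:
    """Stand-in append tool: the target must already exist (so appends
    never silently create files) and each chunk is size-capped."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError("append_file requires an existing file")
    if len(content) > APPEND_FILE_MAX_CHARS:
        raise ValueError("content_too_large")
    with p.open("a", encoding="utf-8") as f:
        f.write(content)

def two_phase_bibtex(workdir: str) -> str:
    """Phase 1: write the scaffold; phase 2: append entries in batches."""
    bib = Path(workdir) / "refs.bib"
    bib.write_text("% references scaffold\n", encoding="utf-8")
    for batch in ("@article{a,}\n", "@article{b,}\n"):
        append_file(str(bib), batch)
    return bib.read_text(encoding="utf-8")
```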
…esholds

Root-cause fix: acompletion_litellm() never passed max_tokens (output) to litellm. Anthropic models default to 4096 output tokens, causing tool_use JSON to be truncated mid-generation when the model writes large file content.

Fix: auto-detect the model's max_output_tokens via litellm.get_model_info() and set it as kwargs["max_tokens"] when not already specified by model_params.

With the root cause fixed, the tool-level size guards from PR #52 are now defense-in-depth (not the primary fix). Raised thresholds to match actual output capacity:
- write_file: 12K → 40K chars
- update_file: 8K → 30K chars
- append_file: 6K → 20K chars

Thresholds moved to class-level constants (WRITE_FILE_MAX_CHARS, etc.) for easy per-deployment tuning. Tests updated to reference the constants instead of hardcoded values. 14/14 file manager tests passing.
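The auto-detection can be sketched as follows. The real fix queries litellm.get_model_info(); to keep this sketch self-contained, a tiny stub catalog stands in for that call, and the capacity numbers are illustrative, not quoted from the commit:

```python
# Stub for litellm.get_model_info(); values are illustrative only.
_MODEL_INFO = {
    "claude-sonnet-4": {"max_output_tokens": 64_000},
    "gpt-4.1-mini": {"max_output_tokens": 32_768},
}

def with_max_tokens(model: str, model_params: dict) -> dict:
    """Return completion kwargs with max_tokens filled from the model's
    output capacity, unless the caller already specified it."""
    kwargs = dict(model_params)
    if "max_tokens" not in kwargs:
        info = _MODEL_INFO.get(model, {})
        if info.get("max_output_tokens"):
            kwargs["max_tokens"] = info["max_output_tokens"]
    return kwargs
```

Caller-specified values win, so per-deployment overrides in model_params still work; unknown models simply get no limit, falling back to the provider default.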
Replace litellm.utils.get_model_info with provider_registry.get_model_info and fix kwargs → model_params for the new acompletion signature.

Add tests:
- test_max_tokens_auto_set: verify the catalog has the correct max_output_tokens
- test_max_tokens_live_openai: live API test (skipped without OPENAI_API_KEY)
…ite_file

The root cause of output truncation was the missing max_tokens (fixed in 7920a72), not tool-level size limits. The guards caused worse problems for code generation (repeated rejections, lost content, wasted tokens).

Changes:
- Remove WRITE_FILE_MAX_CHARS/APPEND_FILE_MAX_CHARS/UPDATE_FILE_MAX_CHARS
- Remove content_too_large guards from write_file/update_file
- Merge append_file into write_file(append=True)
- Update tests: verify large content writes succeed (100K+ chars)
- Add a live integration test (paper + code scenarios)

Live test results (gpt-4.1-mini, no guards):
- LaTeX paper: 9,208 chars, 1 write_file call, 0 rejections
- Python code: 13,754 chars (314 lines, 4 classes), 1 call, 0 rejections

Previously with guards (2K-limit test):
- Python code: 24 calls, 8 rejections, 3/4 classes MISSING
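A sketch of the merged, guard-free shape of the tool. The existence check on append is an assumption carried over from the old append_file docstring; the real merged signature and semantics may differ:

```python
from pathlib import Path

def write_file(path: str, content: str, append: bool = False) -> int:
    """Guard-free write/append. Appending to a missing path fails loudly
    (assumed behavior, inherited from the removed append_file tool) so a
    typo'd path cannot silently create a stray file. Returns chars written."""
    p = Path(path)
    if append and not p.exists():
        raise FileNotFoundError(f"{path} must exist before appending")
    mode = "a" if append else "w"
    with p.open(mode, encoding="utf-8") as f:
        f.write(content)
    return len(content)
```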
Removed the detailed failure-recovery guidelines for tool and sub-agent errors from delegation.md to streamline the document. This simplifies the prompts and keeps the focus on essential delegation instructions.
- Added new models to the LLM catalog, including "anthropic" and "google-auth", with their respective dependencies and configurations.
- Implemented dynamic handling of output-token parameters across providers to ensure compatibility and prevent truncation issues.
- Updated the LLM response handling to use model-specific output-token limits, improving the robustness of API calls.
- Enhanced tests to verify the correct application of output-token parameters and model-information retrieval.

This update streamlines interactions with the various LLM providers and improves overall system reliability.
- Updated the logic in acompletion_responses and acompletion to validate the output-token parameter against the model's max output tokens.
- Modified get_output_token_param to return None when no valid parameter is found, improving type safety and clarity in token-parameter handling.

These changes improve the robustness of token management across the different LLM providers.
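The provider differences these commits deal with can be sketched as a small lookup plus a clamp. The mapping below is illustrative (the real one lives in the provider registry/catalog), and the parameter names are the commonly documented ones for each API style:

```python
from typing import Optional

# Illustrative provider -> output-token parameter names; the real mapping
# is catalog-driven, not hardcoded like this.
_OUTPUT_TOKEN_PARAM = {
    "anthropic": "max_tokens",
    "openai-responses": "max_output_tokens",
    "openai-chat": "max_completion_tokens",
}

def get_output_token_param(provider: str) -> Optional[str]:
    """Return the provider-specific parameter name, or None when no
    valid parameter is known for this provider."""
    return _OUTPUT_TOKEN_PARAM.get(provider)

def output_limit_kwargs(provider: str, requested: int, model_max: int) -> dict:
    """Clamp the requested limit to the model's max output tokens and
    attach it under the provider's parameter name; empty dict if unknown."""
    param = get_output_token_param(provider)
    if param is None:
        return {}
    return {param: min(requested, model_max)}
```

Returning None (and an empty kwargs dict) rather than guessing a parameter name is what keeps an unrecognized provider from receiving an argument its endpoint would reject.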
- Removed the output-token recovery functions and related logic from the OpenAIAdapter class, streamlining the codebase.
- Updated the acompletion method to drop the unnecessary recovery attempts, improving clarity and maintainability.

These changes simplify the adapter's implementation while keeping it effective for OpenAI API interactions.
This change fixes failures caused by provider / endpoint-specific output-token parameter differences, and makes the catalog the single source of truth for those parameter names. What changed:
Tests updated / added:
Verification:
Design outcome:
Summary
Clean cherry-pick of the feature commits from PR #55, without the problematic revert history in the dev branch.

Problem with PR #55

PR #55 (dev → main) carried revert history from dev that would erroneously delete files restored by PR #63:
- pantheon/utils/token_optimization.py (-1,776 lines)
- tests/test_token_optimization.py (-1,498 lines)
- scripts/benchmark_prompt_cache.py (-385 lines)
- scripts/benchmark_token_optimizations.py (-535 lines)
- agent.py, pantheon.py, truncate.py

This happened because dev merged main's revert commit (3f1343a) via PR #61, so its diff against current main shows deletions of files that PR #63 just restored.

Solution

Cherry-picked only the 4 actual feature commits from dev, skipping all merge/revert commits:
- c6e57e4 — fix(prompts): add failure recovery protocol and scientific writing gate
- e3e073f — fix(file_manager): add output-token truncation guards and append_file tool
- c37a3c7 — test: add comprehensive tests for output-token truncation guards
- 7920a72 — fix(llm): set max_tokens to model's max output + raise tool guard thresholds

Changes
- pantheon/toolsets/file/file_manager.py — output-token truncation guards, append_file tool, size formatting
- pantheon/utils/llm.py — set max_tokens from the model's max_output_tokens
- pantheon/factory/templates/prompts/delegation.md — failure recovery protocol
- pantheon/factory/templates/teams/default.md — scientific writing gate
- tests/test_file_manager.py — comprehensive truncation guard tests

What's NOT included (intentionally)

All token-optimization deletions from the dev branch revert history — those files are preserved intact from PR #63.

Test plan

pytest tests/test_file_manager.py — new truncation guard tests pass