fix(file_manager): output-token truncation guards + append_file tool (clean cherry-pick from #55)#64

Open
Starlitnightly wants to merge 13 commits into main from fix/file-manager-guards

Conversation

@Starlitnightly (Collaborator) commented Apr 4, 2026

Summary

Clean cherry-pick of the feature commits from PR #55, without the problematic revert history in the dev branch.

Problem with PR #55

PR #55 (dev → main) carried revert history from dev that would erroneously delete files restored by PR #63:

  • pantheon/utils/token_optimization.py (-1,776 lines)
  • tests/test_token_optimization.py (-1,498 lines)
  • scripts/benchmark_prompt_cache.py (-385 lines)
  • scripts/benchmark_token_optimizations.py (-535 lines)
  • Plus rollback of token optimization integration in agent.py, pantheon.py, truncate.py

This happened because dev merged main's revert commit (3f1343a) via PR #61, so its diff against current main shows deletions of files that PR #63 just restored.

Solution

Cherry-picked only the 4 actual feature commits from dev, skipping all merge/revert commits:

  1. c6e57e4 — fix(prompts): add failure recovery protocol and scientific writing gate
  2. e3e073f — fix(file_manager): add output-token truncation guards and append_file tool
  3. c37a3c7 — test: add comprehensive tests for output-token truncation guards
  4. 7920a72 — fix(llm): set max_tokens to model's max output + raise tool guard thresholds

Changes

  • pantheon/toolsets/file/file_manager.py — output-token truncation guards, append_file tool, size formatting
  • pantheon/utils/llm.py — set max_tokens from model's max_output_tokens
  • pantheon/factory/templates/prompts/delegation.md — failure recovery protocol
  • pantheon/factory/templates/teams/default.md — scientific writing gate
  • tests/test_file_manager.py — comprehensive truncation guard tests

What's NOT included (intentionally)

All token optimization deletions from the dev branch revert history — those files are preserved intact from PR #63.

Test plan

hazelian0619 and others added 10 commits April 3, 2026 21:57
Problem A (partial): Add MANDATORY scientific writing gate to default.md —
Leader must delegate to Researcher before writing any domain paper. Clarify
Scientific Illustrator scope (schematic/pathway diagrams only, not data plots).

Problem C: Add Failure Recovery section to delegation.md — three-tier ladder
for file write failures (Two-Phase Write Protocol → format downgrade → inline)
and sub-agent failures (narrow retry → self-execute → partial output). Hard
rule: never terminate without producing at least one artifact.

Validated by experiment (2026-03-30):
- Case 3 (SSR1/GWAS): Leader called 3x parallel Researcher before any content;
  Researchers produced 978 lines across 3 reports using Two-Phase Write Protocol
- Case 0 (EC paper): Leader called 2x parallel Researcher; BibTeX built to 397
  lines via append_file batches (vs. previous silent truncation at char 88);
  PDF artifact (117KB) delivered despite E2BIG and relay-API update_file errors

New bugs discovered (tracked separately):
- Relay API truncates update_file tool call args mid-generation (high severity)
- think tool infinite loop at ~90K token context (medium severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(file_manager): add output-token truncation guards and append_file tool

P0 bug: when LLM generates large files (LaTeX papers, BibTeX) in a single
write_file/update_file call, the relay API truncates the output stream mid-JSON,
causing 'Unterminated string' parse errors and silent data loss.

Root cause: LLM output token limit is separate from context window. File content
in tool call parameters must be generated as LLM output, hitting max_tokens before
the JSON closes. LaTeX/BibTeX content with escape chars inflates token count ~1.5x.

Changes:
- write_file: hard reject content > 12,000 chars; docstring teaches Two-Phase
  Write Protocol (scaffold first, fill by section, append for lists/bib)
- append_file: new tool for chunked appending; 6,000 char limit; requires file
  to exist first; primary use case is BibTeX batches (<=10 entries per call)
- update_file: hard reject new_string > 8,000 chars with guidance to split
  section into smaller semantic units

Validated against 20-case baseline (15% success rate before fix):
- Case 1 (LaTeX review paper, previously FAIL): now generates full PDF with
  44 references via append_file batches — confirmed in controlled re-run
- Agent proactively adopted Two-Phase protocol after reading docstring (0
  content_too_large rejections; protocol was followed before guard triggered)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests for PR #52 file manager changes:
- write_file: reject >12K, accept at limit, file not created on reject
- append_file: basic append, multi-batch (BibTeX pattern), reject
  nonexistent file, reject >6K, accept at limit
- update_file: reject new_string >8K, accept at limit, original unchanged
- Two-Phase Write Protocol end-to-end: scaffold → section fill → append

14/14 file manager tests passing.
fix(llm): set max_tokens to model's max output + raise tool guard thresholds

Root cause fix: acompletion_litellm() never passed max_tokens (output)
to litellm. Anthropic models default to 4096 output tokens, causing
tool_use JSON to be truncated mid-generation when the model writes
large file content.

Fix: auto-detect model's max_output_tokens via litellm.get_model_info()
and set it as kwargs["max_tokens"] when not already specified by
model_params.
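The merge logic of the fix can be sketched in isolation. In the real code the capacity lookup goes through `litellm.get_model_info()` (later replaced by `provider_registry.get_model_info`); here it is stubbed as a plain dict so the inject-only-when-unset behavior is visible. Model names and token values are illustrative.

```python
# Sketch: fill max_tokens from the model's known output capacity,
# but never override a value the caller set explicitly.
MODEL_INFO = {  # stand-in for litellm/provider_registry model info
    "claude-sonnet-4.6": {"max_output_tokens": 64000},
    "gpt-4o-mini": {"max_output_tokens": 16384},
}

def resolve_max_tokens(model: str, model_params: dict) -> dict:
    params = dict(model_params)
    info = MODEL_INFO.get(model, {})
    if "max_tokens" not in params and info.get("max_output_tokens"):
        params["max_tokens"] = info["max_output_tokens"]
    return params
```

Without this, providers fall back to their own defaults (e.g. 4096 output tokens for Anthropic models), which is what truncated the tool_use JSON mid-generation.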

With the root cause fixed, the tool-level size guards from PR #52 are
now defense-in-depth (not the primary fix). Raised thresholds to match
actual output capacity:
- write_file: 12K → 40K chars
- update_file: 8K → 30K chars
- append_file: 6K → 20K chars

Thresholds moved to class-level constants (WRITE_FILE_MAX_CHARS, etc.)
for easy per-deployment tuning. Tests updated to reference constants
instead of hardcoded values.

14/14 file manager tests passing.
Replace litellm.utils.get_model_info with provider_registry.get_model_info
and fix kwargs → model_params for the new acompletion signature.

Add tests:
- test_max_tokens_auto_set: verify catalog has correct max_output_tokens
- test_max_tokens_live_openai: live API test (skipped without OPENAI_API_KEY)
…ite_file

Root cause of output truncation was missing max_tokens (fixed in 7920a72),
not tool-level size limits. The guards caused worse problems for code
generation (repeated rejections, lost content, wasted tokens).

Changes:
- Remove WRITE_FILE_MAX_CHARS/APPEND_FILE_MAX_CHARS/UPDATE_FILE_MAX_CHARS
- Remove content_too_large guards from write_file/update_file
- Merge append_file into write_file(append=True)
- Update tests: verify large content writes succeed (100K+ chars)
- Add live integration test (paper + code scenarios)

Live test results (gpt-4.1-mini, no guards):
- LaTeX paper: 9,208 chars, 1 write_file call, 0 rejections
- Python code: 13,754 chars (314 lines, 4 classes), 1 call, 0 rejections

Previously with guards (2K limit test):
- Python code: 24 calls, 8 rejections, 3/4 classes MISSING
Eliminated the detailed failure-recovery guidelines for tool and sub-agent errors from delegation.md to streamline the document. This simplifies the prompts and keeps the focus on essential delegation instructions.
@Starlitnightly Starlitnightly requested a review from zqbake April 4, 2026 05:59
- Added new models to the LLM catalog, including "anthropic" and "google-auth" with their respective dependencies and configurations.
- Implemented dynamic handling of output token parameters across different providers to ensure compatibility and prevent truncation issues.
- Updated the LLM response handling to utilize model-specific output token limits, improving the robustness of API calls.
- Enhanced tests to verify the correct application of output token parameters and model information retrieval.

This update aims to streamline interactions with various LLM providers and improve overall system reliability.
- Updated the logic in acompletion_responses and acompletion functions to ensure that the output token parameter is validated alongside the model's max output tokens.
- Modified the get_output_token_param function to return None when no valid parameter is found, enhancing type safety and clarity in handling token parameters.

These changes aim to improve the robustness of token management across different LLM providers.
- Eliminated the output token recovery functions and related logic from the OpenAIAdapter class, streamlining the codebase.
- Updated the acompletion method to remove unnecessary recovery attempts, enhancing clarity and maintainability.

These changes focus on simplifying the adapter's implementation while ensuring it remains effective for handling OpenAI API interactions.
@Starlitnightly (Collaborator, Author) commented:

This change fixes failures caused by provider / endpoint-specific output-token parameter differences, and makes the catalog the single source of truth for those parameter names.

What changed:

  • Added explicit output-token parameter metadata to pantheon/utils/llm_catalog.json:
    • chat_output_token_param
    • responses_output_token_param
  • Added catalog metadata for openai/gpt-4o-mini, including max_output_tokens=16384, to avoid falling back to the old default of 32000 and triggering completion-token limit errors.
  • Added get_output_token_param() in pantheon/utils/provider_registry.py. This helper now reads only from the catalog and returns None if the catalog does not define a parameter name.
  • Updated pantheon/utils/llm.py so output-token parameters are injected only when the catalog explicitly defines them:
    • Chat path uses the catalog-defined chat parameter name
    • Responses path uses the catalog-defined responses parameter name
    • No implicit fallback guessing is performed when the catalog does not specify a field
  • Removed provider-level fallback hardcoding for output-token parameter names from the runtime path, so catalog metadata is now the only source used for automatic parameter injection.

Tests updated / added:

  • Updated tests/test_agent.py test_tool_timeout
    • The old version depended on a live LLM call and no longer matched the current timeout-to-background behavior
    • The new version tests the tool-dispatch layer directly and verifies timeout adoption into background execution
  • Added coverage in tests/test_provider_adapters.py for:
    • catalog output-token parameter mapping
    • gpt-4o-mini catalog metadata
    • llm.py injecting max_completion_tokens from catalog metadata

Verification:

  • Relevant regression subset passed: 16 passed
  • py_compile passed

Design outcome:

  • Output-token parameter naming now comes from the catalog as the single source of truth
  • Runtime no longer carries provider-level fallback hardcoded parameter-name mappings
  • If the catalog does not define an output-token parameter for a provider / endpoint, Pantheon no longer guesses one during automatic injection
