feat: CC-aligned token optimization with multi-provider cache support #63
Merged
Starlitnightly merged 2 commits into main (Apr 4, 2026)
Conversation
- Added new attributes to `AgentRunContext` for improved context management, including `execution_context_id`, `cache_safe_runtime_params`, and `cache_safe_prompt_messages`.
- Integrated cache-safe mechanisms for tool definitions and runtime parameters to streamline agent interactions.
- Updated `process_tool_result` to prioritize per-tool length limits over global settings for better output control.
- Enhanced truncation handling in `truncate.py` to support unified output formatting for large results.
- Adjusted settings for maximum tool content length and file read characters to improve performance and safety.

These changes aim to refine the agent's operational efficiency and ensure better handling of context and output across various tools.
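The per-tool override in `process_tool_result` could be sketched as below. This is a hedged illustration, not the actual implementation: the names `PER_TOOL_LIMITS`, `resolve_limit`, and the truncation marker are assumptions for demonstration.

```python
# Illustrative sketch: a per-tool output limit wins over the global
# setting, which only applies as a fallback.
GLOBAL_MAX_TOOL_CONTENT_LENGTH = 50_000   # global default (50K, per settings.py)
PER_TOOL_LIMITS = {"read_file": 500_000}  # hypothetical per-tool override

def resolve_limit(tool_name: str) -> int:
    """Per-tool limit takes priority; fall back to the global setting."""
    return PER_TOOL_LIMITS.get(tool_name, GLOBAL_MAX_TOOL_CONTENT_LENGTH)

def process_tool_result(tool_name: str, output: str) -> str:
    limit = resolve_limit(tool_name)
    if len(output) <= limit:
        return output
    # Truncate and flag so downstream consumers know content was cut.
    return output[:limit] + "\n[... output truncated ...]"
```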
- Add supports_explicit_cache_control() for Anthropic + Qwen models
(both use identical cache_control: {"type": "ephemeral"} format)
- Update agent.py to use new function for cache marker injection
- Fix autocompact LLM call to pass params via model_params dict
- Add multi-provider benchmark script (OpenAI, Gemini, DeepSeek,
Qwen, Kimi, Claude) with cache hit comparison
- Rename test script to avoid pytest auto-discovery
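A minimal sketch of the shared marker format described above. The model-prefix matching here is an assumption for illustration; the real `supports_explicit_cache_control()` may detect providers differently.

```python
# Hedged sketch: Anthropic and Qwen accept the same explicit cache marker,
# so one predicate can gate injection for both.
def supports_explicit_cache_control(model: str) -> bool:
    """Illustrative prefix check; real provider detection may differ."""
    return model.startswith(("claude", "anthropic/", "qwen", "qwen/"))

def inject_cache_marker(message: dict) -> dict:
    """Attach the shared cache_control: {"type": "ephemeral"} marker."""
    marked = dict(message)
    marked["cache_control"] = {"type": "ephemeral"}
    return marked

messages = [{"role": "system", "content": "You are a helpful agent."}]
model = "qwen-max"
if supports_explicit_cache_control(model):
    messages[-1] = inject_cache_marker(messages[-1])
```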
Summary
Re-integrates the 5-stage CC-aligned token optimization pipeline (from PR #54 and #57) on top of main's litellm-free architecture, with new multi-provider cache support.
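The staged design can be pictured as a simple message-transform chain. This is only a structural sketch: the PR text does not name the five stages, so the example stage below is hypothetical.

```python
# Illustrative sketch of a staged pipeline: each stage takes the message
# list and returns a transformed copy, and stages compose in order.
from typing import Callable

Stage = Callable[[list], list]

def run_pipeline(messages: list, stages: list) -> list:
    for stage in stages:
        messages = stage(messages)
    return messages

def drop_empty(messages: list) -> list:
    """Hypothetical example stage: remove messages with no content."""
    return [m for m in messages if m.get("content")]

out = run_pipeline(
    [{"role": "user", "content": "hi"}, {"role": "user", "content": ""}],
    [drop_empty],
)
```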
What's included
- Token optimization pipeline restored and adapted — all 5 stages working without litellm (`acompletion` with `model_params`)
- Cache control extended to Qwen — new `supports_explicit_cache_control()` function enables `cache_control: {"type": "ephemeral"}` injection for both Anthropic and Qwen models (same format).
- Multi-provider benchmark — new `scripts/benchmark_multi_provider.py` tests token reduction + cache behavior across 6 providers.

Benchmark Results
Local Token Optimization (provider-independent)
Live API Token Reduction (10x50KB scenario)
Cache Hit Comparison
Key finding: `stabilize_tool_definitions()` is critical for Kimi — unstable tool ordering = 0 cache hits, stable ordering = 588 cached tokens.

Cache Strategy Per Provider
- Anthropic — `cache_control` markers injected
- Qwen — same `cache_control` format as Anthropic
- Kimi — `stabilize_tool_definitions` verified

litellm Adaptation
- `acompletion_litellm` → `acompletion` (unified adapter-based routing)
- `max_tokens` / `temperature` → `model_params` dict (new `acompletion` signature)
- `inject_cache_control_markers` dropped (native SDK handles it)
- `is_anthropic_model` check → `supports_explicit_cache_control` (Anthropic + Qwen)

Files Changed
- `pantheon/agent.py` — AgentRunContext cache fields, build_llm_view_async pipeline, cache marker injection
- `pantheon/utils/token_optimization.py` — all 5 stages + litellm-free adaptations + Qwen support
- `pantheon/team/pantheon.py` — delegation cache helpers and structured fork context
- `pantheon/utils/llm.py` — per-tool limits in process_tool_result
- `pantheon/utils/truncate.py` — persisted-output format
- `pantheon/settings.py` — max_tool_content_length (50K), max_file_read_chars (500K)
- `pantheon/toolsets/python/python_interpreter.py` — shared_executor support
- `scripts/benchmark_multi_provider.py` — NEW: 6-provider benchmark
- `scripts/benchmark_token_optimization_live.py` — live API optimization test
- `tests/test_token_optimization.py` — 65 unit tests (all mocked, no API keys needed)

Test plan
- `pytest tests/test_token_optimization.py` — all 65 tests pass (no API keys needed)
- `pytest tests/test_truncate.py` — truncation tests pass
- Run `scripts/benchmark_multi_provider.py` with API keys to verify live token reduction
- Run `scripts/benchmark_token_optimizations.py` for full per-optimization breakdown
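To close, a hedged sketch of why the Kimi key finding holds: prefix caches key on the exact serialized request prefix, so tool definitions must serialize identically across calls. Sorting by name is one illustrative stabilization strategy; the real `stabilize_tool_definitions()` may order differently.

```python
import json

def stabilize_tool_definitions(tools: list) -> list:
    """Illustrative: deterministic ordering so repeated requests share a
    byte-identical prefix, which prefix caches can then hit."""
    return sorted(tools, key=lambda t: t.get("name", ""))

# Two requests that list the same tools in different (unstable) orders.
a = [{"name": "search"}, {"name": "read_file"}]
b = [{"name": "read_file"}, {"name": "search"}]

# Without stabilization the serialized prefixes differ (0 cache hits);
# after stabilization both serialize identically, enabling cache hits.
same_prefix = (
    json.dumps(stabilize_tool_definitions(a))
    == json.dumps(stabilize_tool_definitions(b))
)
```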