feat: CC-aligned token optimization with multi-provider cache support #63

Merged

Starlitnightly merged 2 commits into main from claw-v2 on Apr 4, 2026
Conversation

@Starlitnightly (Collaborator)

Summary

Re-integrates the 5-stage CC-aligned token optimization pipeline (from PR #54 and #57) on top of main's litellm-free architecture, with new multi-provider cache support.

What's included

  1. Token optimization pipeline restored and adapted — all 5 stages working without litellm:

    • Stage 1: Tool Result Budget (externalize large outputs to disk)
    • Stage 2: History Snip (drop oldest messages over token budget)
    • Stage 3: Time-Based Microcompact (clear old compactable tool results)
    • Stage 4: Context Collapse (fold consecutive read/search groups)
    • Stage 5: Autocompact (LLM-based summarization, adapted to use acompletion with model_params)
  2. Cache control extended to Qwen — new supports_explicit_cache_control() function enables cache_control: {"type": "ephemeral"} injection for both Anthropic and Qwen models (same format).

  3. Multi-provider benchmark — new scripts/benchmark_multi_provider.py tests token reduction + cache behavior across 6 providers.
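The Anthropic + Qwen gate described in item 2 can be sketched roughly as below. The function name `supports_explicit_cache_control()` and the `cache_control: {"type": "ephemeral"}` marker come from this PR; the model-name matching and message-mutation details are hypothetical and may differ from the actual pantheon implementation.

```python
# Hypothetical sketch; real pantheon model-name matching may differ.
from typing import Any


def supports_explicit_cache_control(model: str) -> bool:
    """Return True for providers that accept explicit cache_control markers."""
    # Anthropic and Qwen both accept the same ephemeral marker format.
    prefixes = ("claude", "anthropic/", "qwen", "dashscope/")
    return model.lower().startswith(prefixes)


def inject_cache_control_markers(model: str, messages: list[dict[str, Any]]) -> None:
    """Mark the final content block as cacheable for supporting providers."""
    if not supports_explicit_cache_control(model) or not messages:
        return
    content = messages[-1].get("content")
    if isinstance(content, list) and content:
        # Attach the ephemeral marker to the last content block in place.
        content[-1]["cache_control"] = {"type": "ephemeral"}
```

Sharing one predicate for both providers avoids scattering `is_anthropic_model`-style checks through the injection path.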

Benchmark Results

Local Token Optimization (provider-independent)

| Optimization     | Medium (128K tok) | Large (512K tok) |
|------------------|-------------------|------------------|
| Micro-Compact    | 79.9%             | 89.9%            |
| History Snip     | 90.0%             | 95.0%            |
| Context Collapse | 100.0%            | 100.0%           |
| Full Pipeline    | 50.0%             | 85.0%            |
| All Combined     | 79.9%             | 90.0%            |

Live API Token Reduction (10x50KB scenario)

| Provider | Model            | Raw     | Optimized | Saved |
|----------|------------------|---------|-----------|-------|
| OpenAI   | gpt-4.1-mini     | 170,827 | 34,287    | 79.9% |
| Gemini   | gemini-2.5-flash | 248,316 | 49,757    | 80.0% |
| Kimi     | kimi-k2.5        | 170,849 | 34,301    | 79.9% |

Cache Hit Comparison

| Provider | Unstable c2 | Stable c2 | Cache Type                              |
|----------|-------------|-----------|-----------------------------------------|
| DeepSeek | 448         | 448       | Auto disk (64-token granularity)        |
| Kimi     | 0           | 588       | Auto prefix (stable ordering required!) |
| OpenAI   | 0           | 0         | Auto prefix (needs longer warmup)       |
| Gemini   | 0           | 0         | Implicit auto (no guarantee)            |

Key finding: stabilize_tool_definitions() is critical for Kimi — unstable tool ordering = 0 cache hits, stable ordering = 588 cached tokens.
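The stabilization that drives the Kimi result can be sketched as below. Only the function name `stabilize_tool_definitions()` comes from this PR; sorting by function name over OpenAI-style tool dicts is an assumed implementation, not the verified pantheon code.

```python
# Hypothetical sketch; the actual stabilize_tool_definitions() may differ.
from typing import Any


def stabilize_tool_definitions(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort tool definitions by name so the serialized request prefix is
    byte-identical across calls, letting auto-prefix caches register hits."""
    return sorted(tools, key=lambda t: t.get("function", {}).get("name", ""))
```

The point is determinism: an auto-prefix cache compares the raw request prefix, so any reordering of tool definitions between calls invalidates the whole cached prefix.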

Cache Strategy Per Provider

| Provider  | Explicit Markers | Strategy                                      |
|-----------|------------------|-----------------------------------------------|
| OpenAI    | No               | Auto prefix — stable tool definitions help    |
| Anthropic | Yes              | cache_control markers injected                |
| Gemini    | No               | Implicit auto — stable prefixes help          |
| DeepSeek  | No               | Auto disk — 64-token granularity              |
| Qwen      | Yes (NEW)        | Same cache_control format as Anthropic        |
| Kimi      | No               | Auto prefix — stabilize_tool_definitions verified |

litellm Adaptation

  • acompletion_litellm → acompletion (unified adapter-based routing)
  • max_tokens/temperature → model_params dict (new acompletion signature)
  • Removed litellm version check in inject_cache_control_markers (native SDK handles it)
  • is_anthropic_model check → supports_explicit_cache_control (Anthropic + Qwen)

Files Changed

  • pantheon/agent.py — AgentRunContext cache fields, build_llm_view_async pipeline, cache marker injection
  • pantheon/utils/token_optimization.py — all 5 stages + litellm-free adaptations + Qwen support
  • pantheon/team/pantheon.py — delegation cache helpers and structured fork context
  • pantheon/utils/llm.py — per-tool limits in process_tool_result
  • pantheon/utils/truncate.py — persisted-output format
  • pantheon/settings.py — max_tool_content_length (50K), max_file_read_chars (500K)
  • pantheon/toolsets/python/python_interpreter.py — shared_executor support
  • scripts/benchmark_multi_provider.py — NEW: 6-provider benchmark
  • scripts/benchmark_token_optimization_live.py — live API optimization test
  • tests/test_token_optimization.py — 65 unit tests (all mocked, no API keys needed)

Test plan

  • pytest tests/test_token_optimization.py — all 65 tests pass (no API keys needed)
  • pytest tests/test_truncate.py — truncation tests pass
  • Run scripts/benchmark_multi_provider.py with API keys to verify live token reduction
  • Run scripts/benchmark_token_optimizations.py for full per-optimization breakdown

- Added new attributes to `AgentRunContext` for improved context management, including `execution_context_id`, `cache_safe_runtime_params`, and `cache_safe_prompt_messages`.
- Integrated cache-safe mechanisms for tool definitions and runtime parameters to streamline agent interactions.
- Updated `process_tool_result` to prioritize per-tool length limits over global settings for better output control.
- Enhanced truncation handling in `truncate.py` to support unified output formatting for large results.
- Adjusted settings for maximum tool content length and file read characters to improve performance and safety.

These changes aim to refine the agent's operational efficiency and ensure better handling of context and output across various tools.

- Add supports_explicit_cache_control() for Anthropic + Qwen models
  (both use identical cache_control: {"type": "ephemeral"} format)
- Update agent.py to use new function for cache marker injection
- Fix autocompact LLM call to pass params via model_params dict
- Add multi-provider benchmark script (OpenAI, Gemini, DeepSeek,
  Qwen, Kimi, Claude) with cache hit comparison
- Rename test script to avoid pytest auto-discovery
@Starlitnightly Starlitnightly merged commit 0335885 into main Apr 4, 2026
8 of 9 checks passed
