feat: CC-aligned token optimization with multi-provider cache support #63

Merged

Starlitnightly merged 2 commits into main from claw-v2 on Apr 4, 2026
Conversation

@Starlitnightly (Collaborator)

Summary

Re-integrates the 5-stage CC-aligned token optimization pipeline (from PR #54 and #57) on top of main's litellm-free architecture, with new multi-provider cache support.

What's included

  1. Token optimization pipeline restored and adapted — all 5 stages working without litellm:

    • Stage 1: Tool Result Budget (externalize large outputs to disk)
    • Stage 2: History Snip (drop oldest messages over token budget)
    • Stage 3: Time-Based Microcompact (clear old compactable tool results)
    • Stage 4: Context Collapse (fold consecutive read/search groups)
    • Stage 5: Autocompact (LLM-based summarization, adapted to use acompletion with model_params)
  2. Cache control extended to Qwen — new supports_explicit_cache_control() function enables cache_control: {"type": "ephemeral"} injection for both Anthropic and Qwen models (same format).

  3. Multi-provider benchmark — new scripts/benchmark_multi_provider.py tests token reduction + cache behavior across 6 providers.
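The Anthropic + Qwen gate described in item 2 can be sketched roughly as below. The function name `supports_explicit_cache_control()` and the `cache_control: {"type": "ephemeral"}` marker come from this PR; the model-name matching and message-mutation details are hypothetical and may differ from the actual pantheon implementation.

```python
# Hypothetical sketch; real pantheon model-name matching may differ.
from typing import Any


def supports_explicit_cache_control(model: str) -> bool:
    """Return True for providers that accept explicit cache_control markers."""
    # Anthropic and Qwen both accept the same ephemeral marker format.
    prefixes = ("claude", "anthropic/", "qwen", "dashscope/")
    return model.lower().startswith(prefixes)


def inject_cache_control_markers(model: str, messages: list[dict[str, Any]]) -> None:
    """Mark the final content block as cacheable for supporting providers."""
    if not supports_explicit_cache_control(model) or not messages:
        return
    content = messages[-1].get("content")
    if isinstance(content, list) and content:
        # Attach the ephemeral marker to the last content block in place.
        content[-1]["cache_control"] = {"type": "ephemeral"}
```

Sharing one predicate for both providers avoids scattering `is_anthropic_model`-style checks through the injection path.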

Benchmark Results

Local Token Optimization (provider-independent)

| Optimization     | Medium (128K tok) | Large (512K tok) |
|------------------|-------------------|------------------|
| Micro-Compact    | 79.9%             | 89.9%            |
| History Snip     | 90.0%             | 95.0%            |
| Context Collapse | 100.0%            | 100.0%           |
| Full Pipeline    | 50.0%             | 85.0%            |
| All Combined     | 79.9%             | 90.0%            |

Live API Token Reduction (10x50KB scenario)

| Provider | Model            | Raw     | Optimized | Saved |
|----------|------------------|---------|-----------|-------|
| OpenAI   | gpt-4.1-mini     | 170,827 | 34,287    | 79.9% |
| Gemini   | gemini-2.5-flash | 248,316 | 49,757    | 80.0% |
| Kimi     | kimi-k2.5        | 170,849 | 34,301    | 79.9% |

Cache Hit Comparison

| Provider | Unstable c2 | Stable c2 | Cache Type                              |
|----------|-------------|-----------|-----------------------------------------|
| DeepSeek | 448         | 448       | Auto disk (64-token granularity)        |
| Kimi     | 0           | 588       | Auto prefix (stable ordering required!) |
| OpenAI   | 0           | 0         | Auto prefix (needs longer warmup)       |
| Gemini   | 0           | 0         | Implicit auto (no guarantee)            |

Key finding: stabilize_tool_definitions() is critical for Kimi — unstable tool ordering = 0 cache hits, stable ordering = 588 cached tokens.
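The stabilization that drives the Kimi result can be sketched as below. Only the function name `stabilize_tool_definitions()` comes from this PR; sorting by function name over OpenAI-style tool dicts is an assumed implementation, not the verified pantheon code.

```python
# Hypothetical sketch; the actual stabilize_tool_definitions() may differ.
from typing import Any


def stabilize_tool_definitions(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort tool definitions by name so the serialized request prefix is
    byte-identical across calls, letting auto-prefix caches register hits."""
    return sorted(tools, key=lambda t: t.get("function", {}).get("name", ""))
```

The point is determinism: an auto-prefix cache compares the raw request prefix, so any reordering of tool definitions between calls invalidates the whole cached prefix.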

Cache Strategy Per Provider

| Provider  | Explicit Markers | Strategy                                      |
|-----------|------------------|-----------------------------------------------|
| OpenAI    | No               | Auto prefix — stable tool definitions help    |
| Anthropic | Yes              | cache_control markers injected                |
| Gemini    | No               | Implicit auto — stable prefixes help          |
| DeepSeek  | No               | Auto disk — 64-token granularity              |
| Qwen      | Yes (NEW)        | Same cache_control format as Anthropic        |
| Kimi      | No               | Auto prefix — stabilize_tool_definitions verified |

litellm Adaptation

  • acompletion_litellm → acompletion (unified adapter-based routing)
  • max_tokens/temperature → model_params dict (new acompletion signature)
  • Removed litellm version check in inject_cache_control_markers (native SDK handles it)
  • is_anthropic_model check → supports_explicit_cache_control (Anthropic + Qwen)

Files Changed

  • pantheon/agent.py — AgentRunContext cache fields, build_llm_view_async pipeline, cache marker injection
  • pantheon/utils/token_optimization.py — all 5 stages + litellm-free adaptations + Qwen support
  • pantheon/team/pantheon.py — delegation cache helpers and structured fork context
  • pantheon/utils/llm.py — per-tool limits in process_tool_result
  • pantheon/utils/truncate.py — persisted-output format
  • pantheon/settings.py — max_tool_content_length (50K), max_file_read_chars (500K)
  • pantheon/toolsets/python/python_interpreter.py — shared_executor support
  • scripts/benchmark_multi_provider.py — NEW: 6-provider benchmark
  • scripts/benchmark_token_optimization_live.py — live API optimization test
  • tests/test_token_optimization.py — 65 unit tests (all mocked, no API keys needed)

Test plan

  • pytest tests/test_token_optimization.py — all 65 tests pass (no API keys needed)
  • pytest tests/test_truncate.py — truncation tests pass
  • Run scripts/benchmark_multi_provider.py with API keys to verify live token reduction
  • Run scripts/benchmark_token_optimizations.py for full per-optimization breakdown

- Added new attributes to `AgentRunContext` for improved context management, including `execution_context_id`, `cache_safe_runtime_params`, and `cache_safe_prompt_messages`.
- Integrated cache-safe mechanisms for tool definitions and runtime parameters to streamline agent interactions.
- Updated `process_tool_result` to prioritize per-tool length limits over global settings for better output control.
- Enhanced truncation handling in `truncate.py` to support unified output formatting for large results.
- Adjusted settings for maximum tool content length and file read characters to improve performance and safety.

These changes aim to refine the agent's operational efficiency and ensure better handling of context and output across various tools.

- Add supports_explicit_cache_control() for Anthropic + Qwen models
  (both use identical cache_control: {"type": "ephemeral"} format)
- Update agent.py to use new function for cache marker injection
- Fix autocompact LLM call to pass params via model_params dict
- Add multi-provider benchmark script (OpenAI, Gemini, DeepSeek,
  Qwen, Kimi, Claude) with cache hit comparison
- Rename test script to avoid pytest auto-discovery
@Starlitnightly Starlitnightly merged commit 0335885 into main Apr 4, 2026
8 of 9 checks passed
