Conversation

@esafwan esafwan commented Dec 31, 2025

PR: Implement LiteLLM Prompt Caching

Overview

This PR integrates LiteLLM's prompt caching capabilities into HUF. This feature allows agents to cache and reuse prompt prefixes (system instructions, tool definitions, conversation history), significantly reducing inference costs and latency for supported models (e.g., Anthropic Claude 3.5 Sonnet, GPT-4o).

Changes

1. Data Model & Validation (feat(core))

  • Agent DocType:
    • Added enable_prompt_caching field.
    • Added validation logic in Agent.validate() that checks whether the selected provider/model supports prompt caching via litellm.supports_prompt_caching(); a warning is shown if it does not (see the sketch after this list).
  • Agent Run DocType:
    • Added cached_tokens (Int) field to track token savings per run.
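A minimal sketch of the validation hook described above, assuming a standard Frappe DocType controller. Only the `enable_prompt_caching` field and `litellm.supports_prompt_caching()` are taken from this PR; the `model` field name and the warning text are illustrative.

```python
import frappe
import litellm
from frappe.model.document import Document


class Agent(Document):
    def validate(self):
        # Warn (non-blocking) if caching is enabled for a model that cannot use it.
        if self.enable_prompt_caching and not litellm.supports_prompt_caching(model=self.model):
            frappe.msgprint(
                f"Model {self.model} does not support prompt caching; the setting will have no effect.",
                indicator="orange",
                alert=True,
            )
```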

2. AI Provider Logic (feat(ai))

  • Agent Integration (agent_integration.py):
    • Propagates the enable_prompt_caching setting from the Agent document to the execution context.
  • LiteLLM Provider (litellm.py):
    • Updated run() and run_stream() to send the system prompt in the content-array format with cache_control markers when caching is enabled (see the sketch after this list).
    • Implemented logic to extract prompt_tokens_details.cached_tokens from the LLM response usage data.
    • Correctly maps cached tokens to the Agent Run document.
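A hedged sketch of the provider-side flow, assuming the LiteLLM content-array caching format. Helper names (`build_messages`, `record_cached_tokens`, `run`) and attributes such as `agent.system_prompt` are illustrative, not the exact identifiers used in litellm.py.

```python
import litellm


def build_messages(system_prompt, history, enable_prompt_caching):
    """Return messages, marking the stable system prefix as cacheable when enabled."""
    if enable_prompt_caching:
        system_msg = {
            "role": "system",
            "content": [{
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }],
        }
    else:
        system_msg = {"role": "system", "content": system_prompt}
    return [system_msg, *history]


def record_cached_tokens(response, run_doc):
    """Copy the provider-reported cached token count onto the Agent Run document."""
    usage = getattr(response, "usage", None)
    details = getattr(usage, "prompt_tokens_details", None) if usage else None
    run_doc.cached_tokens = getattr(details, "cached_tokens", 0) or 0


def run(agent, history, run_doc):
    messages = build_messages(agent.system_prompt, history, agent.enable_prompt_caching)
    response = litellm.completion(model=agent.model, messages=messages)
    record_cached_tokens(response, run_doc)
    return response
```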

3. Documentation (docs)

  • Added PROMPT_CACHING_IMPLEMENTATION.md: A comprehensive guide covering:
    • Architecture explanation.
    • Supported providers (Anthropic, OpenAI, DeepSeek, etc.).
    • Configuration steps.
    • Verification and troubleshooting.

Benefits

  • Cost Efficiency: Cached tokens are typically ~90% cheaper (e.g., Anthropic charges 10% of the input price for cached reads); a worked example follows this list.
  • Performance: Time-to-First-Token (TTFT) is reduced for requests hitting the cache.
  • Visibility: Users can now see exactly how many tokens were cached in the Agent Run logs.
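For illustration only, a back-of-the-envelope savings calculation under assumed pricing. The $3.00/MTok input rate is a placeholder, not a figure quoted in this PR; only the "cache reads at 10% of the input price" ratio comes from the text above.

```python
# Hypothetical pricing: $3.00 per 1M input tokens, cache reads at 10% of that.
input_price_per_mtok = 3.00
cache_read_price_per_mtok = 0.30

prompt_tokens = 12_000   # total prompt tokens in the request
cached_tokens = 10_000   # portion served from the prompt cache

uncached_cost = prompt_tokens * input_price_per_mtok / 1_000_000
cached_cost = ((prompt_tokens - cached_tokens) * input_price_per_mtok
               + cached_tokens * cache_read_price_per_mtok) / 1_000_000

print(f"uncached: ${uncached_cost:.4f}  with caching: ${cached_cost:.4f}")
# uncached: $0.0360  with caching: $0.0090 (75% cheaper on this request)
```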

How to Test

Prerequisites

  • Ensure litellm is updated to a version supporting prompt caching (e.g., >=1.40.0); a quick console check is sketched after this list.
  • Run bench migrate to apply schema changes.
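As an optional sanity check, you can confirm from a bench/Python console that the installed litellm build recognizes the model as cache-capable; the model string below is just an example.

```python
import litellm

# Should print True on a litellm version that supports prompt caching metadata.
print(litellm.supports_prompt_caching(model="anthropic/claude-3-5-sonnet-20240620"))
```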

Steps

  1. Configure Agent:
    • Open an Agent document.
    • Select a model that supports caching (e.g., anthropic/claude-3-5-sonnet-20240620).
    • Check the new "Enable Prompt Caching" box.
    • Save. (Verify no warnings appear).
  2. Run 1 (Cache Creation):
    • Send a message to the agent.
    • Check the resulting Agent Run.
    • Cached Tokens should be 0 (or greater if using a shared system prompt that was already cached globally).
  3. Run 2 (Cache Read):
    • Send another message to the same agent (or same agent config).
    • Check the Agent Run.
    • Cached Tokens should be > 0 (a scripted version of this check is sketched after these steps).
  4. Validation Check:
    • Switch the model to one that does not support caching (e.g., an older openai/gpt-3.5-turbo variant).
    • Try to save. You should see a warning.
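Steps 2–3 can also be reproduced outside HUF from a bench/Python console to confirm that LiteLLM itself reports cache hits. The model string and system prompt below are placeholders, and providers only cache prefixes above a minimum size (roughly 1024 tokens for Claude 3.5 Sonnet).

```python
import litellm

# Placeholder; must exceed the provider's minimum cacheable prefix size.
long_system_prompt = "Long, stable system instructions for the agent. " * 300

messages = [
    {"role": "system",
     "content": [{"type": "text",
                  "text": long_system_prompt,
                  "cache_control": {"type": "ephemeral"}}]},
    {"role": "user", "content": "Hello"},
]

for attempt in (1, 2):
    resp = litellm.completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)
    details = getattr(resp.usage, "prompt_tokens_details", None)
    print(f"run {attempt}: cached_tokens =", getattr(details, "cached_tokens", 0))
# Expected: 0 on the first run (cache write), > 0 on the second (cache read).
```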

Commits

  • feat(core): Add prompt caching schema & validation
  • feat(ai): Implement prompt caching execution logic
  • docs: Add Prompt Caching implementation guide

esafwan and others added 7 commits December 31, 2025 20:07
- Update Agent DocType to include enable_prompt_caching field
- Update Agent Run DocType to track cached tokens usage
- Add validation logic to Agent controller to check caching support
- Update agent integration to pass caching config to provider
- Implement caching logic in LiteLLM provider
- Add logic to track and record cached write tokens in run logs
- Document prompt caching architecture and configuration
- Explain validation logic and supported providers
- Provide usage examples and implementation details
Adds prompt caching configuration to agents and tracks cached tokens. This reduces costs by reusing cached prompt content.

Co-authored-by: esafwan <esafwan@gmail.com>
- Gated Caching: Modified litellm.py to only use content array formats and
  cache_control headers if 'enable_prompt_caching' is checked in the Agent.
- Token Tracking: Fixed issue where tokens, costs, and cache showed as 0
  by correctly extracting attributes from the LiteLLM Usage object.
- Stability: Resolved SyntaxError in litellm.py by fixing incomplete
  try-except blocks.
@esafwan esafwan force-pushed the feature/litellm-prompt-caching-implementation branch from d3ae41b to 33a99b9 on January 5, 2026 12:45
@Sanjusha-tridz Sanjusha-tridz marked this pull request as ready for review January 5, 2026 13:08
@Sanjusha-tridz Sanjusha-tridz merged commit c02ce5a into develop Jan 5, 2026
1 of 3 checks passed
@Sanjusha-tridz Sanjusha-tridz deleted the feature/litellm-prompt-caching-implementation branch January 6, 2026 04:46