# Claude-to-Chutes Proxy

## Overview

- Translates Anthropic Claude `v1/messages` requests to a Chutes LLM backend (OpenAI-compatible `/v1/chat/completions`).
- Converts Chutes/OpenAI-style responses back to Anthropic-compatible responses.
- Supports non-streaming and streaming (SSE) for text chats.
- Supports tools (see the example below):
  - Non-streaming: Anthropic `tools` + `tool_use`/`tool_result` ↔ OpenAI `tool_calls`/`tool` messages.
  - Streaming: `tool_use` bridging is supported (streams `input_json_delta` for function arguments).
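As an illustration of the tool bridging, this hedged sketch sends an Anthropic-style `tools` request through a locally running proxy; the tool definition and API key are placeholders, not part of this project:

```bash
# Illustrative only: get_weather and YOUR_KEY are placeholders.
curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "claude-3.5-sonnet",
    "max_tokens": 256,
    "tools": [{
      "name": "get_weather",
      "description": "Return the current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Weather in Paris?"}]}]
  }'
```

A `tool_use` block in the response corresponds to an OpenAI `tool_calls` entry upstream; the `tool_result` block you send on the next turn is converted to an OpenAI `tool` message.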
## Performance Notes (Important)

- Uses a shared HTTP client with connection pooling and optional HTTP/2 to reduce handshake latency.
- Model discovery (`/v1/models`) is persisted to disk by default, so subsequent boots do not re-fetch.
- Streaming function-call parsing is optional to reduce per-chunk CPU; see `ENABLE_STREAM_TOOL_PARSER` below.
- Automatic context compaction keeps Anthropic histories within the configurable token budget and surfaces live token usage via response headers.
## Quickstart

- Requirements: Python 3.10+ recommended (works with 3.13), `uvicorn` and `fastapi`.
- Env vars:
  - `CHUTES_BASE_URL`: Base URL of your Chutes/OpenAI backend (e.g. `https://llm.chutes.ai`).
  - `CHUTES_API_KEY` (optional): If your backend requires Bearer auth. If not set, the proxy forwards the inbound `x-api-key` or `Authorization` header to upstream.
  - `MODEL_MAP` (optional): JSON mapping for Anthropic→backend model names, e.g. `{"claude-3.5-sonnet": "Qwen2-72B-Instruct"}`.
  - `DEBUG_PROXY` (optional): set to `1`/`true`/`yes` to log upstream payload metadata (helps verify model casing). The proxy preserves outward-facing casing but auto-corrects upstream model casing when enabled.
  - `AUTO_FIX_MODEL_CASE` (optional, default on): Auto-correct model casing against `/v1/models` when needed; includes a small heuristic fallback for known providers (e.g., Moonshot Kimi).
  - `DISCOVERY_MIN_INTERVAL` (seconds, default 300): Minimum interval between model list/schema refreshes, to avoid rate limits.
  - `PROXY_BACKOFF_ON_429` (default on): For non-stream requests, honors a small `Retry-After` and retries once.
- Schema discovery: On startup (or first request with auth), the proxy queries `/v1/models` and tries `/v1/models/{id}` to build a lightweight capability map (tools/vision/reasoning). Payloads are adapted per model.
- Inspect discovered models: `GET /_schemas` (example below).
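A minimal local setup along these lines (the API key and model mapping are placeholders), followed by a quick check of the discovered capability map:

```bash
# Point the proxy at your backend and start it.
export CHUTES_BASE_URL=https://llm.chutes.ai
export CHUTES_API_KEY=YOUR_KEY                      # optional; otherwise the inbound header is forwarded
export MODEL_MAP='{"claude-3.5-sonnet": "Qwen2-72B-Instruct"}'
uvicorn app.main:app --host 0.0.0.0 --port 8090

# In another shell: inspect the capability map built from /v1/models.
curl -sS http://localhost:8090/_schemas
```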
## Quick Start Options

```bash
# Clone and start via Docker Compose
git clone https://github.com/takltc/claude-code-chutes-proxy
cd claude-code-chutes-proxy
docker compose up --build
# The proxy will be available at http://localhost:8090
```

```bash
# Install dependencies
python -m ensurepip --upgrade
python -m pip install -r requirements.txt

# Set environment and run
export CHUTES_BASE_URL=https://llm.chutes.ai
uvicorn app.main:app --host 0.0.0.0 --port 8090
```

## Install

```bash
python -m ensurepip --upgrade
python -m pip install -r requirements.txt
```

## Run

```bash
export CHUTES_BASE_URL=https://llm.chutes.ai
uvicorn app.main:app --host 0.0.0.0 --port 8090
```
## Run with Claude Code

```bash
ANTHROPIC_BASE_URL="http://localhost:8090" \
ANTHROPIC_API_KEY="your-chutes-api-key" \
ANTHROPIC_MODEL="zai-org/GLM-4.5" \
ANTHROPIC_SMALL_FAST_MODEL="zai-org/GLM-4.5" \
CLAUDE_CODE_SUBAGENT_MODEL="zai-org/GLM-4.5" \
API_TIMEOUT_MS=1800000 \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
claude --dangerously-skip-permissions
```
## Usage (Anthropic-compatible)

`POST http://localhost:8090/v1/messages`

Body example:

```json
{
  "model": "claude-3.5-sonnet",
  "max_tokens": 512,
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Hello"}]}
  ]
}
```
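Streaming uses the same endpoint. A minimal sketch, assuming the standard Anthropic `stream` request flag (the proxy emits Anthropic-style SSE events, as noted below):

```bash
# -N disables curl's output buffering so SSE events print as they arrive.
curl -sS -N -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "claude-3.5-sonnet",
    "max_tokens": 512,
    "stream": true,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]
  }'
```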
## Notes

- Text content is fully supported. Tools are supported in both non-streaming and streaming (`tool_use`) modes.
- Images and multimodal: request-side user/system image blocks are translated to OpenAI `image_url` content entries (non-streaming). Assistant image outputs are not mapped back (rare in OpenAI responses).
- Streaming emits Anthropic-style SSE events for text deltas. Token usage is reported at the end when available from the backend.
- If your Chutes backend already exposes OpenAI-compatible endpoints (e.g. vLLM/SGLang templates), you can point `CHUTES_BASE_URL` directly at that service.
- Tool-call parsing: the proxy auto-selects an sglang parser per model family (LLaMA/Qwen/Mistral/DeepSeek/Kimi/GLM/GPT-OSS). Models whose id contains `longcat` are parsed with the GPT-OSS style detector, matching sglang's approach.
- Auto-compaction response headers:
  - `X-Proxy-Context-Tokens-Before/After/Threshold`
  - `X-Proxy-Context-Truncated` and `X-Proxy-Context-Summary`
  - `X-Proxy-Context-Removed-Messages`
  - `X-Proxy-Context-Reserve-Tokens`

  Downstream clients can surface these metrics for live telemetry and alerting; see the sketch below.
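As a rough sketch, you can surface those headers with `curl` (exact values depend on your backend and compaction settings):

```bash
# Dump response headers, discard the body, and keep only the proxy's context metrics.
curl -sS -D - -o /dev/null -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{"model": "claude-3.5-sonnet", "max_tokens": 64,
       "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]}' \
  | grep -i '^x-proxy-context'
```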
## DeepSeek :THINKING Suffix

- If you pass a model id that ends with `:THINKING` (case-insensitive), the proxy will:
  - Strip the `:THINKING` suffix before forwarding to the upstream backend model id.
  - Add the header `X-Enable-Thinking: true` to the upstream request.
- Example:

```bash
curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.1:THINKING",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Think step-by-step"}]}]
  }'
```

Upstream will receive the JSON model as `deepseek-ai/DeepSeek-V3.1` with the header `X-Enable-Thinking: true`.
## Environment Configuration

Create a `.env` file in your project root:

```env
CHUTES_BASE_URL=http://your-chutes-backend:8000
CHUTES_API_KEY=your-api-key-if-required
MODEL_MAP={"claude-3.5-sonnet": "Qwen2-72B-Instruct", "claude-3-haiku": "Llama-3.1-8B-Instruct"}
DEBUG_PROXY=1
```

Then uncomment the `env_file` line in `docker-compose.yml`:

```yaml
services:
  proxy:
    # ... existing config ...
    env_file:
      - .env
```

All environment variables with their defaults and descriptions:
| Variable | Default | Description |
|---|---|---|
| `CHUTES_BASE_URL` | `https://llm.chutes.ai` | Chutes/OpenAI-compatible backend URL |
| `CHUTES_API_KEY` | - | Optional API key for backend |
| `CHUTES_AUTH_STYLE` | `both` | Auth forwarding: `header`, `env`, or `both` |
| `MODEL_MAP` | `{}` | JSON string for model name mapping |
| `TOOL_NAME_MAP` | `{}` | JSON string for tool name mapping |
| `AUTO_FIX_MODEL_CASE` | `1` | Auto-correct model casing |
| `AUTO_FIX_MODEL_CASE_PREFLIGHT` | `0` | For streaming: preflight model-case discovery before the request (adds RTT). Keep `0` for speed; a 404 will retry. |
| `DEBUG_PROXY` | `0` | Enable request/response logging |
| `PROXY_BACKOFF_ON_429` | `1` | Retry on rate limiting |
| `PROXY_MAX_RETRY_ON_429` | `1` | Max retry attempts for 429 |
| `PROXY_MAX_RETRY_AFTER` | `2` | Max retry-after seconds |
| `PROXY_HTTP2` | `1` | Enable HTTP/2 when upstream supports it (lower latency) |
| `UVICORN_WORKERS` | `1` | Uvicorn worker processes |
| `PORT` | `8080` | Internal container port |
| `MODEL_DISCOVERY_TTL` | `300` | In-memory TTL (seconds) for the model list; disk persistence avoids re-fetch on restart |
| `MODEL_DISCOVERY_PERSIST` | `1` | Persist `/v1/models` results to disk for reuse across restarts |
| `MODEL_CACHE_FILE` | `~/.claude-code-chutes-proxy/models_cache.json` | Path to the models cache JSON file |
| `ENABLE_STREAM_TOOL_PARSER` | `0` | Enable the sglang tool-call parser on streaming text (turn on only if you need inline tool markup parsing) |
| `CHUTES_MAX_TOKENS` | `128000` | Maximum conversation tokens allowed before compaction |
| `CHUTES_RESPONSE_TOKEN_RESERVE` | `4096` | Token budget reserved for the model response when callers omit `max_tokens` |
| `CHUTES_MIN_CONTEXT_TOKENS` | `4096` | Lower bound for retained conversation tokens after compaction |
| `CHUTES_TOKEN_BUFFER_RATIO` | `0.85` | Fraction of the effective window to target before trimming |
| `CHUTES_TAIL_RESERVE` | `6` | Trailing messages preserved verbatim to keep recent turns intact |
| `CHUTES_SUMMARY_MODEL` | - | Optional model id used for conversation summarization (defaults to the request model) |
| `CHUTES_SUMMARY_MAX_TOKENS` | `1024` | Max tokens allocated when generating a summary |
| `CHUTES_SUMMARY_KEEP_LAST` | `4` | Number of most recent messages retained after summarization |
| `CHUTES_AUTO_CONDENSE_PERCENT` | `100` | Context percentage threshold that triggers automatic summarization |
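As an illustration only (not a recommendation), a tighter compaction setup than the defaults could combine the variables above like this:

```bash
# Illustrative compaction tuning; align with your backend's real context window.
export CHUTES_MAX_TOKENS=64000            # trim histories beyond ~64k conversation tokens
export CHUTES_RESPONSE_TOKEN_RESERVE=2048 # keep room for the reply when max_tokens is omitted
export CHUTES_TOKEN_BUFFER_RATIO=0.8      # start trimming at 80% of the effective window
export CHUTES_TAIL_RESERVE=8              # always keep the last 8 messages verbatim
export CHUTES_AUTO_CONDENSE_PERCENT=90    # summarize once context passes 90%
```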
## Persistent Model Discovery

- The proxy persists `/v1/models` results to a JSON file keyed by upstream URL and a light auth fingerprint.
- Default path: `~/.claude-code-chutes-proxy/models_cache.json`. Customize via `MODEL_CACHE_FILE`.
- This avoids re-fetching the model list on every process start. Use the admin endpoints below to inspect/refresh/clear it.
## Admin Endpoints

- `GET /_models_cache`: Show the current cache entry (ids, ts, base_url) for the active upstream/auth.
- `POST /_models_cache/refresh`: Re-fetch from upstream and persist.
- `DELETE /_models_cache`: Clear the current cache entry (memory + disk). The next request will re-create it. See the curl examples below.
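For example:

```bash
# Inspect the cached model list for the active upstream/auth
curl -sS http://localhost:8090/_models_cache

# Force a re-fetch from upstream and persist the result
curl -sS -X POST http://localhost:8090/_models_cache/refresh

# Clear the cache entry (memory + disk); it is re-created on the next request
curl -sS -X DELETE http://localhost:8090/_models_cache
```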
## Recommended Settings for Speed

- Keep `AUTO_FIX_MODEL_CASE_PREFLIGHT=0` (default) to avoid a preflight `/v1/models` call on streaming.
- Keep `PROXY_HTTP2=1` (default) to leverage HTTP/2 if upstream supports it.
- Keep `ENABLE_STREAM_TOOL_PARSER=0` (default); turn it on only when you need inline textual tool-call parsing during streaming. All three settings are collected in the snippet below.
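Expressed as environment exports (these are the defaults, shown explicitly for copy-paste):

```bash
export AUTO_FIX_MODEL_CASE_PREFLIGHT=0  # skip the preflight /v1/models call on streaming
export PROXY_HTTP2=1                    # use HTTP/2 when the upstream supports it
export ENABLE_STREAM_TOOL_PARSER=0      # avoid per-chunk inline tool-markup parsing
```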
## Docker

- Prebuilt image (GHCR):
  - Pull: `docker pull ghcr.io/takltc/claude-code-chutes-proxy:0.0.1`
  - Also available: `:latest` tag tracking default branch builds
  - Run:

    ```bash
    docker run --rm \
      -p 8090:8080 \
      -e CHUTES_BASE_URL=${CHUTES_BASE_URL:-https://llm.chutes.ai} \
      -e CHUTES_API_KEY=${CHUTES_API_KEY:-} \
      -e AUTO_FIX_MODEL_CASE=${AUTO_FIX_MODEL_CASE:-1} \
      -e DEBUG_PROXY=${DEBUG_PROXY:-0} \
      -e PROXY_BACKOFF_ON_429=${PROXY_BACKOFF_ON_429:-1} \
      -e PROXY_MAX_RETRY_ON_429=${PROXY_MAX_RETRY_ON_429:-1} \
      -e PROXY_MAX_RETRY_AFTER=${PROXY_MAX_RETRY_AFTER:-2} \
      -e CHUTES_AUTH_STYLE=${CHUTES_AUTH_STYLE:-both} \
      -e MODEL_MAP='${MODEL_MAP:-{}}' \
      -e TOOL_NAME_MAP='${TOOL_NAME_MAP:-{}}' \
      ghcr.io/takltc/claude-code-chutes-proxy:0.0.1
    ```

  - Docker Compose (use the prebuilt image instead of building):

    ```yaml
    services:
      proxy:
        image: ghcr.io/takltc/claude-code-chutes-proxy:0.0.1
        container_name: claude-chutes-proxy
        environment:
          - PORT=8080
          - CHUTES_BASE_URL=${CHUTES_BASE_URL:-https://llm.chutes.ai}
          - CHUTES_API_KEY=${CHUTES_API_KEY:-}
          - AUTO_FIX_MODEL_CASE=${AUTO_FIX_MODEL_CASE:-1}
          - DEBUG_PROXY=${DEBUG_PROXY:-0}
          - PROXY_BACKOFF_ON_429=${PROXY_BACKOFF_ON_429:-1}
          - PROXY_MAX_RETRY_ON_429=${PROXY_MAX_RETRY_ON_429:-1}
          - PROXY_MAX_RETRY_AFTER=${PROXY_MAX_RETRY_AFTER:-2}
          - CHUTES_AUTH_STYLE=${CHUTES_AUTH_STYLE:-both}
          - MODEL_MAP=${MODEL_MAP:-{}}
          - TOOL_NAME_MAP=${TOOL_NAME_MAP:-{}}
        ports:
          - "8090:8080"
        healthcheck:
          test: ["CMD-SHELL", "curl -fsS http://localhost:${PORT}/ || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
        restart: unless-stopped
    ```

- Build and run with Compose (local dev): `docker compose up --build`
  - Exposes `http://localhost:8090` → container `8080` (mapped from host port 8090).
  - Includes health checks with automatic restart.
- Authoritative list of configurable environment variables:
  - `CHUTES_BASE_URL` (default `https://llm.chutes.ai`) - Chutes/OpenAI backend URL
  - `CHUTES_API_KEY` (optional) - Backend API key
  - `CHUTES_AUTH_STYLE` (default `both`) - Auth forwarding behavior
  - `MODEL_MAP` (default `{}`) - JSON string mapping Anthropic→backend model names
  - `TOOL_NAME_MAP` (default `{}`) - JSON string mapping tool names
  - `AUTO_FIX_MODEL_CASE` (default `1`) - Auto-correct model casing
  - `DEBUG_PROXY` (default `0`) - Enable request/response logging
  - `PROXY_BACKOFF_ON_429` (default `1`) - Retry on rate limiting
  - `PROXY_MAX_RETRY_ON_429` (default `1`) - Max 429 retry attempts
  - `PROXY_MAX_RETRY_AFTER` (default `2`) - Max retry-after seconds
  - `UVICORN_WORKERS` (default `1`) - Number of Uvicorn workers
  - `PORT` (default `8080`) - Internal container port
- Manual Docker build/run:
  - Build: `docker build -t claude-chutes-proxy .`
  - Run: `docker run --rm -p 8090:8080 -e CHUTES_BASE_URL=$CHUTES_BASE_URL claude-chutes-proxy`
  - The container runs on port 8080 internally (exposed as 8090 on the host).
  - Includes health checks every 30 seconds.
## Docker usage example

```bash
curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "claude-3.5-sonnet",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]
  }'
```