# Claude-to-Chutes Proxy

## Overview

- Translates Anthropic Claude `v1/messages` requests to a Chutes LLM backend (OpenAI-compatible `/v1/chat/completions`).
- Converts Chutes/OpenAI-style responses back to Anthropic-compatible responses.
- Supports non-streaming and streaming (SSE) for text chats.
- Supports tools (see the example below):
  - Non-streaming: Anthropic `tools` + `tool_use`/`tool_result` ↔ OpenAI `tool_calls`/`tool` messages.
  - Streaming: `tool_use` bridging is supported (streams `input_json_delta` for function arguments).
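As an illustration of the tool bridging, this hedged sketch sends an Anthropic-style `tools` request through a locally running proxy; the tool definition and API key are placeholders, not part of this project:

```bash
# Illustrative only: get_weather and YOUR_KEY are placeholders.
curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "claude-3.5-sonnet",
    "max_tokens": 256,
    "tools": [{
      "name": "get_weather",
      "description": "Return the current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Weather in Paris?"}]}]
  }'
```

A `tool_use` block in the response corresponds to an OpenAI `tool_calls` entry upstream; the `tool_result` block you send on the next turn is converted to an OpenAI `tool` message.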
## Performance Notes (Important)

- Uses a shared HTTP client with connection pooling and optional HTTP/2 to reduce handshake latency.
- Model discovery (`/v1/models`) is persisted to disk by default, so subsequent boots do not re-fetch.
- Streaming function-call parsing is optional to reduce per-chunk CPU; see `ENABLE_STREAM_TOOL_PARSER` below.
- Automatic context compaction keeps Anthropic histories within the configurable token budget and surfaces live token usage via response headers.
## Quickstart

- Requirements: Python 3.10+ recommended (works with 3.13), `uvicorn` and `fastapi`.
- Env vars:
  - `CHUTES_BASE_URL`: Base URL of your Chutes/OpenAI backend (e.g. `https://llm.chutes.ai`).
  - `CHUTES_API_KEY` (optional): If your backend requires Bearer auth. If not set, the proxy forwards the inbound `x-api-key` or `Authorization` header to upstream.
  - `MODEL_MAP` (optional): JSON mapping for Anthropic→backend model names, e.g. `{"claude-3.5-sonnet": "Qwen2-72B-Instruct"}`.
  - `DEBUG_PROXY` (optional): set to `1`/`true`/`yes` to log upstream payload metadata (helps verify model casing). The proxy preserves outward-facing casing but auto-corrects upstream model casing when enabled.
  - `AUTO_FIX_MODEL_CASE` (optional, default on): Auto-correct model casing against `/v1/models` when needed; includes a small heuristic fallback for known providers (e.g., Moonshot Kimi).
  - `DISCOVERY_MIN_INTERVAL` (seconds, default 300): Minimum interval between model list/schema refreshes, to avoid rate limits.
  - `PROXY_BACKOFF_ON_429` (default on): For non-stream requests, honors a small `Retry-After` and retries once.
- Schema discovery: On startup (or first request with auth), the proxy queries `/v1/models` and tries `/v1/models/{id}` to build a lightweight capability map (tools/vision/reasoning). Payloads are adapted per model.
- Inspect discovered models: `GET /_schemas` (example below).
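A minimal local setup along these lines (the API key and model mapping are placeholders), followed by a quick check of the discovered capability map:

```bash
# Point the proxy at your backend and start it.
export CHUTES_BASE_URL=https://llm.chutes.ai
export CHUTES_API_KEY=YOUR_KEY                      # optional; otherwise the inbound header is forwarded
export MODEL_MAP='{"claude-3.5-sonnet": "Qwen2-72B-Instruct"}'
uvicorn app.main:app --host 0.0.0.0 --port 8090

# In another shell: inspect the capability map built from /v1/models.
curl -sS http://localhost:8090/_schemas
```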
## Quick Start Options

```bash
# Clone and start via Docker Compose
git clone https://github.com/takltc/claude-code-chutes-proxy
cd claude-code-chutes-proxy
docker compose up --build
# The proxy will be available at http://localhost:8090
```

```bash
# Install dependencies
python -m ensurepip --upgrade
python -m pip install -r requirements.txt

# Set environment and run
export CHUTES_BASE_URL=https://llm.chutes.ai
uvicorn app.main:app --host 0.0.0.0 --port 8090
```

## Install

```bash
python -m ensurepip --upgrade
python -m pip install -r requirements.txt
```

## Run

```bash
export CHUTES_BASE_URL=https://llm.chutes.ai
uvicorn app.main:app --host 0.0.0.0 --port 8090
```
## Run with Claude Code

```bash
ANTHROPIC_BASE_URL="http://localhost:8090" \
ANTHROPIC_API_KEY="your-chutes-api-key" \
ANTHROPIC_MODEL="zai-org/GLM-4.5" \
ANTHROPIC_SMALL_FAST_MODEL="zai-org/GLM-4.5" \
CLAUDE_CODE_SUBAGENT_MODEL="zai-org/GLM-4.5" \
API_TIMEOUT_MS=1800000 \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
claude --dangerously-skip-permissions
```
## Usage (Anthropic-compatible)

`POST http://localhost:8090/v1/messages`

Body example:

```json
{
  "model": "claude-3.5-sonnet",
  "max_tokens": 512,
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Hello"}]}
  ]
}
```
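Streaming uses the same endpoint. A minimal sketch, assuming the standard Anthropic `stream` request flag (the proxy emits Anthropic-style SSE events, as noted below):

```bash
# -N disables curl's output buffering so SSE events print as they arrive.
curl -sS -N -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "claude-3.5-sonnet",
    "max_tokens": 512,
    "stream": true,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]
  }'
```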
## Notes

- Text content is fully supported. Tools are supported in both non-streaming and streaming (`tool_use`) modes.
- Images and multimodal: request-side user/system image blocks are translated to OpenAI `image_url` content entries (non-streaming). Assistant image outputs are not mapped back (rare in OpenAI responses).
- Streaming emits Anthropic-style SSE events for text deltas. Token usage is reported at the end when available from the backend.
- If your Chutes backend already exposes OpenAI-compatible endpoints (e.g. vLLM/SGLang templates), you can point `CHUTES_BASE_URL` directly at that service.
- Tool-call parsing: the proxy auto-selects an sglang parser per model family (LLaMA/Qwen/Mistral/DeepSeek/Kimi/GLM/GPT-OSS). Models whose id contains `longcat` are parsed with the GPT-OSS style detector, matching sglang's approach.
- Auto-compaction response headers:
  - `X-Proxy-Context-Tokens-Before/After/Threshold`
  - `X-Proxy-Context-Truncated` and `X-Proxy-Context-Summary`
  - `X-Proxy-Context-Removed-Messages`
  - `X-Proxy-Context-Reserve-Tokens`

  Downstream clients can surface these metrics for live telemetry and alerting; see the sketch below.
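As a rough sketch, you can surface those headers with `curl` (exact values depend on your backend and compaction settings):

```bash
# Dump response headers, discard the body, and keep only the proxy's context metrics.
curl -sS -D - -o /dev/null -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{"model": "claude-3.5-sonnet", "max_tokens": 64,
       "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]}' \
  | grep -i '^x-proxy-context'
```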
## DeepSeek :THINKING Suffix

- If you pass a model id that ends with `:THINKING` (case-insensitive), the proxy will:
  - Strip the `:THINKING` suffix before forwarding to the upstream backend model id.
  - Add the header `X-Enable-Thinking: true` to the upstream request.
- Example:

```bash
curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.1:THINKING",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Think step-by-step"}]}]
  }'
```

Upstream will receive the JSON model as `deepseek-ai/DeepSeek-V3.1` with the header `X-Enable-Thinking: true`.
## Environment Configuration

Create a `.env` file in your project root:

```env
CHUTES_BASE_URL=http://your-chutes-backend:8000
CHUTES_API_KEY=your-api-key-if-required
MODEL_MAP={"claude-3.5-sonnet": "Qwen2-72B-Instruct", "claude-3-haiku": "Llama-3.1-8B-Instruct"}
DEBUG_PROXY=1
```

Then uncomment the `env_file` line in `docker-compose.yml`:

```yaml
services:
  proxy:
    # ... existing config ...
    env_file:
      - .env
```

All environment variables with their defaults and descriptions:
| Variable | Default | Description |
|---|---|---|
| `CHUTES_BASE_URL` | `https://llm.chutes.ai` | Chutes/OpenAI-compatible backend URL |
| `CHUTES_API_KEY` | - | Optional API key for backend |
| `CHUTES_AUTH_STYLE` | `both` | Auth forwarding: `header`, `env`, or `both` |
| `MODEL_MAP` | `{}` | JSON string for model name mapping |
| `TOOL_NAME_MAP` | `{}` | JSON string for tool name mapping |
| `AUTO_FIX_MODEL_CASE` | `1` | Auto-correct model casing |
| `AUTO_FIX_MODEL_CASE_PREFLIGHT` | `0` | For streaming: preflight model-case discovery before the request (adds RTT). Keep `0` for speed; a 404 will retry. |
| `DEBUG_PROXY` | `0` | Enable request/response logging |
| `PROXY_BACKOFF_ON_429` | `1` | Retry on rate limiting |
| `PROXY_MAX_RETRY_ON_429` | `1` | Max retry attempts for 429 |
| `PROXY_MAX_RETRY_AFTER` | `2` | Max retry-after seconds |
| `PROXY_HTTP2` | `1` | Enable HTTP/2 when upstream supports it (lower latency) |
| `UVICORN_WORKERS` | `1` | Uvicorn worker processes |
| `PORT` | `8080` | Internal container port |
| `MODEL_DISCOVERY_TTL` | `300` | In-memory TTL (seconds) for the model list; disk persistence avoids re-fetch on restart |
| `MODEL_DISCOVERY_PERSIST` | `1` | Persist `/v1/models` results to disk for reuse across restarts |
| `MODEL_CACHE_FILE` | `~/.claude-code-chutes-proxy/models_cache.json` | Path to the models cache JSON file |
| `ENABLE_STREAM_TOOL_PARSER` | `0` | Enable the sglang tool-call parser on streaming text (turn on only if you need inline tool markup parsing) |
| `CHUTES_MAX_TOKENS` | `128000` | Maximum conversation tokens allowed before compaction |
| `CHUTES_RESPONSE_TOKEN_RESERVE` | `4096` | Token budget reserved for the model response when callers omit `max_tokens` |
| `CHUTES_MIN_CONTEXT_TOKENS` | `4096` | Lower bound for retained conversation tokens after compaction |
| `CHUTES_TOKEN_BUFFER_RATIO` | `0.85` | Fraction of the effective window to target before trimming |
| `CHUTES_TAIL_RESERVE` | `6` | Trailing messages preserved verbatim to keep recent turns intact |
| `CHUTES_SUMMARY_MODEL` | - | Optional model id used for conversation summarization (defaults to the request model) |
| `CHUTES_SUMMARY_MAX_TOKENS` | `1024` | Max tokens allocated when generating a summary |
| `CHUTES_SUMMARY_KEEP_LAST` | `4` | Number of most recent messages retained after summarization |
| `CHUTES_AUTO_CONDENSE_PERCENT` | `100` | Context percentage threshold that triggers automatic summarization |
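As an illustration only (not a recommendation), a tighter compaction setup than the defaults could combine the variables above like this:

```bash
# Illustrative compaction tuning; align with your backend's real context window.
export CHUTES_MAX_TOKENS=64000            # trim histories beyond ~64k conversation tokens
export CHUTES_RESPONSE_TOKEN_RESERVE=2048 # keep room for the reply when max_tokens is omitted
export CHUTES_TOKEN_BUFFER_RATIO=0.8      # start trimming at 80% of the effective window
export CHUTES_TAIL_RESERVE=8              # always keep the last 8 messages verbatim
export CHUTES_AUTO_CONDENSE_PERCENT=90    # summarize once context passes 90%
```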
## Persistent Model Discovery

- The proxy persists `/v1/models` results to a JSON file keyed by upstream URL and a light auth fingerprint.
- Default path: `~/.claude-code-chutes-proxy/models_cache.json`. Customize via `MODEL_CACHE_FILE`.
- This avoids re-fetching the model list on every process start. Use the admin endpoints below to inspect/refresh/clear it.
## Admin Endpoints

- `GET /_models_cache`: Show the current cache entry (ids, ts, base_url) for the active upstream/auth.
- `POST /_models_cache/refresh`: Re-fetch from upstream and persist.
- `DELETE /_models_cache`: Clear the current cache entry (memory + disk). The next request will re-create it. See the curl examples below.
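For example:

```bash
# Inspect the cached model list for the active upstream/auth
curl -sS http://localhost:8090/_models_cache

# Force a re-fetch from upstream and persist the result
curl -sS -X POST http://localhost:8090/_models_cache/refresh

# Clear the cache entry (memory + disk); it is re-created on the next request
curl -sS -X DELETE http://localhost:8090/_models_cache
```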
## Recommended Settings for Speed

- Keep `AUTO_FIX_MODEL_CASE_PREFLIGHT=0` (default) to avoid a preflight `/v1/models` call on streaming.
- Keep `PROXY_HTTP2=1` (default) to leverage HTTP/2 if upstream supports it.
- Keep `ENABLE_STREAM_TOOL_PARSER=0` (default); turn it on only when you need inline textual tool-call parsing during streaming. All three settings are collected in the snippet below.
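Expressed as environment exports (these are the defaults, shown explicitly for copy-paste):

```bash
export AUTO_FIX_MODEL_CASE_PREFLIGHT=0  # skip the preflight /v1/models call on streaming
export PROXY_HTTP2=1                    # use HTTP/2 when the upstream supports it
export ENABLE_STREAM_TOOL_PARSER=0      # avoid per-chunk inline tool-markup parsing
```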
## Docker

- Prebuilt image (GHCR):
  - Pull: `docker pull ghcr.io/takltc/claude-code-chutes-proxy:0.0.1`
  - Also available: `:latest` tag tracking default branch builds
  - Run:

    ```bash
    docker run --rm \
      -p 8090:8080 \
      -e CHUTES_BASE_URL=${CHUTES_BASE_URL:-https://llm.chutes.ai} \
      -e CHUTES_API_KEY=${CHUTES_API_KEY:-} \
      -e AUTO_FIX_MODEL_CASE=${AUTO_FIX_MODEL_CASE:-1} \
      -e DEBUG_PROXY=${DEBUG_PROXY:-0} \
      -e PROXY_BACKOFF_ON_429=${PROXY_BACKOFF_ON_429:-1} \
      -e PROXY_MAX_RETRY_ON_429=${PROXY_MAX_RETRY_ON_429:-1} \
      -e PROXY_MAX_RETRY_AFTER=${PROXY_MAX_RETRY_AFTER:-2} \
      -e CHUTES_AUTH_STYLE=${CHUTES_AUTH_STYLE:-both} \
      -e MODEL_MAP='${MODEL_MAP:-{}}' \
      -e TOOL_NAME_MAP='${TOOL_NAME_MAP:-{}}' \
      ghcr.io/takltc/claude-code-chutes-proxy:0.0.1
    ```

  - Docker Compose (use the prebuilt image instead of building):

    ```yaml
    services:
      proxy:
        image: ghcr.io/takltc/claude-code-chutes-proxy:0.0.1
        container_name: claude-chutes-proxy
        environment:
          - PORT=8080
          - CHUTES_BASE_URL=${CHUTES_BASE_URL:-https://llm.chutes.ai}
          - CHUTES_API_KEY=${CHUTES_API_KEY:-}
          - AUTO_FIX_MODEL_CASE=${AUTO_FIX_MODEL_CASE:-1}
          - DEBUG_PROXY=${DEBUG_PROXY:-0}
          - PROXY_BACKOFF_ON_429=${PROXY_BACKOFF_ON_429:-1}
          - PROXY_MAX_RETRY_ON_429=${PROXY_MAX_RETRY_ON_429:-1}
          - PROXY_MAX_RETRY_AFTER=${PROXY_MAX_RETRY_AFTER:-2}
          - CHUTES_AUTH_STYLE=${CHUTES_AUTH_STYLE:-both}
          - MODEL_MAP=${MODEL_MAP:-{}}
          - TOOL_NAME_MAP=${TOOL_NAME_MAP:-{}}
        ports:
          - "8090:8080"
        healthcheck:
          test: ["CMD-SHELL", "curl -fsS http://localhost:${PORT}/ || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
        restart: unless-stopped
    ```

- Build and run with Compose (local dev): `docker compose up --build`
  - Exposes `http://localhost:8090` → container `8080` (mapped from host port 8090).
  - Includes health checks with automatic restart.
- Authoritative list of configurable environment variables:
  - `CHUTES_BASE_URL` (default `https://llm.chutes.ai`) - Chutes/OpenAI backend URL
  - `CHUTES_API_KEY` (optional) - Backend API key
  - `CHUTES_AUTH_STYLE` (default `both`) - Auth forwarding behavior
  - `MODEL_MAP` (default `{}`) - JSON string mapping Anthropic→backend model names
  - `TOOL_NAME_MAP` (default `{}`) - JSON string mapping tool names
  - `AUTO_FIX_MODEL_CASE` (default `1`) - Auto-correct model casing
  - `DEBUG_PROXY` (default `0`) - Enable request/response logging
  - `PROXY_BACKOFF_ON_429` (default `1`) - Retry on rate limiting
  - `PROXY_MAX_RETRY_ON_429` (default `1`) - Max 429 retry attempts
  - `PROXY_MAX_RETRY_AFTER` (default `2`) - Max retry-after seconds
  - `UVICORN_WORKERS` (default `1`) - Number of Uvicorn workers
  - `PORT` (default `8080`) - Internal container port
- Manual Docker build/run:
  - Build: `docker build -t claude-chutes-proxy .`
  - Run: `docker run --rm -p 8090:8080 -e CHUTES_BASE_URL=$CHUTES_BASE_URL claude-chutes-proxy`
  - The container runs on port 8080 internally (exposed as 8090 on the host).
  - Includes health checks every 30 seconds.
## Docker usage example

```bash
curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "claude-3.5-sonnet",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]
  }'
```