Claude-to-Chutes Proxy

A protocol translation proxy that converts Anthropic Claude API format to OpenAI-compatible format for Chutes LLM backend. Supports streaming responses, tool call mapping, and automatic model discovery.

Overview

  • Translates Anthropic Claude v1/messages requests to a Chutes LLM backend (OpenAI-compatible /v1/chat/completions).
  • Converts Chutes/OpenAI-style responses back to Anthropic-compatible responses.
  • Supports non-streaming and streaming (SSE) for text chats.
  • Supports tools (see the field-level sketch after this list):
    • Non-streaming: Anthropic tools + tool_use/tool_result ↔ OpenAI tool_calls/tool messages.
    • Streaming: tool_use bridging is supported (streams input_json_delta for function arguments).
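
For orientation, here is roughly how the tool block fields correspond across the two formats. This is an illustrative sketch only; the ids, tool name, and arguments are made up, and the exact payloads the proxy builds may differ in detail:

# Anthropic assistant content block requesting a tool call:
anthropic_tool_use = {
    "type": "tool_use",
    "id": "toolu_123",
    "name": "get_weather",
    "input": {"city": "Berlin"},
}

# Equivalent OpenAI chat-completions tool call (arguments become a JSON string):
openai_tool_call = {
    "id": "toolu_123",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Berlin\"}"},
}

# An Anthropic tool_result block carrying the tool's output...
anthropic_tool_result = {
    "type": "tool_result",
    "tool_use_id": "toolu_123",
    "content": "15°C and clear",
}

# ...maps to an OpenAI message with role "tool":
openai_tool_message = {
    "role": "tool",
    "tool_call_id": "toolu_123",
    "content": "15°C and clear",
}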

Performance Notes (Important)

  • Uses a shared HTTP client with connection pooling and optional HTTP/2 to reduce handshake latency.
  • Model discovery (/v1/models) is persisted to disk by default so subsequent boots do not re-fetch.
  • Streaming function-call parsing is optional to reduce per-chunk CPU; see ENABLE_STREAM_TOOL_PARSER below.
  • Automatic context compaction keeps Anthropic histories within the configurable token budget and surfaces live token usage via response headers.

Quickstart

  • Requirements: Python 3.10+ recommended (works with 3.13), uvicorn and fastapi.
  • Env vars:
    • CHUTES_BASE_URL: Base URL of your Chutes/OpenAI backend (e.g. https://llm.chutes.ai).
    • CHUTES_API_KEY (optional): If your backend requires Bearer auth. If not set, the proxy will forward the inbound x-api-key or Authorization header to upstream.
    • MODEL_MAP (optional): JSON mapping for Anthropic→backend model names, e.g. {"claude-3.5-sonnet": "Qwen2-72B-Instruct"}.
    • DEBUG_PROXY (optional): 1/true/yes to log upstream payload metadata (helps verify model casing). The proxy preserves outward-facing casing but will auto-correct upstream model casing when enabled.
    • AUTO_FIX_MODEL_CASE (optional, default on): Auto-correct model casing against /v1/models when needed; includes a small heuristic fallback for known providers (e.g., Moonshot Kimi).
    • DISCOVERY_MIN_INTERVAL (seconds, default 300): Minimum interval between model list/schema refreshes, to avoid rate limits.
    • PROXY_BACKOFF_ON_429 (default on): For non-streaming requests, honors a small Retry-After and retries once.
  • Schema discovery: On startup (or on the first authenticated request), the proxy queries /v1/models and tries /v1/models/{id} to build a lightweight capability map (tools/vision/reasoning). Payloads are adapted per model.
  • Inspect discovered models: GET /_schemas.

Quick Start Options

Option 1: Docker (Recommended)

# Clone and start via Docker Compose
git clone https://github.com/takltc/claude-code-chutes-proxy
cd claude-code-chutes-proxy
docker compose up --build

# The proxy will be available at http://localhost:8090

Option 2: Local Python

# Install dependencies
python -m ensurepip --upgrade
python -m pip install -r requirements.txt

# Set environment and run
export CHUTES_BASE_URL=https://llm.chutes.ai
uvicorn app.main:app --host 0.0.0.0 --port 8090

Run with Claude Code

ANTHROPIC_BASE_URL="http://localhost:8090" \
ANTHROPIC_API_KEY="your-chutes-api-key" \
ANTHROPIC_MODEL="zai-org/GLM-4.5" \
ANTHROPIC_SMALL_FAST_MODEL="zai-org/GLM-4.5" \
CLAUDE_CODE_SUBAGENT_MODEL="zai-org/GLM-4.5" \
API_TIMEOUT_MS=1800000 \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
claude --dangerously-skip-permissions

Usage (Anthropic-compatible)

POST http://localhost:8090/v1/messages

Body example:

{
  "model": "claude-3.5-sonnet",
  "max_tokens": 512,
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Hello"}]}
  ]
}
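
The same request can also be sent with the official anthropic Python SDK by pointing its base_url at the proxy. A minimal sketch, assuming pip install anthropic and that the key is your Chutes key:

from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8090",  # the proxy, not api.anthropic.com
    api_key="your-chutes-api-key",
)

message = client.messages.create(
    model="claude-3.5-sonnet",
    max_tokens=512,
    messages=[{"role": "user", "content": [{"type": "text", "text": "Hello"}]}],
)
print(message.content[0].text)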

Notes

  • Text content is fully supported. Tools are supported both in non-streaming and streaming (tool_use) modes.
  • Images and multimodal: request-side user/system image blocks are translated to OpenAI image_url content entries (non-streaming). Assistant image outputs are not mapped back (they are rare in OpenAI responses).
  • Streaming emits Anthropic-style SSE events for text deltas. Token usage is reported at end when available from backend.
  • If your Chutes backend already exposes OpenAI-compatible endpoints (e.g. vLLM/SGLang templates), you can point CHUTES_BASE_URL directly to that service.
  • Tool-call parsing: the proxy auto-selects an sglang parser per model family (LLaMA/Qwen/Mistral/DeepSeek/Kimi/GLM/GPT‑OSS). Models whose id contains longcat are parsed with the GPT‑OSS style detector, matching sglang’s approach.
  • Auto-compaction reports its state through response headers (see the sketch after this list for reading them client-side):
    • X-Proxy-Context-Tokens-Before/After/Threshold
    • X-Proxy-Context-Truncated and X-Proxy-Context-Summary
    • X-Proxy-Context-Removed-Messages
    • X-Proxy-Context-Reserve-Tokens
  • Downstream clients can surface these metrics for live telemetry and alerting.
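
A minimal sketch of reading these headers from a client, using httpx (any HTTP client works); the prompt and key are placeholders:

import httpx

resp = httpx.post(
    "http://localhost:8090/v1/messages",
    headers={"x-api-key": "YOUR_KEY", "content-type": "application/json"},
    json={
        "model": "claude-3.5-sonnet",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}],
    },
    timeout=60,
)

# Print whatever compaction telemetry the proxy attached to this response.
for name, value in resp.headers.items():
    if name.lower().startswith("x-proxy-context-"):
        print(f"{name}: {value}")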

DeepSeek :THINKING Suffix

  • If you pass a model id that ends with :THINKING (case‑insensitive), the proxy will:
    • Strip the :THINKING suffix before forwarding to the upstream backend model id.
    • Add header X-Enable-Thinking: true to the upstream request.
  • Example:
curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.1:THINKING",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Think step-by-step"}]}]
  }'

The upstream backend receives the model id deepseek-ai/DeepSeek-V3.1 in the JSON body, together with the header X-Enable-Thinking: true.
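
The suffix handling boils down to a simple rewrite before the upstream call. A minimal sketch of the idea (not the proxy's actual code):

def split_thinking_suffix(model_id: str) -> tuple[str, dict]:
    """Strip a trailing :THINKING (any casing) and return the extra upstream header."""
    if model_id.upper().endswith(":THINKING"):
        return model_id[: -len(":THINKING")], {"X-Enable-Thinking": "true"}
    return model_id, {}

model, extra_headers = split_thinking_suffix("deepseek-ai/DeepSeek-V3.1:THINKING")
# model == "deepseek-ai/DeepSeek-V3.1"
# extra_headers == {"X-Enable-Thinking": "true"}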

Environment Configuration

Docker Compose with .env File

Create a .env file in your project root:

CHUTES_BASE_URL=http://your-chutes-backend:8000
CHUTES_API_KEY=your-api-key-if-required
MODEL_MAP={"claude-3.5-sonnet": "Qwen2-72B-Instruct", "claude-3-haiku": "Llama-3.1-8B-Instruct"}
DEBUG_PROXY=1

Then uncomment the env_file line in docker-compose.yml:

services:
  proxy:
    # ... existing config ...
    env_file:
      - .env

Environment Variable Reference

All environment variables with their defaults and descriptions:

Variable Default Description
CHUTES_BASE_URL https://llm.chutes.ai Chutes/OpenAI-compatible backend URL
CHUTES_API_KEY - Optional API key for backend
CHUTES_AUTH_STYLE both Auth forwarding: header, env, or both
MODEL_MAP {} JSON string for model name mapping
TOOL_NAME_MAP {} JSON string for tool name mapping
AUTO_FIX_MODEL_CASE 1 Auto-correct model casing
AUTO_FIX_MODEL_CASE_PREFLIGHT 0 For streaming: preflight model-case discovery before request (adds RTT). Keep 0 for speed; 404 will retry.
DEBUG_PROXY 0 Enable request/response logging
PROXY_BACKOFF_ON_429 1 Retry on rate limiting
PROXY_MAX_RETRY_ON_429 1 Max retry attempts for 429
PROXY_MAX_RETRY_AFTER 2 Max retry-after seconds
PROXY_HTTP2 1 Enable HTTP/2 when upstream supports it (faster/less latency)
UVICORN_WORKERS 1 Uvicorn worker processes
PORT 8080 Internal container port
MODEL_DISCOVERY_TTL 300 In-memory TTL (seconds) for model list; disk persistence avoids re-fetch on restart
MODEL_DISCOVERY_PERSIST 1 Persist /v1/models results to disk for reuse across restarts
MODEL_CACHE_FILE ~/.claude-code-chutes-proxy/models_cache.json Path to models cache JSON file
ENABLE_STREAM_TOOL_PARSER 0 Enable sglang tool-call parser on streaming text (turn on only if you need inline tool markup parsing)
CHUTES_MAX_TOKENS 128000 Maximum conversation tokens allowed before compaction
CHUTES_RESPONSE_TOKEN_RESERVE 4096 Token budget reserved for the model response when callers omit max_tokens
CHUTES_MIN_CONTEXT_TOKENS 4096 Lower bound for retained conversation tokens after compaction
CHUTES_TOKEN_BUFFER_RATIO 0.85 Fraction of the effective window to target before trimming
CHUTES_TAIL_RESERVE 6 Trailing messages preserved verbatim to keep recent turns intact
CHUTES_SUMMARY_MODEL - Optional model id used for conversation summarization (defaults to the request model)
CHUTES_SUMMARY_MAX_TOKENS 1024 Max tokens allocated when generating a summary
CHUTES_SUMMARY_KEEP_LAST 4 Number of most recent messages retained after summarization
CHUTES_AUTO_CONDENSE_PERCENT 100 Context percentage threshold that triggers automatic summarization
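
The compaction knobs above combine into a trimming threshold. The exact formula lives in the proxy source; the sketch below is only one plausible reading of how CHUTES_MAX_TOKENS, CHUTES_RESPONSE_TOKEN_RESERVE, CHUTES_TOKEN_BUFFER_RATIO, and CHUTES_MIN_CONTEXT_TOKENS could interact, using the default values:

# Illustrative only: not the proxy's actual implementation.
MAX_TOKENS = 128_000        # CHUTES_MAX_TOKENS
RESPONSE_RESERVE = 4_096    # CHUTES_RESPONSE_TOKEN_RESERVE (used when max_tokens is omitted)
BUFFER_RATIO = 0.85         # CHUTES_TOKEN_BUFFER_RATIO
MIN_CONTEXT = 4_096         # CHUTES_MIN_CONTEXT_TOKENS

# Room left for conversation history after reserving response tokens,
# scaled by the buffer ratio, but never below the minimum context floor.
effective_window = MAX_TOKENS - RESPONSE_RESERVE
threshold = max(int(effective_window * BUFFER_RATIO), MIN_CONTEXT)
print(threshold)  # 105318 with the defaults above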

Persistent Model Discovery

  • The proxy persists /v1/models results to a JSON file keyed by upstream URL and a light auth fingerprint.
  • Default path: ~/.claude-code-chutes-proxy/models_cache.json. Customize via MODEL_CACHE_FILE.
  • This avoids re-fetching the model list on every process start. Use the admin endpoints below to inspect/refresh/clear.

Admin Endpoints

  • GET /_models_cache — Show current cache entry (ids, ts, base_url) for the active upstream/auth.
  • POST /_models_cache/refresh — Re-fetch from upstream and persist.
  • DELETE /_models_cache — Clear the current cache entry (memory + disk). Next request will re-create it.
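
A minimal sketch of exercising these endpoints with httpx (plain curl works just as well); depending on your setup they may need the same x-api-key/Authorization header as regular requests:

import httpx

BASE = "http://localhost:8090"
headers = {"x-api-key": "YOUR_KEY"}  # include if your upstream requires auth

for method, path in [
    ("GET", "/_models_cache"),          # inspect the current entry
    ("POST", "/_models_cache/refresh"), # re-fetch from upstream and persist
    ("DELETE", "/_models_cache"),       # clear the entry (memory + disk)
]:
    r = httpx.request(method, BASE + path, headers=headers)
    print(method, path, r.status_code)
    print(r.text)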

Recommended Settings for Speed

  • Keep AUTO_FIX_MODEL_CASE_PREFLIGHT=0 (default) to avoid a preflight /v1/models call on streaming.
  • Keep PROXY_HTTP2=1 (default) to leverage HTTP/2 if upstream supports it.
  • Keep ENABLE_STREAM_TOOL_PARSER=0 (default). Turn on only when you need inline textual tool-call parsing during stream.

Docker

  • Prebuilt image (GHCR):
    • Pull: docker pull ghcr.io/takltc/claude-code-chutes-proxy:0.0.1
    • Also available: :latest tag tracking default branch builds
    • Run:
      docker run --rm \
        -p 8090:8080 \
        -e CHUTES_BASE_URL=${CHUTES_BASE_URL:-https://llm.chutes.ai} \
        -e CHUTES_API_KEY=${CHUTES_API_KEY:-} \
        -e AUTO_FIX_MODEL_CASE=${AUTO_FIX_MODEL_CASE:-1} \
        -e DEBUG_PROXY=${DEBUG_PROXY:-0} \
        -e PROXY_BACKOFF_ON_429=${PROXY_BACKOFF_ON_429:-1} \
        -e PROXY_MAX_RETRY_ON_429=${PROXY_MAX_RETRY_ON_429:-1} \
        -e PROXY_MAX_RETRY_AFTER=${PROXY_MAX_RETRY_AFTER:-2} \
        -e CHUTES_AUTH_STYLE=${CHUTES_AUTH_STYLE:-both} \
        -e MODEL_MAP="${MODEL_MAP:-{}}" \
        -e TOOL_NAME_MAP="${TOOL_NAME_MAP:-{}}" \
        ghcr.io/takltc/claude-code-chutes-proxy:0.0.1
    • Docker Compose (use prebuilt image instead of building):
      services:
        proxy:
          image: ghcr.io/takltc/claude-code-chutes-proxy:0.0.1
          container_name: claude-chutes-proxy
          environment:
            - PORT=8080
            - CHUTES_BASE_URL=${CHUTES_BASE_URL:-https://llm.chutes.ai}
            - CHUTES_API_KEY=${CHUTES_API_KEY:-}
            - AUTO_FIX_MODEL_CASE=${AUTO_FIX_MODEL_CASE:-1}
            - DEBUG_PROXY=${DEBUG_PROXY:-0}
            - PROXY_BACKOFF_ON_429=${PROXY_BACKOFF_ON_429:-1}
            - PROXY_MAX_RETRY_ON_429=${PROXY_MAX_RETRY_ON_429:-1}
            - PROXY_MAX_RETRY_AFTER=${PROXY_MAX_RETRY_AFTER:-2}
            - CHUTES_AUTH_STYLE=${CHUTES_AUTH_STYLE:-both}
            - MODEL_MAP=${MODEL_MAP:-{}}
            - TOOL_NAME_MAP=${TOOL_NAME_MAP:-{}}
          ports:
            - "8090:8080"
          healthcheck:
            test: ["CMD-SHELL", "curl -fsS http://localhost:${PORT}/ || exit 1"]
            interval: 30s
            timeout: 5s
            retries: 3
          restart: unless-stopped
  • Build and run with Compose (local dev):
    • docker compose up --build
    • Exposes http://localhost:8090 (host port 8090 maps to container port 8080).
    • Includes health checks and automatic restart (restart: unless-stopped).
    • Configurable environment variables (see the Environment Variable Reference above for full details):
      • CHUTES_BASE_URL (default https://llm.chutes.ai) - Chutes/OpenAI backend URL
      • CHUTES_API_KEY (optional) - Backend API key
      • CHUTES_AUTH_STYLE (default both) - Auth forwarding behavior
      • MODEL_MAP (default {}) - JSON string mapping Anthropic→backend model names
      • TOOL_NAME_MAP (default {}) - JSON string mapping tool names
      • AUTO_FIX_MODEL_CASE (default 1) - Auto-correct model casing
      • DEBUG_PROXY (default 0) - Enable request/response logging
      • PROXY_BACKOFF_ON_429 (default 1) - Retry on rate limiting
      • PROXY_MAX_RETRY_ON_429 (default 1) - Max 429 retry attempts
      • PROXY_MAX_RETRY_AFTER (default 2) - Max retry-after seconds
      • UVICORN_WORKERS (default 1) - Number of Uvicorn workers
      • PORT (default 8080) - Internal container port
  • Manual Docker build/run:
    • Build: docker build -t claude-chutes-proxy .
    • Run: docker run --rm -p 8090:8080 -e CHUTES_BASE_URL=$CHUTES_BASE_URL claude-chutes-proxy
    • The container runs on port 8080 internally (exposed as 8090 on host)
    • Includes health checks every 30 seconds

Docker usage example

curl -sS -X POST http://localhost:8090/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-api-key: YOUR_KEY' \
  -d '{
    "model": "claude-3.5-sonnet",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]
  }'
