A simple, Go-based alternative to the litellm proxy, without all the extra stuff you don't need! A modular reverse proxy that forwards requests to various LLM providers (OpenAI, Anthropic, Gemini) using Go and the Gorilla web toolkit.
- Multi-provider support: Full support for OpenAI, Anthropic, and Gemini
- Streaming Support: Native streaming support for all providers
- OpenAI Integration: Complete OpenAI API compatibility with `/openai` prefix
- Anthropic Integration: Claude API support with `/anthropic` prefix
- Gemini Integration: Google Gemini API support with `/gemini` prefix
- Comprehensive Logging: Request/response monitoring with streaming detection
- CORS Support: Browser-based application compatibility
- Health Check: Detailed health status for all providers
- Configurable Port: Environment variable configuration (default: 9002)
- Rate Limiting (experimental): Optional request/token-based limits per user/API key/model/provider

```bash
# Get help on available commands
make help
# Install dependencies and build
make install build
# Run the proxy
make run
# Or run in development mode
make dev
```

Once the proxy is running, you can make requests to LLM providers through the proxy:

```bash
# Health check (shows all provider statuses)
curl http://localhost:9002/health
```
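
The exact payload depends on the implementation, but a healthy response is a JSON document reporting each provider's status, roughly along these lines (field names are illustrative, not guaranteed):

```json
{
  "status": "healthy",
  "providers": {
    "openai": "healthy",
    "anthropic": "healthy",
    "gemini": "healthy"
  }
}
```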

```bash
# OpenAI Chat completions (replace YOUR_API_KEY with your actual OpenAI API key)
curl -X POST http://localhost:9002/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello, world!"}],
"max_tokens": 50
}'
```

```bash
# OpenAI Streaming
curl -X POST http://localhost:9002/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true,
"stream_options": {"include_usage": true}
}'
```

```bash
# Anthropic Messages
curl -X POST http://localhost:9002/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-3-sonnet-20240229",
"max_tokens": 100,
"messages": [{"role": "user", "content": "Hello!"}]
}'
```

```bash
# Gemini Generate Content
curl -X POST "http://localhost:9002/gemini/v1/models/gemini-pro:generateContent?key=YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents": [{"parts": [{"text": "Hello!"}]}]
}'
```
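
Gemini also has an explicit streaming endpoint (`:streamGenerateContent`, listed in the endpoints below) that the proxy forwards the same way. A request analogous to the one above might look like this; the `alt=sse` query parameter asks the Gemini API for server-sent events:

```bash
# Gemini Streaming (explicit streaming endpoint)
curl -X POST "http://localhost:9002/gemini/v1/models/gemini-pro:streamGenerateContent?alt=sse&key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Hello!"}]}]
  }'
```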

The project includes comprehensive integration tests for all providers:

```bash
# Run all tests
make test-all
# Run tests for specific providers
make test-openai
make test-anthropic
make test-gemini
# Run health check tests only
make test-health
# Check environment variables
make env-check
```

To run integration tests, you need to set up environment variables:

```bash
export OPENAI_API_KEY=your_openai_key
export ANTHROPIC_API_KEY=your_anthropic_key
export GEMINI_API_KEY=your_gemini_key
```

`PORT`: Environment variable to set the server port (default: 9002).
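
For example, to start the proxy on a different port:

```bash
# Override the default port (9002)
PORT=8080 make run
```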

- Disabled by default. Enable via config: see `configs/base.yml` and `configs/dev.yml`.
- Supports provisional token estimation with post-response reconciliation using `X-LLM-Input-Tokens` (input tokens only).
- Returns `429 Too Many Requests` with `Retry-After` and `X-RateLimit-*` headers when throttled.
- Redis backend is currently not supported; only the in-process memory backend is available.

Minimal dev example (see `configs/dev.yml` for a full setup):

```yaml
features:
  rate_limiting:
    enabled: true
    backend: "memory" # single instance only
    estimation:
      max_sample_bytes: 20000
      bytes_per_token: 4 # Fallback to request size (Content-Length based)
      chars_per_token: 4 # Default for message-based estimation
      # Optional per-provider overrides (recommended)
      provider_chars_per_token:
        openai: 5 # ~185–190 tokens per 1k chars (from scripts/token_estimation.py)
        anthropic: 3 # ~290–315 tokens per 1k chars (from scripts/token_estimation.py)
    limits:
      requests_per_minute: 0 # 0 = unlimited (dev defaults)
      tokens_per_minute: 0
```

- We currently account for and reconcile only input tokens. Output tokens are not yet considered for rate limits/credits.
- For small JSON requests (size controlled by `max_sample_bytes`), the proxy extracts textual message content via provider-specific parsers and estimates tokens by character count using `chars_per_token` (with per-provider overrides); see the sketch after this list.
- Default per-provider values come from benchmarks produced by `scripts/token_estimation.py`. You can run the script to generate your own table and override the values in config.
- Non-text modalities (images/videos) are not supported for estimation at this time and effectively fall back to credit-based-only behavior via `max_sample_bytes`.
- Optimistic first request: to avoid estimation blocking initial traffic, the first token-bearing request in a window (when the current token count is zero) is allowed even if token limits would otherwise apply. Subsequent requests are enforced normally.
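
Conceptually, the estimator behaves roughly like the sketch below. This is a simplification under stated assumptions: the type and function names are illustrative rather than the ones in the codebase, and the configuration fields map onto the YAML shown above.

```go
package ratelimit

// EstimationConfig mirrors the estimation block of the YAML config above;
// field names here are illustrative, not the project's actual types.
type EstimationConfig struct {
	MaxSampleBytes        int
	BytesPerToken         int
	CharsPerToken         int
	ProviderCharsPerToken map[string]int
}

// extractMessageText stands in for the provider-specific parsers that pull
// textual message content out of a small JSON request body.
func extractMessageText(provider string, body []byte) (string, bool) {
	// Provider-specific JSON parsing elided; returning the raw body keeps the sketch self-contained.
	return string(body), true
}

// estimateInputTokens sketches the provisional input-token estimate:
// character-count based for small, parseable bodies, request-size based otherwise.
func estimateInputTokens(provider string, body []byte, contentLength int, cfg EstimationConfig) int {
	if len(body) <= cfg.MaxSampleBytes {
		if text, ok := extractMessageText(provider, body); ok {
			charsPerToken := cfg.CharsPerToken
			if override, ok := cfg.ProviderCharsPerToken[provider]; ok {
				charsPerToken = override
			}
			return len(text) / charsPerToken
		}
	}
	// Fallback: size-based estimate from Content-Length.
	return contentLength / cfg.BytesPerToken
}
```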

- `GET /health` - Health check endpoint for all providers
- `POST /openai/v1/chat/completions` - OpenAI chat completions endpoint (streaming supported)
- `POST /openai/v1/completions` - OpenAI completions endpoint (streaming supported)
- `* /openai/v1/*` - All other OpenAI API endpoints
- `POST /anthropic/v1/messages` - Anthropic messages endpoint (streaming supported)
- `* /anthropic/v1/*` - All other Anthropic API endpoints
- `POST /gemini/v1/models/{model}:generateContent` - Gemini content generation (streaming supported)
- `POST /gemini/v1/models/{model}:streamGenerateContent` - Explicit streaming endpoint
- `* /gemini/v1/*` - All other Gemini API endpoints
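
The wildcard routes forward any other provider API call unchanged. For example, listing OpenAI models through the `/openai/v1/*` passthrough looks like this:

```bash
curl http://localhost:9002/openai/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
```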
The proxy is built with a modular architecture:

- `main.go`: Core server setup, middleware, and provider registration
- `providers/openai.go`: OpenAI-specific proxy implementation with streaming support
- `providers/anthropic.go`: Anthropic proxy implementation with streaming support
- `providers/gemini.go`: Gemini proxy implementation with streaming support
- `providers/provider.go`: Common interfaces and provider management

Each provider implements its own (a rough interface sketch follows this list):
- Route registration
- Request/response handling with streaming support
- Error handling
- Health status reporting
- Response metadata parsing
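
The actual interface lives in `providers/provider.go`; the sketch below only illustrates the shape such an interface might take, and the method names are assumptions rather than the real ones:

```go
package providers

import (
	"net/http"

	"github.com/gorilla/mux"
)

// Provider is an illustrative sketch of what the common provider interface
// could look like; the real definition in providers/provider.go may differ.
type Provider interface {
	// Name returns the provider identifier, e.g. "openai".
	Name() string
	// RegisterRoutes attaches the provider's routes (such as /openai/v1/...) to the router.
	RegisterRoutes(r *mux.Router)
	// IsStreaming reports whether a request should be handled as a streaming request.
	IsStreaming(r *http.Request) bool
	// HealthStatus returns the provider's status for the /health endpoint.
	HealthStatus() map[string]string
}
```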

```bash
# Get help on all available commands
make help
# Code quality
make check # Run all code quality checks
make fmt # Format Go code
make vet # Run go vet
make lint # Run golint
# Building
make build # Build the binary
make clean # Clean build artifacts
make install # Install dependencies
# Running
make run # Run the built binary
make dev # Run in development mode
# Testing
make test # Run unit tests
make test-all # Run all tests including integration
make test-openai # Run OpenAI tests only
make test-anthropic # Run Anthropic tests only
make test-gemini # Run Gemini tests only
```

Tests are organized by provider:

- `openai_test.go`: OpenAI integration tests (streaming and non-streaming)
- `anthropic_test.go`: Anthropic integration tests (streaming and non-streaming)
- `gemini_test.go`: Gemini integration tests (streaming and non-streaming)
- `common_test.go`: Health check and environment variable tests
- `test_helpers.go`: Shared test utilities
- Logging: Logs all incoming requests with streaming detection
- CORS: Adds CORS headers for browser compatibility
- Streaming: Optimized handling for streaming responses
- Error Handling: Provider-specific error handling
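
A rough sketch of how the logging and CORS middleware described above can be wired up with gorilla/mux; the handler names, header values, and streaming detection here are illustrative assumptions, not the project's exact implementation:

```go
package main

import (
	"log"
	"net/http"
	"strings"

	"github.com/gorilla/mux"
)

// loggingMiddleware logs each request and flags likely streaming requests
// (illustrative detection via the Accept header and Gemini's streaming path).
func loggingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		streaming := strings.Contains(r.Header.Get("Accept"), "text/event-stream") ||
			strings.Contains(r.URL.Path, ":streamGenerateContent")
		log.Printf("%s %s streaming=%v", r.Method, r.URL.Path, streaming)
		next.ServeHTTP(w, r)
	})
}

// corsMiddleware adds permissive CORS headers so browser-based clients can call the proxy.
func corsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		w.Header().Set("Access-Control-Allow-Headers", "Authorization, Content-Type, x-api-key, anthropic-version")
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	r := mux.NewRouter()
	r.Use(loggingMiddleware, corsMiddleware)
	log.Fatal(http.ListenAndServe(":9002", r))
}
```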
To add a new provider (a minimal skeleton is sketched after these steps):

- Create a new file (e.g., `newprovider.go`)
- Implement the `Provider` interface
- Add streaming detection logic
- Add response metadata parsing
- Create corresponding test file
- Register the provider in `main.go`
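
Assuming an interface shaped like the earlier sketch (the real method set is defined in `providers/provider.go`), a new provider skeleton might look like this:

```go
package providers

import (
	"net/http"

	"github.com/gorilla/mux"
)

// NewProvider is an illustrative skeleton; its methods mirror the interface
// sketch above rather than the project's actual Provider definition.
type NewProvider struct {
	baseURL string
	apiKey  string
}

func (p *NewProvider) Name() string { return "newprovider" }

func (p *NewProvider) RegisterRoutes(r *mux.Router) {
	// Forward everything under /newprovider/v1/ to the upstream API.
	r.PathPrefix("/newprovider/v1/").HandlerFunc(p.handleProxy)
}

func (p *NewProvider) IsStreaming(r *http.Request) bool {
	// Streaming detection logic for the new provider goes here.
	return false
}

func (p *NewProvider) HealthStatus() map[string]string {
	return map[string]string{"status": "healthy"}
}

func (p *NewProvider) handleProxy(w http.ResponseWriter, r *http.Request) {
	// Reverse-proxy the request to the upstream provider, stripping the
	// /newprovider prefix and forwarding headers and body.
}
```

Registration in `main.go` then amounts to constructing the provider and letting it attach its routes to the shared router.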
- Gorilla Mux - HTTP router and URL matcher
The binary includes build-time information:
- Git commit hash
- Build timestamp
- Go version
View build info with:

```bash
make version
```
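
Build-time values like these are typically injected through the Go linker's `-X` flag. An illustrative command follows; the variable names and output path are assumptions, not necessarily what this project's Makefile uses:

```bash
# Example only: inject build info via the Go linker
go build \
  -ldflags "-X main.gitCommit=$(git rev-parse --short HEAD) -X main.buildTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  -o bin/llm-proxy .
```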