A high-performance load-balancing gateway for LLM inference services, supporting multiple backends (Huawei Cloud API Gateway, vLLM with Basic Auth) with health monitoring, weighted least-inflight scheduling, and session stickiness.
- OpenAI-Compatible API: Full support for chat/completions, completions, embeddings, and models endpoints, including streaming
- Multi-Backend Support:
  - Huawei Cloud API Gateway (APIG signature authentication)
  - vLLM with Basic Auth
  - Easy to add more backends
- Advanced Load Balancing:
  - Weighted least-inflight scheduling
  - Session stickiness via the `X-Session-Key` header
  - Automatic failover for unhealthy servers
- Health Monitoring:
  - Periodic health checks (every 5 seconds)
  - EWMA latency tracking
  - Error rate statistics
  - Automatic server status updates
- Security:
  - Global Bearer token authentication
  - Protected admin and proxy endpoints
  - Public health check endpoints
- Web UI:
  - Interactive token verification
  - Real-time server status monitoring
  - Server management (add/delete)
  - Live metrics display
- Production Ready:
  - Docker containerization
  - Environment variable configuration
  - Comprehensive error handling
  - Structured logging
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Configure environment:

   ```bash
   cp .env.example .env  # Edit .env with your actual server credentials
   source .env
   ```

3. Start the service:

   ```bash
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   ```

4. Access the Web UI:
   - Open http://localhost:8000/ui
   - Enter your API token
   - Click "Verify"
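As a sketch, a minimal `.env` for a single Basic Auth backend might look like the following (the token, server address, and credentials are placeholders, not real values):

```shell
# Token clients must present as "Authorization: Bearer ..."
LLMMUX_AUTH_TOKEN=change-me-to-a-strong-token

# One hypothetical vLLM backend with Basic Auth
LLMMUX_INITIAL_SERVERS='[{"name":"vllm-1","type":"basic","base_url":"http://10.0.0.5:8000","path_prefix":"/v1","authorization":"Basic dXNlcjpwYXNz","weight":1,"max_concurrency":16,"timeout_s":60}]'
```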
1. Build image:

   ```bash
   docker build -t llm-mux:latest .
   ```

2. Run with docker-compose:

   ```bash
   docker-compose up -d
   ```

3. Access the service:
   - API: http://localhost:8000
   - Web UI: http://localhost:8000/ui
- `LLMMUX_AUTH_TOKEN`: API authentication token (default: `llm-mux-secret-token-2024`)
- `LLMMUX_INITIAL_SERVERS`: JSON array of initial backend servers, each entry shaped like:

```jsonc
{
  "name": "server-name",
  "type": "basic|huawei",
  "base_url": "http://host:port",
  "path_prefix": "/v1",
  "weight": 1,
  "max_concurrency": 16,
  "timeout_s": 60,
  // For Basic Auth
  "authorization": "Basic <base64-credentials>",
  // For Huawei
  "ak": "access-key",
  "sk": "secret-key"
}
```

All API requests require a Bearer token in the `Authorization` header:
```
Authorization: Bearer your-token-here
```

Chat completion:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false,
    "max_tokens": 100
  }'
```

List models:

```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer $TOKEN"
```

List backend servers:

```bash
curl http://localhost:8000/admin/servers \
  -H "Authorization: Bearer $TOKEN"
```

Add a backend server:

```bash
curl -X POST http://localhost:8000/admin/servers \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "new-server",
    "type": "basic",
    "base_url": "http://host:port",
    "path_prefix": "/v1",
    "authorization": "Basic xxx",
    "weight": 1,
    "max_concurrency": 16,
    "timeout_s": 60
  }'
```

Delete a backend server:

```bash
curl -X DELETE http://localhost:8000/admin/servers/{server_id} \
  -H "Authorization: Bearer $TOKEN"
```

Health checks:

```bash
# Liveness check
curl http://localhost:8000/healthz

# Readiness check (returns 503 if no healthy servers)
curl http://localhost:8000/readyz
```

Project layout:

```
llm-mux/
├── app/
│   ├── api/             # API routes (proxy, admin)
│   ├── adapters/        # Backend authentication adapters
│   ├── health/          # Health check monitoring
│   ├── lb/              # Load balancing logic
│   ├── security/        # Authentication middleware
│   ├── web/             # Web UI routes
│   ├── observability/   # Request ID middleware
│   ├── main.py          # FastAPI app entry point
│   ├── models.py        # Data models
│   └── repo.py          # Server registry
├── apig_sdk/            # Huawei API Gateway SDK (vendored)
├── Dockerfile           # Docker build configuration
├── docker-compose.yml   # Docker Compose setup
├── requirements.txt     # Python dependencies
├── .env.example         # Environment template
└── .gitignore           # Git ignore rules
```
- Weighted least-inflight scheduling
- Session stickiness via X-Session-Key header
- Automatic server selection with health awareness
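The policy above can be sketched as follows. This is a simplified model, not the actual `app/lb` implementation: each request goes to the healthy server with the lowest inflight-to-weight ratio, and a session key, when present, pins repeat requests to a consistent server by hashing.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    weight: int = 1
    healthy: bool = True
    inflight: int = 0

def pick_server(servers, session_key=None):
    """Weighted least-inflight selection with optional session stickiness."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        return None  # caller should respond 503 (no healthy backends)
    if session_key:
        # Stickiness: hash the session key onto the healthy-server list.
        # Note the mapping shifts when the healthy set changes - a sketch,
        # not a consistent-hashing implementation.
        idx = int(hashlib.sha256(session_key.encode()).hexdigest(), 16) % len(candidates)
        return candidates[idx]
    # Lower inflight/weight wins, so higher-weight servers absorb more load
    return min(candidates, key=lambda s: s.inflight / s.weight)

servers = [Server("a", weight=1, inflight=4), Server("b", weight=2, inflight=4)]
print(pick_server(servers).name)  # "b": 4/2 beats 4/1
```

With equal inflight counts, the weight-2 server wins, which is the intended bias toward larger backends.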
- Periodic probes to all backends
- EWMA latency calculation
- Error rate tracking
- Automatic health status updates
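The EWMA tracking above follows the standard update rule `ewma = alpha * sample + (1 - alpha) * ewma`. A sketch, where the smoothing factor `alpha=0.3` is an assumed value, not necessarily the gateway's constant:

```python
def update_ewma(prev_ms, sample_ms, alpha=0.3):
    """Exponentially weighted moving average of probe latency:
    recent samples dominate, older latency decays geometrically.
    The first sample seeds the average."""
    if prev_ms is None:
        return sample_ms
    return alpha * sample_ms + (1 - alpha) * prev_ms

ewma = None
for latency_ms in [100, 100, 400]:  # e.g. one slow probe at the end
    ewma = update_ewma(ewma, latency_ms)
print(ewma)  # 190.0 - the spike moves the average partway toward 400
```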
- Bearer token validation
- Exempt public endpoints (healthz, readyz, /ui)
- Middleware-based enforcement
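The enforcement logic can be sketched as a pure function (simplified; the real middleware lives in `app/security/` and its details may differ): exempt paths pass through, everything else must present the exact Bearer token.

```python
from typing import Optional

PUBLIC_PATHS = {"/healthz", "/readyz"}  # plus everything under /ui

def is_authorized(path: str, auth_header: Optional[str], token: str) -> bool:
    """Return True if the request may proceed past the auth middleware."""
    if path in PUBLIC_PATHS or path.startswith("/ui"):
        return True  # public endpoints skip token checks
    return auth_header == f"Bearer {token}"
```

Admin and proxy routes (`/admin/*`, `/v1/*`) fall through to the token comparison, matching the protected/public split listed above.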
- Huawei (`app/adapters/huawei.py`): APIG signature authentication
- vLLM Basic (`app/adapters/vllm_basic.py`): Basic Auth header injection
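For the Basic adapter, the injected header is standard HTTP Basic Auth. A sketch of producing the `authorization` config value from credentials (the username and password here are placeholders):

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the value for the "authorization" field of a basic-type server."""
    creds = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {creds}"

print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```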
- `enabled`: Manual enable/disable flag
- `healthy`: Automatic health status from probes
- `ewma_latency_ms`: Exponentially weighted moving average latency
- `error_rate`: Percentage of failed requests
- `inflight`: Current number of in-flight requests
The /readyz endpoint returns:
- 200 OK: At least one healthy server available
- 503 Service Unavailable: No healthy servers available
1. Security
   - Use strong authentication tokens
   - Enable HTTPS via a reverse proxy (e.g., Nginx)
   - Restrict IP access

2. Performance
   - Use Gunicorn with multiple workers:

     ```bash
     gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
     ```

   - Configure reverse proxy caching

3. Reliability
   - Use a process manager (systemd, supervisor)
   - Configure log rotation
   - Set up monitoring and alerting

4. Observability
   - Monitor the /healthz and /readyz endpoints
   - Track request latency and error rates
   - Set up structured logging
- Check server connectivity:

  ```bash
  curl http://server:port/v1/models
  ```

- Verify authentication credentials
- Check server logs for errors
- Manually disable and re-enable via the Web UI
- Verify at least one backend is running
- Check network connectivity between gateway and backends
- Review health check logs for error details
- Temporarily disable health checks if needed (via API)
- Check backend server load
- Verify network latency between gateway and backends
- Review EWMA latency metrics in Web UI
- Consider adding more backend servers