A high-performance load-balancing gateway for LLM inference services, supporting multiple backends (Huawei Cloud API Gateway, vLLM with Basic Auth) with health monitoring, weighted least-inflight scheduling, and session stickiness.
- OpenAI-Compatible API: Full support for chat/completions, completions, embeddings, and models endpoints, including streaming
- Multi-Backend Support:
  - Huawei Cloud API Gateway (APIG signature authentication)
  - vLLM with Basic Auth
  - Easy to add more backends
- Advanced Load Balancing:
  - Weighted least-inflight scheduling
  - Session stickiness via the `X-Session-Key` header
  - Automatic failover for unhealthy servers
- Health Monitoring:
  - Periodic health checks (every 5 seconds)
  - EWMA latency tracking
  - Error rate statistics
  - Automatic server status updates
- Security:
  - Global Bearer token authentication
  - Protected admin and proxy endpoints
  - Public health check endpoints
- Web UI:
  - Interactive token verification
  - Real-time server status monitoring
  - Server management (add/delete)
  - Live metrics display
- Production Ready:
  - Docker containerization
  - Environment variable configuration
  - Comprehensive error handling
  - Structured logging
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Configure environment:

   ```bash
   cp .env.example .env  # Edit .env with your actual server credentials
   source .env
   ```

3. Start the service:

   ```bash
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   ```

4. Access the Web UI:
   - Open http://localhost:8000/ui
   - Enter your API token
   - Click "Verify"
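As a sketch, a minimal `.env` for a single Basic Auth backend might look like the following (the token, server address, and credentials are placeholders, not real values):

```shell
# Token clients must present as "Authorization: Bearer ..."
LLMMUX_AUTH_TOKEN=change-me-to-a-strong-token

# One hypothetical vLLM backend with Basic Auth
LLMMUX_INITIAL_SERVERS='[{"name":"vllm-1","type":"basic","base_url":"http://10.0.0.5:8000","path_prefix":"/v1","authorization":"Basic dXNlcjpwYXNz","weight":1,"max_concurrency":16,"timeout_s":60}]'
```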
1. Build image:

   ```bash
   docker build -t llm-mux:latest .
   ```

2. Run with docker-compose:

   ```bash
   docker-compose up -d
   ```

3. Access the service:
   - API: http://localhost:8000
   - Web UI: http://localhost:8000/ui
- `LLMMUX_AUTH_TOKEN`: API authentication token (default: `llm-mux-secret-token-2024`)
- `LLMMUX_INITIAL_SERVERS`: JSON array of initial backend servers, each entry shaped like:

```jsonc
{
  "name": "server-name",
  "type": "basic|huawei",
  "base_url": "http://host:port",
  "path_prefix": "/v1",
  "weight": 1,
  "max_concurrency": 16,
  "timeout_s": 60,
  // For Basic Auth
  "authorization": "Basic <base64-credentials>",
  // For Huawei
  "ak": "access-key",
  "sk": "secret-key"
}
```

All API requests require a Bearer token in the `Authorization` header:
```
Authorization: Bearer your-token-here
```

Chat completion:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false,
    "max_tokens": 100
  }'
```

List models:

```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer $TOKEN"
```

List backend servers:

```bash
curl http://localhost:8000/admin/servers \
  -H "Authorization: Bearer $TOKEN"
```

Add a backend server:

```bash
curl -X POST http://localhost:8000/admin/servers \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "new-server",
    "type": "basic",
    "base_url": "http://host:port",
    "path_prefix": "/v1",
    "authorization": "Basic xxx",
    "weight": 1,
    "max_concurrency": 16,
    "timeout_s": 60
  }'
```

Delete a backend server:

```bash
curl -X DELETE http://localhost:8000/admin/servers/{server_id} \
  -H "Authorization: Bearer $TOKEN"
```

Health checks:

```bash
# Liveness check
curl http://localhost:8000/healthz

# Readiness check (returns 503 if no healthy servers)
curl http://localhost:8000/readyz
```

Project layout:

```
llm-mux/
├── app/
│   ├── api/             # API routes (proxy, admin)
│   ├── adapters/        # Backend authentication adapters
│   ├── health/          # Health check monitoring
│   ├── lb/              # Load balancing logic
│   ├── security/        # Authentication middleware
│   ├── web/             # Web UI routes
│   ├── observability/   # Request ID middleware
│   ├── main.py          # FastAPI app entry point
│   ├── models.py        # Data models
│   └── repo.py          # Server registry
├── apig_sdk/            # Huawei API Gateway SDK (vendored)
├── Dockerfile           # Docker build configuration
├── docker-compose.yml   # Docker Compose setup
├── requirements.txt     # Python dependencies
├── .env.example         # Environment template
└── .gitignore           # Git ignore rules
```
- Weighted least-inflight scheduling
- Session stickiness via X-Session-Key header
- Automatic server selection with health awareness
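The policy above can be sketched as follows. This is a simplified model, not the actual `app/lb` implementation: each request goes to the healthy server with the lowest inflight-to-weight ratio, and a session key, when present, pins repeat requests to a consistent server by hashing.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    weight: int = 1
    healthy: bool = True
    inflight: int = 0

def pick_server(servers, session_key=None):
    """Weighted least-inflight selection with optional session stickiness."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        return None  # caller should respond 503 (no healthy backends)
    if session_key:
        # Stickiness: hash the session key onto the healthy-server list.
        # Note the mapping shifts when the healthy set changes - a sketch,
        # not a consistent-hashing implementation.
        idx = int(hashlib.sha256(session_key.encode()).hexdigest(), 16) % len(candidates)
        return candidates[idx]
    # Lower inflight/weight wins, so higher-weight servers absorb more load
    return min(candidates, key=lambda s: s.inflight / s.weight)

servers = [Server("a", weight=1, inflight=4), Server("b", weight=2, inflight=4)]
print(pick_server(servers).name)  # "b": 4/2 beats 4/1
```

With equal inflight counts, the weight-2 server wins, which is the intended bias toward larger backends.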
- Periodic probes to all backends
- EWMA latency calculation
- Error rate tracking
- Automatic health status updates
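The EWMA tracking above follows the standard update rule `ewma = alpha * sample + (1 - alpha) * ewma`. A sketch, where the smoothing factor `alpha=0.3` is an assumed value, not necessarily the gateway's constant:

```python
def update_ewma(prev_ms, sample_ms, alpha=0.3):
    """Exponentially weighted moving average of probe latency:
    recent samples dominate, older latency decays geometrically.
    The first sample seeds the average."""
    if prev_ms is None:
        return sample_ms
    return alpha * sample_ms + (1 - alpha) * prev_ms

ewma = None
for latency_ms in [100, 100, 400]:  # e.g. one slow probe at the end
    ewma = update_ewma(ewma, latency_ms)
print(ewma)  # 190.0 - the spike moves the average partway toward 400
```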
- Bearer token validation
- Exempt public endpoints (healthz, readyz, /ui)
- Middleware-based enforcement
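The enforcement logic can be sketched as a pure function (simplified; the real middleware lives in `app/security/` and its details may differ): exempt paths pass through, everything else must present the exact Bearer token.

```python
from typing import Optional

PUBLIC_PATHS = {"/healthz", "/readyz"}  # plus everything under /ui

def is_authorized(path: str, auth_header: Optional[str], token: str) -> bool:
    """Return True if the request may proceed past the auth middleware."""
    if path in PUBLIC_PATHS or path.startswith("/ui"):
        return True  # public endpoints skip token checks
    return auth_header == f"Bearer {token}"
```

Admin and proxy routes (`/admin/*`, `/v1/*`) fall through to the token comparison, matching the protected/public split listed above.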
- Huawei (`app/adapters/huawei.py`): APIG signature authentication
- vLLM Basic (`app/adapters/vllm_basic.py`): Basic Auth header injection
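For the Basic adapter, the injected header is standard HTTP Basic Auth. A sketch of producing the `authorization` config value from credentials (the username and password here are placeholders):

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the value for the "authorization" field of a basic-type server."""
    creds = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {creds}"

print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```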
- `enabled`: Manual enable/disable flag
- `healthy`: Automatic health status from probes
- `ewma_latency_ms`: Exponentially weighted moving average latency
- `error_rate`: Percentage of failed requests
- `inflight`: Current number of in-flight requests
The /readyz endpoint returns:
- 200 OK: At least one healthy server available
- 503 Service Unavailable: No healthy servers available
1. Security
   - Use strong authentication tokens
   - Enable HTTPS via a reverse proxy (e.g., Nginx)
   - Restrict IP access

2. Performance
   - Use Gunicorn with multiple workers:

     ```bash
     gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
     ```

   - Configure reverse proxy caching

3. Reliability
   - Use a process manager (systemd, supervisor)
   - Configure log rotation
   - Set up monitoring and alerting

4. Observability
   - Monitor the /healthz and /readyz endpoints
   - Track request latency and error rates
   - Set up structured logging
- Check server connectivity:

  ```bash
  curl http://server:port/v1/models
  ```

- Verify authentication credentials
- Check server logs for errors
- Manually disable and re-enable via the Web UI
- Verify at least one backend is running
- Check network connectivity between gateway and backends
- Review health check logs for error details
- Temporarily disable health checks if needed (via API)
- Check backend server load
- Verify network latency between gateway and backends
- Review EWMA latency metrics in Web UI
- Consider adding more backend servers