LLM-Mux: OpenAI-Compatible LLM Load Balancer

A high-performance load-balancing gateway for LLM inference services that supports multiple backends (Huawei Cloud API Gateway, vLLM with Basic Auth), with health monitoring, weighted least-inflight scheduling, and session stickiness.


Features

  • OpenAI-Compatible API: Full support for chat/completions, completions, embeddings, and models endpoints with streaming
  • Multi-Backend Support:
    • Huawei Cloud API Gateway (APIG signature authentication)
    • vLLM with Basic Auth
    • Easy to add more backends
  • Advanced Load Balancing:
    • Weighted least-inflight scheduling
    • Session stickiness via X-Session-Key header
    • Automatic failover for unhealthy servers
  • Health Monitoring:
    • Periodic health checks (every 5 seconds)
    • EWMA latency tracking
    • Error rate statistics
    • Automatic server status updates
  • Security:
    • Global Bearer Token authentication
    • Protected admin and proxy endpoints
    • Public health check endpoints
  • Web UI:
    • Interactive token verification
    • Real-time server status monitoring
    • Server management (add/delete)
    • Live metrics display
  • Production Ready:
    • Docker containerization
    • Environment variable configuration
    • Comprehensive error handling
    • Structured logging

Quick Start

Local Development

  1. Install dependencies

    pip install -r requirements.txt
  2. Configure environment

    cp .env.example .env
    # Edit .env with your actual server credentials
    source .env
  3. Start the service

    uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
  4. Access Web UI

    Open http://localhost:8000/ui in your browser.

Docker Deployment

  1. Build image

    docker build -t llm-mux:latest .
  2. Run with docker-compose

    docker-compose up -d
  3. Access service

    The service is reachable at the host port mapped in docker-compose.yml (typically http://localhost:8000).

Configuration

Environment Variables

  • LLMMUX_AUTH_TOKEN: API authentication token (default: llm-mux-secret-token-2024)
  • LLMMUX_INITIAL_SERVERS: JSON array of initial backend servers

Server Configuration Format

{
  "name": "server-name",
  "type": "basic|huawei",
  "base_url": "http://host:port",
  "path_prefix": "/v1",
  "weight": 1,
  "max_concurrency": 16,
  "timeout_s": 60,

  // For Basic Auth
  "authorization": "Basic <base64-credentials>",

  // For Huawei
  "ak": "access-key",
  "sk": "secret-key"
}
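
The // comments above are annotations only; actual JSON cannot contain comments. The same object format is used for each entry in LLMMUX_INITIAL_SERVERS, and the Basic Auth credential is the base64 encoding of user:password. A minimal Python sketch for generating both (names, hosts, and credentials are placeholders):

import base64
import json

# HTTP Basic credential: base64 of "user:password" (placeholder values)
basic = "Basic " + base64.b64encode(b"user:password").decode()

servers = [
    {
        "name": "vllm-1",                # placeholder name
        "type": "basic",
        "base_url": "http://host:port",  # placeholder backend address
        "path_prefix": "/v1",
        "authorization": basic,
        "weight": 1,
        "max_concurrency": 16,
        "timeout_s": 60,
    }
]

# Paste the printed JSON array into LLMMUX_INITIAL_SERVERS in .env
print(json.dumps(servers))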

API Usage

Authentication

All API requests require a Bearer token in the Authorization header:

Authorization: Bearer your-token-here

Proxy Endpoints

Chat Completions

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false,
    "max_tokens": 100
  }'
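
Because the gateway is OpenAI-compatible, any OpenAI client can be pointed at it by overriding the base URL. A sketch using the official openai Python package (the model name and session key are placeholders):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-token-here",  # the gateway's Bearer token
)

resp = client.chat.completions.create(
    model="model-name",  # placeholder
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
    # Optional: pin this session to one backend via session stickiness
    extra_headers={"X-Session-Key": "user-123"},
)
print(resp.choices[0].message.content)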

List Models

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer $TOKEN"

Admin API

List Servers

curl http://localhost:8000/admin/servers \
  -H "Authorization: Bearer $TOKEN"

Add Server

curl -X POST http://localhost:8000/admin/servers \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "new-server",
    "type": "basic",
    "base_url": "http://host:port",
    "path_prefix": "/v1",
    "authorization": "Basic xxx",
    "weight": 1,
    "max_concurrency": 16,
    "timeout_s": 60
  }'

Delete Server

curl -X DELETE http://localhost:8000/admin/servers/{server_id} \
  -H "Authorization: Bearer $TOKEN"

Health Endpoints (No Auth Required)

# Liveness check
curl http://localhost:8000/healthz

# Readiness check (returns 503 if no healthy servers)
curl http://localhost:8000/readyz

Project Structure

llm-mux/
├── app/
│   ├── api/              # API routes (proxy, admin)
│   ├── adapters/         # Backend authentication adapters
│   ├── health/           # Health check monitoring
│   ├── lb/               # Load balancing logic
│   ├── security/         # Authentication middleware
│   ├── web/              # Web UI routes
│   ├── observability/    # Request ID middleware
│   ├── main.py           # FastAPI app entry point
│   ├── models.py         # Data models
│   └── repo.py           # Server registry
├── apig_sdk/             # Huawei API Gateway SDK (vendored)
├── Dockerfile            # Docker build configuration
├── docker-compose.yml    # Docker Compose setup
├── requirements.txt      # Python dependencies
├── .env.example          # Environment template
└── .gitignore            # Git ignore rules

Key Components

Load Balancer (app/lb/selector.py)

  • Weighted least-inflight scheduling
  • Session stickiness via X-Session-Key header
  • Automatic server selection with health awareness
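
A simplified sketch of the selection rule, not the project's exact code: among enabled, healthy servers, pick the one with the lowest inflight-to-weight ratio, unless a sticky session already maps to a healthy server.

# Illustrative selector sketch; field names mirror the server status indicators below
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    weight: int = 1
    inflight: int = 0
    healthy: bool = True
    enabled: bool = True

# session key (from the X-Session-Key header) -> server name
sticky: dict[str, str] = {}

def select(servers: list[Server], session_key: str | None = None) -> Server:
    candidates = [s for s in servers if s.enabled and s.healthy]
    if not candidates:
        raise RuntimeError("no healthy servers available")
    if session_key and sticky.get(session_key):
        pinned = next((s for s in candidates if s.name == sticky[session_key]), None)
        if pinned:  # sticky hit; otherwise fail over to least-inflight below
            return pinned
    choice = min(candidates, key=lambda s: s.inflight / max(s.weight, 1))
    if session_key:
        sticky[session_key] = choice.name
    return choice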

Health Monitor (app/health/monitor.py)

  • Periodic probes to all backends
  • EWMA latency calculation
  • Error rate tracking
  • Automatic health status updates
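
The reported latency is an exponentially weighted moving average (EWMA). A sketch of the update rule; the smoothing factor here is an assumption, not the project's actual value:

def update_ewma(current_ms, sample_ms, alpha=0.3):
    """Blend a new latency sample into the running EWMA."""
    if current_ms is None:          # first observation seeds the average
        return sample_ms
    return alpha * sample_ms + (1 - alpha) * current_ms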

Authentication (app/security/auth.py)

  • Bearer token validation
  • Exempt public endpoints (healthz, readyz, /ui)
  • Middleware-based enforcement
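
A minimal sketch of middleware-based Bearer enforcement with public exemptions in FastAPI; illustrative only, not the project's exact implementation:

import os
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
TOKEN = os.environ.get("LLMMUX_AUTH_TOKEN", "llm-mux-secret-token-2024")
PUBLIC_PATHS = ("/healthz", "/readyz", "/ui")

@app.middleware("http")
async def require_bearer(request: Request, call_next):
    # Public endpoints bypass authentication
    if request.url.path.startswith(PUBLIC_PATHS):
        return await call_next(request)
    if request.headers.get("Authorization", "") != f"Bearer {TOKEN}":
        return JSONResponse({"detail": "Unauthorized"}, status_code=401)
    return await call_next(request)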

Adapters

  • Huawei (app/adapters/huawei.py): APIG signature authentication
  • vLLM Basic (app/adapters/vllm_basic.py): Basic Auth header injection

Monitoring

Server Status Indicators

  • enabled: Manual enable/disable flag
  • healthy: Automatic health status from probes
  • ewma_latency_ms: Exponential weighted moving average latency
  • error_rate: Percentage of failed requests
  • inflight: Current number of in-flight requests
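
These fields are returned by the admin API, so they can be watched from a terminal as well as the Web UI. A sketch that polls GET /admin/servers, assuming the default port and that the response is a JSON list using the field names above:

import os
import time
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": f"Bearer {os.environ['LLMMUX_AUTH_TOKEN']}"}

while True:
    # Response shape is assumed: a list of server objects with the fields listed above
    for s in requests.get(f"{BASE}/admin/servers", headers=HEADERS, timeout=5).json():
        print(f"{s['name']}: healthy={s['healthy']} inflight={s['inflight']} "
              f"ewma_latency_ms={s['ewma_latency_ms']} error_rate={s['error_rate']}")
    time.sleep(5)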

Readiness Check

The /readyz endpoint returns:

  • 200 OK: At least one healthy server available
  • 503 Service Unavailable: No healthy servers available
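
Because /readyz distinguishes these two states, deployment scripts can block until the gateway has at least one healthy backend. A sketch assuming the default port:

import sys
import time
import requests

# Wait up to ~60 s for /readyz to report ready, then exit 0 (ready) or 1 (not ready)
for _ in range(30):
    try:
        if requests.get("http://localhost:8000/readyz", timeout=2).status_code == 200:
            sys.exit(0)
    except requests.RequestException:
        pass
    time.sleep(2)
sys.exit(1)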

Production Deployment

Recommendations

  1. Security

    • Use strong authentication tokens
    • Enable HTTPS with reverse proxy (Nginx)
    • Restrict IP access
  2. Performance

    • Use Gunicorn with multiple workers:
      gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker
    • Configure reverse proxy caching
  3. Reliability

    • Use process manager (systemd, supervisor)
    • Configure log rotation
    • Set up monitoring and alerting
  4. Observability

    • Monitor /healthz and /readyz endpoints
    • Track request latency and error rates
    • Set up structured logging

Troubleshooting

Server Shows Unhealthy

  1. Check server connectivity: curl http://server:port/v1/models
  2. Verify authentication credentials
  3. Check server logs for errors
  4. Manually disable and re-enable via Web UI

No Healthy Servers Available

  1. Verify at least one backend is running
  2. Check network connectivity between gateway and backends
  3. Review health check logs for error details
  4. Temporarily disable health checks if needed (via API)

High Latency

  1. Check backend server load
  2. Verify network latency between gateway and backends
  3. Review EWMA latency metrics in Web UI
  4. Consider adding more backend servers
