llm-router

Intelligent LLM request router with priority queues, multi-model routing, circuit breakers, and semantic caching.

Why?

Running multiple LLM models (local Ollama, cloud OpenAI, etc.) in production means dealing with:

Priority management — Critical requests shouldn't wait behind low-priority background tasks
Multi-model routing — Send quality-critical tasks to your best model, volume tasks to a cheaper one
Fault tolerance — When a model goes down, automatically failover instead of crashing
Redundant calls — Similar prompts shouldn't hit the LLM twice if the answer is already cached
Backpressure — A slow model shouldn't accept more requests than it can handle

llm-router solves all of these in a single library with a clean async Python API.

Installation

# Core (Ollama + OpenAI backends)
pip install llm-router

# With semantic caching
pip install llm-router[cache]

# Everything
pip install llm-router[all]

Quickstart

import asyncio
from llm_router import LLMRouter, Priority

async def main():
    router = LLMRouter()
    router.add_model("local", provider="ollama", model="qwen2.5:3b")

    async with router:
        response = await router.generate(
            "What is the capital of France?",
            priority=Priority.NORMAL,
        )
        print(response.content)

asyncio.run(main())

Multi-Model Routing

Route requests to different models based on priority:

router = LLMRouter()

# Register models
router.add_model("gpu-fast", provider="ollama", model="llama3:8b",
                 max_queue_depth=5)
router.add_model("cpu-cheap", provider="ollama", model="qwen2.5:3b",
                 max_queue_depth=3, num_gpu=0, num_thread=4)
router.add_model("cloud", provider="openai", model="gpt-4o",
                 api_key="sk-...")

# Define routing rules
router.add_route(Priority.CRITICAL, targets=["gpu-fast"], fallback=["cloud"])
router.add_route(Priority.HIGH,     targets=["gpu-fast"], fallback=["cpu-cheap"])
router.add_route(Priority.NORMAL,   targets=["cpu-cheap"], fallback=["gpu-fast"])
router.add_route(Priority.LOW,      targets=["cpu-cheap"])

How routing works:

Find the routing rule for the request's priority
Try each target model in order
Skip models with open circuit breakers or full queues
Fall back to fallback models if all targets fail
Return error only if every model is unavailable

Circuit Breakers

Automatically stop sending requests to failing models:

router.enable_circuit_breaker(
    failure_threshold=5,     # Open after 5 consecutive failures
    recovery_timeout=60.0,   # Wait 60s before testing recovery
    success_threshold=3,     # Close after 3 consecutive successes
)

State machine: CLOSED → (5 failures) → OPEN → (60s) → HALF_OPEN → (3 successes) → CLOSED

When a model's circuit is OPEN, the router skips it and tries the next model in the routing rule.

Semantic Caching

Cache responses and return them for semantically similar prompts:

pip install llm-router[cache]

router.enable_cache(
    similarity_threshold=0.92,  # Cosine similarity for cache hit
    ttl_hours=24,               # Cache entries expire after 24h
    max_entries=10000,           # LRU eviction beyond this limit
)

# These two prompts are semantically similar enough to share a cached response:
await router.generate("What is Python?", temperature=0.1)
await router.generate("Tell me about the Python programming language", temperature=0.1)

Note: Only responses with temperature <= 0.3 are cached, since higher temperatures produce non-deterministic outputs.

Uses sentence-transformers (all-MiniLM-L6-v2) for embedding-based similarity matching with SQLite persistence.

Backpressure Protection

Prevent queue overload by rejecting low-priority requests when the system is busy:

router.add_model("local", provider="ollama", model="llama3:8b",
                 max_queue_depth=5)  # Reject when 5+ pending

The router automatically skips models at their queue depth limit and tries the next candidate. Critical and high-priority requests are always preferred over normal and low.

Custom Backends

Integrate any LLM provider by implementing the Backend protocol:

from llm_router import LLMRouter, Backend, LLMRequest, LLMResponse, register_backend

class MyCustomBackend:
    async def generate(self, request: LLMRequest, model: str, **kwargs) -> LLMResponse:
        result = await my_api.complete(model, request.prompt)
        return LLMResponse(
            content=result.text,
            model_used=model,
            success=True,
            latency_ms=result.duration_ms,
        )

    async def health_check(self) -> bool:
        return await my_api.ping()

    async def close(self) -> None:
        await my_api.disconnect()

# Register globally
register_backend("my_provider", MyCustomBackend)
router.add_model("custom", provider="my_provider", model="my-model")

# Or pass directly
router.add_model("custom", provider="custom", model="my-model",
                 backend=MyCustomBackend())

Observability

# Router-wide stats
router.get_stats()
# {
#     "total_requests": 150,
#     "successes": 145,
#     "failures": 5,
#     "success_rate": 0.967,
#     "cache_hits": 23,
#     "avg_latency_ms": 340.5,
#     "requests_by_model": {"gpu-fast": 80, "cpu-cheap": 70},
#     "requests_by_priority": {"CRITICAL": 10, "HIGH": 40, "NORMAL": 80, "LOW": 20},
# }

# Per-model health
router.get_model_status()
# {
#     "gpu-fast": {
#         "provider": "ollama",
#         "model": "llama3:8b",
#         "queue_depth": 2,
#         "circuit_breaker": {"state": "closed", "failure_count": 0},
#     },
# }

FastAPI Proxy Server

See examples/fastapi_server.py for a full REST API proxy that routes LLM requests through the router:

pip install llm-router fastapi uvicorn
uvicorn examples.fastapi_server:app

curl -X POST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello!", "priority": "high"}'

Architecture

┌─────────────────────────────────────────────────────┐
│                    LLMRouter                         │
│                                                      │
│  ┌──────────┐   ┌──────────────┐   ┌─────────────┐ │
│  │ Semantic  │   │   Routing    │   │   Stats     │ │
│  │  Cache    │──>│   Engine     │──>│  Tracker    │ │
│  └──────────┘   └──────┬───────┘   └─────────────┘ │
│                         │                            │
│         ┌───────────────┼───────────────┐            │
│         ▼               ▼               ▼            │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐   │
│  │  Model A    │ │  Model B    │ │  Model C    │   │
│  │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │   │
│  │ │ Circuit │ │ │ │ Circuit │ │ │ │ Circuit │ │   │
│  │ │ Breaker │ │ │ │ Breaker │ │ │ │ Breaker │ │   │
│  │ ├─────────┤ │ │ ├─────────┤ │ │ ├─────────┤ │   │
│  │ │Priority │ │ │ │Priority │ │ │ │Priority │ │   │
│  │ │ Queue   │ │ │ │ Queue   │ │ │ │ Queue   │ │   │
│  │ ├─────────┤ │ │ ├─────────┤ │ │ ├─────────┤ │   │
│  │ │ Backend │ │ │ │ Backend │ │ │ │ Backend │ │   │
│  │ │(Ollama) │ │ │ │(OpenAI) │ │ │ │(Custom) │ │   │
│  │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │   │
│  └─────────────┘ └─────────────┘ └─────────────┘   │
└─────────────────────────────────────────────────────┘

License

MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
examples		examples
src/llm_router		src/llm_router
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llm-router

Why?

Installation

Quickstart

Multi-Model Routing

Circuit Breakers

Semantic Caching

Backpressure Protection

Custom Backends

Observability

FastAPI Proxy Server

Architecture

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

llm-router

Why?

Installation

Quickstart

Multi-Model Routing

Circuit Breakers

Semantic Caching

Backpressure Protection

Custom Backends

Observability

FastAPI Proxy Server

Architecture

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages