This repository contains a small, containerized LLM routing demo. It shows how to:
- Expose a simple API that accepts prompts.
- Route requests using a learned bandit-style router.
- Serve generations from a Hugging Face model worker.
- Train and retrain the router model offline.
- Monitor the system with Prometheus and track experiments with MLflow.
All components are wired together via docker-compose.
- **api**: FastAPI application (`api/main.py`).
  - Accepts a prompt on `POST /query`.
  - Extracts simple numeric features from the prompt.
  - Calls the router service to choose an action.
  - Calls the model-worker to generate text.
  - Logs chosen actions to Redis and exposes Prometheus metrics on port `9000`.
- **router-service**: FastAPI application (`router-service/main.py`).
  - Uses a PyTorch `RouterModel` (`router-service/model.py`) and a `thompson_sampling` policy (`router-service/policy.py`).
  - Loads model weights from `/shared/router.pt` if available.
  - Endpoint `POST /route` takes a list of 5 features and returns an `action` (integer).
  - Exposes Prometheus metrics on port `9001`.
- **model-worker**: FastAPI application (`model-worker/main.py`).
  - Uses `transformers.pipeline("text-generation", model="EleutherAI/gpt-neo-125M")`.
  - Endpoint `POST /generate` takes a `prompt` and returns generated text.
  - Exposes Prometheus metrics on port `9002`.
- **ml** (training): Contains `ml/model.py` with the `RouterModel` definition, and `ml/train.py`, which:
  - Generates synthetic data (features, actions, rewards).
  - Trains the router model with PyTorch.
  - Logs metrics and artifacts to MLflow.
  - Saves the trained model to `/shared/router.pt`.
- **retrainer**: Service/Dockerfile intended to run retraining flows using the shared volume.
- **Infrastructure / supporting services**:
  - Redis: used by the API to log actions.
  - MLflow (`mlflow` service): experiment tracking UI on port `5000`.
  - Prometheus (`prometheus` service): metrics scraping, configured via `monitoring/prometheus.yml`, UI on port `9090`.
  - `base-ml`: base image used by the ML-related services.
  - All services share a Docker volume named `shared` for model artifacts (e.g. `router.pt`).
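The README does not show how the API derives its five numeric features; the real logic lives in `api/main.py`. As a purely illustrative sketch (the specific features below are invented, not taken from the repo), a minimal extractor might look like:

```python
import string


def extract_features(prompt: str) -> list[float]:
    """Derive five numeric features from a prompt.

    These particular features (length, word count, average word length,
    digit fraction, punctuation count) are hypothetical stand-ins for
    whatever api/main.py actually computes.
    """
    words = prompt.split()
    n_chars = len(prompt)
    n_words = len(words)
    avg_word_len = (n_chars / n_words) if n_words else 0.0
    digit_frac = sum(c.isdigit() for c in prompt) / n_chars if n_chars else 0.0
    punct_count = sum(c in string.punctuation for c in prompt)
    # Scale everything into roughly comparable [0, 1] ranges.
    return [
        min(n_chars / 500.0, 1.0),
        min(n_words / 100.0, 1.0),
        min(avg_word_len / 10.0, 1.0),
        digit_frac,
        min(punct_count / 20.0, 1.0),
    ]
```

Whatever the real feature set is, its output must be a list of five floats, matching what `POST /route` expects.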
The top-level docker-compose.yaml defines and wires all these services together.
- Docker and Docker Compose installed.
- Sufficient memory to load the `EleutherAI/gpt-neo-125M` model (a few GB of RAM is recommended).
- Internet access on the first run so that the Hugging Face model can be downloaded.
```bash
git clone <your-repo-url> llmproject
cd llmproject
```

From the repository root:

```bash
docker compose up --build
```

This will:

- Build the `base-ml`, `api`, `router-service`, `model-worker`, and `retrainer` images.
- Start Redis, MLflow, and Prometheus.
- Create the `shared` volume for model artifacts.
Once all containers are healthy:
- API: `http://localhost:8000`
- MLflow UI: `http://localhost:5000`
- Prometheus UI: `http://localhost:9090`
Stop everything with:

```bash
docker compose down
```

The router uses a PyTorch model trained with the script `ml/train.py`.
When run (typically inside a container that mounts the `shared` volume), it will:

- Generate synthetic features, actions, and rewards.
- Train `RouterModel` on this data.
- Log the training loss to MLflow.
- Save the trained weights to `/shared/router.pt`.
Once `/shared/router.pt` exists, `router-service` will automatically load it at startup:

- If the file is present: `Loaded trained router model.`
- If missing: `No trained model found. Using randomly initialized model.`
Depending on your workflow, training can be triggered via:

- A one-off container based on the `base-ml` or `retrainer` image.
- A job that runs `python ml/train.py` with `/shared` mounted.
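The exact synthetic-data scheme used by `ml/train.py` is not documented here. As a sketch of what generating bandit-style `(features, action, reward)` triples can look like (the reward rule below is invented for illustration and is almost certainly not the one in the repo):

```python
import random


def make_synthetic_batch(n: int, n_features: int = 5, n_actions: int = 2,
                         seed: int = 0) -> list[tuple[list[float], int, float]]:
    """Generate (features, action, reward) triples for offline router training.

    The reward rule is a made-up stand-in: action 1 is 'correct' when the
    mean feature value exceeds 0.5, action 0 otherwise.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        features = [rng.random() for _ in range(n_features)]
        action = rng.randrange(n_actions)
        best = 1 if sum(features) / n_features > 0.5 else 0
        reward = 1.0 if action == best else 0.0
        batch.append((features, action, reward))
    return batch
```

The real script then fits `RouterModel` to such triples with PyTorch and logs the run to MLflow.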
- URL: `POST http://localhost:8000/query`
- Body: a JSON-encoded string or a simple form field (the current implementation expects `prompt: str` directly).

Example using curl with JSON:

```bash
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '"Tell me a story about space exploration."'
```

The API will:

- Extract simple numeric features from the prompt.
- Ask the router for an `action`.
- Ask the model worker for generated text.
- Push the chosen `action` onto the Redis list `logs`.
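The steps above can be sketched as one plain function. Here the three downstream dependencies are injected as callables so the sketch stays self-contained; the real handler in `api/main.py` presumably makes HTTP calls to the router and worker and uses a Redis client, so treat the names below as illustrative:

```python
from typing import Callable, Sequence


def handle_query(
    prompt: str,
    extract: Callable[[str], Sequence[float]],   # feature extraction
    route: Callable[[Sequence[float]], int],     # stands in for POST /route
    generate: Callable[[str], str],              # stands in for POST /generate
    log_action: Callable[[int], None],           # stands in for a Redis RPUSH to "logs"
) -> dict:
    """Orchestrate one /query request: features -> action -> text -> log."""
    features = extract(prompt)
    action = route(features)
    response = generate(prompt)
    log_action(action)
    return {"action": action, "response": response}
```

With stub callables wired in, the function returns exactly the `{"action": ..., "response": ...}` shape documented below.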
Response shape:

```json
{
  "action": <int>,
  "response": "<generated text>"
}
```

Router service (inside the Docker network):

- URL: `POST http://router-service:8002/route`
- Body:

```json
{
  "features": [0.1, 0.5, 0.3, 0.7, 0.2]
}
```

Response:

```json
{
  "action": 1
}
```
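The `thompson_sampling` policy in `router-service/policy.py` is not reproduced in this README, and the repo's version works against the PyTorch `RouterModel`. For reference, the classic Bernoulli form of Thompson sampling (a generic sketch with per-action Beta posteriors, not the repo's implementation) looks like:

```python
import random


class ThompsonSampler:
    """Bernoulli Thompson sampling over a fixed set of actions.

    Keeps a Beta(successes + 1, failures + 1) posterior per action and
    picks the action whose posterior sample is largest.
    """

    def __init__(self, n_actions: int) -> None:
        self.successes = [0] * n_actions
        self.failures = [0] * n_actions

    def choose(self) -> int:
        # One draw from each action's posterior; exploit the argmax.
        samples = [
            random.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, action: int, reward: float) -> None:
        # Binary feedback: any positive reward counts as a success.
        if reward > 0:
            self.successes[action] += 1
        else:
            self.failures[action] += 1
```

Exploration falls out of the posterior sampling: uncertain actions still get drawn occasionally, while consistently rewarded actions dominate over time.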
Model worker (inside the Docker network):

- URL: `POST http://model-worker:8001/generate`
- Body:

```json
{
  "prompt": "Tell me a story about space exploration."
}
```

Response:

```json
{
  "response": "Tell me a story about space exploration. ..."
}
```
- **Prometheus metrics**:
  - The API exports counters/histograms such as `api_requests` and `api_latency_seconds` on port `9000`.
  - The router exports `router_requests` on port `9001`.
  - The model worker exports `worker_requests` on port `9002`.
  - `monitoring/prometheus.yml` configures how Prometheus scrapes these endpoints.
- **MLflow**:
  - The training script logs:
    - `loss` per epoch.
    - The `router.pt` artifact.
  - Access the tracking UI at `http://localhost:5000` to inspect runs.
- Python dependencies for each service live in its respective `requirements.txt`:
  - `api/requirements.txt`
  - `router-service/requirements.txt`
  - `model-worker/requirements.txt`
  - `retrainer/requirements.txt`
- Each major service has its own `Dockerfile`.
- The `shared` volume is critical for moving trained model weights between the training and inference services.
Some ideas for extending this demo:
- Hook the router’s rewards to real user feedback or latency measurements instead of synthetic data.
- Add additional model workers and extend the router to handle more actions.
- Implement scheduled retraining jobs that periodically update `/shared/router.pt`.
- Harden input validation and error handling on public endpoints.