This repository contains a small, containerized LLM routing demo. It shows how to:
- Expose a simple API that accepts prompts.
- Route requests using a learned bandit-style router.
- Serve generations from a Hugging Face model worker.
- Train and retrain the router model offline.
- Monitor the system with Prometheus and track experiments with MLflow.
All components are wired together via docker-compose.
- **api**: FastAPI application (`api/main.py`).
  - Accepts a prompt on `POST /query`.
  - Extracts simple numeric features from the prompt.
  - Calls the router service to choose an action.
  - Calls the model-worker to generate text.
  - Logs chosen actions to Redis and exposes Prometheus metrics on port `9000`.
- **router-service**: FastAPI application (`router-service/main.py`).
  - Uses a PyTorch `RouterModel` (`router-service/model.py`) and a `thompson_sampling` policy (`router-service/policy.py`).
  - Loads model weights from `/shared/router.pt` if available.
  - Endpoint `POST /route` takes a list of 5 features and returns an `action` (integer).
  - Exposes Prometheus metrics on port `9001`.
- **model-worker**: FastAPI application (`model-worker/main.py`).
  - Uses `transformers.pipeline("text-generation", model="EleutherAI/gpt-neo-125M")`.
  - Endpoint `POST /generate` takes a `prompt` and returns generated text.
  - Exposes Prometheus metrics on port `9002`.
- **ml** (training): Contains `ml/model.py` with the `RouterModel` definition, and `ml/train.py`, which:
  - Generates synthetic data (features, actions, rewards).
  - Trains the router model with PyTorch.
  - Logs metrics and artifacts to MLflow.
  - Saves the trained model to `/shared/router.pt`.
- **retrainer**: Service/Dockerfile intended to run retraining flows using the shared volume.
- **Infrastructure / supporting services**:
  - Redis: used by the API to log actions.
  - MLflow (`mlflow` service): experiment tracking UI on port `5000`.
  - Prometheus (`prometheus` service): metrics scraping, configured via `monitoring/prometheus.yml`, UI on port `9090`.
  - `base-ml`: base image used by the ML-related services.
  - All services share a Docker volume named `shared` for model artifacts (e.g. `router.pt`).
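The README does not show how the API derives its five numeric features; the real logic lives in `api/main.py`. As a purely illustrative sketch (the specific features below are invented, not taken from the repo), a minimal extractor might look like:

```python
import string


def extract_features(prompt: str) -> list[float]:
    """Derive five numeric features from a prompt.

    These particular features (length, word count, average word length,
    digit fraction, punctuation count) are hypothetical stand-ins for
    whatever api/main.py actually computes.
    """
    words = prompt.split()
    n_chars = len(prompt)
    n_words = len(words)
    avg_word_len = (n_chars / n_words) if n_words else 0.0
    digit_frac = sum(c.isdigit() for c in prompt) / n_chars if n_chars else 0.0
    punct_count = sum(c in string.punctuation for c in prompt)
    # Scale everything into roughly comparable [0, 1] ranges.
    return [
        min(n_chars / 500.0, 1.0),
        min(n_words / 100.0, 1.0),
        min(avg_word_len / 10.0, 1.0),
        digit_frac,
        min(punct_count / 20.0, 1.0),
    ]
```

Whatever the real feature set is, its output must be a list of five floats, matching what `POST /route` expects.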
The top-level docker-compose.yaml defines and wires all these services together.
- Docker and Docker Compose installed.
- Sufficient memory to load the `EleutherAI/gpt-neo-125M` model (a few GB of RAM is recommended).
- Internet access on the first run so that the Hugging Face model can be downloaded.
```bash
git clone <your-repo-url> llmproject
cd llmproject
```

From the repository root:

```bash
docker compose up --build
```

This will:

- Build the `base-ml`, `api`, `router-service`, `model-worker`, and `retrainer` images.
- Start Redis, MLflow, and Prometheus.
- Create the `shared` volume for model artifacts.
Once all containers are healthy:
- API: `http://localhost:8000`
- MLflow UI: `http://localhost:5000`
- Prometheus UI: `http://localhost:9090`
Stop everything with:

```bash
docker compose down
```

The router uses a PyTorch model trained with the script `ml/train.py`.
When run (typically inside a container that mounts the `shared` volume), it will:

- Generate synthetic features, actions, and rewards.
- Train `RouterModel` on this data.
- Log the training loss to MLflow.
- Save the trained weights to `/shared/router.pt`.
Once `/shared/router.pt` exists, `router-service` will automatically load it at startup:

- If the file is present: `Loaded trained router model.`
- If missing: `No trained model found. Using randomly initialized model.`
Depending on your workflow, training can be triggered via:

- A one-off container based on the `base-ml` or `retrainer` image.
- A job that runs `python ml/train.py` with `/shared` mounted.
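The exact synthetic-data scheme used by `ml/train.py` is not documented here. As a sketch of what generating bandit-style `(features, action, reward)` triples can look like (the reward rule below is invented for illustration and is almost certainly not the one in the repo):

```python
import random


def make_synthetic_batch(n: int, n_features: int = 5, n_actions: int = 2,
                         seed: int = 0) -> list[tuple[list[float], int, float]]:
    """Generate (features, action, reward) triples for offline router training.

    The reward rule is a made-up stand-in: action 1 is 'correct' when the
    mean feature value exceeds 0.5, action 0 otherwise.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        features = [rng.random() for _ in range(n_features)]
        action = rng.randrange(n_actions)
        best = 1 if sum(features) / n_features > 0.5 else 0
        reward = 1.0 if action == best else 0.0
        batch.append((features, action, reward))
    return batch
```

The real script then fits `RouterModel` to such triples with PyTorch and logs the run to MLflow.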
- URL: `POST http://localhost:8000/query`
- Body: a JSON-encoded string or a simple form field (the current implementation expects `prompt: str` directly).

Example using curl with JSON:

```bash
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '"Tell me a story about space exploration."'
```

The API will:

- Extract simple numeric features from the prompt.
- Ask the router for an `action`.
- Ask the model worker for generated text.
- Push the chosen `action` onto the Redis list `logs`.
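The steps above can be sketched as one plain function. Here the three downstream dependencies are injected as callables so the sketch stays self-contained; the real handler in `api/main.py` presumably makes HTTP calls to the router and worker and uses a Redis client, so treat the names below as illustrative:

```python
from typing import Callable, Sequence


def handle_query(
    prompt: str,
    extract: Callable[[str], Sequence[float]],   # feature extraction
    route: Callable[[Sequence[float]], int],     # stands in for POST /route
    generate: Callable[[str], str],              # stands in for POST /generate
    log_action: Callable[[int], None],           # stands in for a Redis RPUSH to "logs"
) -> dict:
    """Orchestrate one /query request: features -> action -> text -> log."""
    features = extract(prompt)
    action = route(features)
    response = generate(prompt)
    log_action(action)
    return {"action": action, "response": response}
```

With stub callables wired in, the function returns exactly the `{"action": ..., "response": ...}` shape documented below.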
Response shape:

```json
{
  "action": <int>,
  "response": "<generated text>"
}
```

Router service (inside the Docker network):

- URL: `POST http://router-service:8002/route`
- Body:

```json
{
  "features": [0.1, 0.5, 0.3, 0.7, 0.2]
}
```

Response:

```json
{
  "action": 1
}
```
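The `thompson_sampling` policy in `router-service/policy.py` is not reproduced in this README, and the repo's version works against the PyTorch `RouterModel`. For reference, the classic Bernoulli form of Thompson sampling (a generic sketch with per-action Beta posteriors, not the repo's implementation) looks like:

```python
import random


class ThompsonSampler:
    """Bernoulli Thompson sampling over a fixed set of actions.

    Keeps a Beta(successes + 1, failures + 1) posterior per action and
    picks the action whose posterior sample is largest.
    """

    def __init__(self, n_actions: int) -> None:
        self.successes = [0] * n_actions
        self.failures = [0] * n_actions

    def choose(self) -> int:
        # One draw from each action's posterior; exploit the argmax.
        samples = [
            random.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, action: int, reward: float) -> None:
        # Binary feedback: any positive reward counts as a success.
        if reward > 0:
            self.successes[action] += 1
        else:
            self.failures[action] += 1
```

Exploration falls out of the posterior sampling: uncertain actions still get drawn occasionally, while consistently rewarded actions dominate over time.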
Model worker (inside the Docker network):

- URL: `POST http://model-worker:8001/generate`
- Body:

```json
{
  "prompt": "Tell me a story about space exploration."
}
```

Response:

```json
{
  "response": "Tell me a story about space exploration. ..."
}
```
- **Prometheus metrics**:
  - The API exports counters/histograms such as `api_requests` and `api_latency_seconds` on port `9000`.
  - The router exports `router_requests` on port `9001`.
  - The model worker exports `worker_requests` on port `9002`.
  - `monitoring/prometheus.yml` configures how Prometheus scrapes these endpoints.
- **MLflow**:
  - The training script logs:
    - `loss` per epoch.
    - The `router.pt` artifact.
  - Access the tracking UI at `http://localhost:5000` to inspect runs.
- Python dependencies for each service live in its respective `requirements.txt`:
  - `api/requirements.txt`
  - `router-service/requirements.txt`
  - `model-worker/requirements.txt`
  - `retrainer/requirements.txt`
- Each major service has its own `Dockerfile`.
- The `shared` volume is critical for moving trained model weights between the training and inference services.
Some ideas for extending this demo:
- Hook the router’s rewards to real user feedback or latency measurements instead of synthetic data.
- Add additional model workers and extend the router to handle more actions.
- Implement scheduled retraining jobs that periodically update `/shared/router.pt`.
- Harden input validation and error handling on public endpoints.