RPS Quest is a full-stack Rock-Paper-Scissors platform that demonstrates how to run an end-to-end MLOps system on a single Kubernetes node. The service plays full games to 10 points (usually 20-30 rounds), manages 12 production model aliases, and surfaces live metrics through Grafana Cloud. You can see it in action here.
This repository contains everything needed to get the system running (assuming you can host the required external services):
- FastAPI gameplay service,
- model definitions and hyperparameter configurations,
- model training jobs and automation setups,
- data handling and validation harnesses,
- Kubernetes manifests, and
- operational tooling required to reproduce the production environment.
- Gameplay API & engine (`app/`) - multi-round matches, policies, metrics, debug UI endpoints.
- Feature pipeline (`app/features.py`, `trainer/validation_utils.py`) - 50-feature contract shared by training and inference.
- Model training (`trainer/`, `scripts/run_config_sweep.py`) - three model families (feedforward NN, XGBoost, multinomial logistic) with Production/B/shadow aliases and auto-promotion.
- Storage - SQLite gameplay DB, MinIO cache for the 12 live models, and DagsHub MLflow for experiment tracking and artifact retention.
- Operations toolkit (`infra/k8s/`, `scripts/`, `ops/`) - manifests, deployment scripts, CronJobs, and Grafana dashboard automation.
For fuller details, read `docs/architecture.md` for a deep dive into the topology and data flow, and `docs/operations.md` for day-two runbooks.
| Layer | Purpose | Notes |
|---|---|---|
| API & Game Engine | FastAPI app (`app/main.py`, `app/routes/*`) serving `/start_game`, `/play`, `/metrics`, `/ui-lite`, etc. | Game loop enforces gambits (rounds 1-3) before ML predictions; use `get_model_manager()` for all inference calls. |
| Feature Contract | `app/features.py`, shared helpers in the trainer | Exactly 50 ordered features; index 49 remains the legacy easy-mode flag for backward compatibility. |
| Model Lifecycle | `trainer/base_model.py`, `trainer/train_*.py`, CronJob orchestrators | Training writes artifacts to MinIO, logs runs to MLflow, assigns aliases (Production/B/shadow1/shadow2), and calls `scripts/auto_promote_models.py` to evaluate swaps. |
| Storage | SQLite on `data-pvc`, MinIO (`mlflow-artifacts/` bucket), DagsHub MLflow | MinIO caches only the 12 aliased models with `.alias_<name>` markers; DagsHub preserves history. |
| Observability | `/metrics`, Grafana dashboard (`ops/grafana-dashboard.json`), promotion ledger API | Prometheus counters drive Grafana Cloud; `app/routes/promotion.py` exposes history for dashboards and audits. |
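The 50-feature contract in the table above can be guarded with a small validation helper shared by training and inference. The sketch below is illustrative only; the function and constant names are assumptions, not the actual `app/features.py` API:

```python
# Illustrative sketch of a 50-feature contract check; names here are
# hypothetical and do not mirror the real app/features.py API.
from typing import Sequence

FEATURE_COUNT = 50           # the contract: exactly 50 ordered features
LEGACY_EASY_MODE_INDEX = 49  # index 49 is reserved for the legacy easy-mode flag

def validate_feature_vector(features: Sequence[float]) -> list[float]:
    """Reject vectors that violate the shared training/inference contract."""
    if len(features) != FEATURE_COUNT:
        raise ValueError(f"expected {FEATURE_COUNT} features, got {len(features)}")
    if features[LEGACY_EASY_MODE_INDEX] not in (0.0, 1.0):
        raise ValueError("index 49 must be the 0/1 legacy easy-mode flag")
    return list(features)
```

Running the same check in both the trainer and the API is one way to keep the two sides of the pipeline from drifting apart silently.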
- Linux/macOS workstation with Git, Docker ≥24, kubectl, and Conda (or Mambaforge).
- Access to a single-node k3s cluster (4 GB RAM + 8 GB swap recommended). The production reference host is `65.21.151.52`.
- Credentials:
  - GHCR push rights for `ghcr.io/jimmyrisk/rps-*` images.
  - DagsHub MLflow token (set via `MLFLOW_TRACKING_TOKEN` or config file).
  - Grafana Cloud API key stored as `GRAFANA_API_KEY`, `ops/.grafana_api_key`, or `~/.config/rps/grafana_api_key` for dashboard deployments.
  - MinIO access/secret keys for the in-cluster object store (`infra/k8s/minio/02-secret.yaml`).
- DNS/TLS provisioning for your ingress host (update `infra/k8s/12-app-ingress.yaml`). Cloudflare is optional but recommended for HTTPS termination.
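As a convenience (this helper is not part of the repository), the workstation tooling prerequisites can be checked up front before starting the quick start:

```python
# Convenience sketch, not part of the repository: report which required
# CLI tools are missing from PATH before attempting the quick start.
import shutil

REQUIRED_TOOLS = ["git", "docker", "kubectl", "conda"]

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

if __name__ == "__main__":
    gaps = missing_tools()
    print("all prerequisites found" if not gaps else f"missing: {gaps}")
```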
1. **Clone and set up Python tooling**

   ```bash
   git clone https://github.com/jimmyrisk/rps.git
   cd rps
   conda env create -f environment.yml
   conda activate rps
   ```

2. **Provision k3s and configure kubectl**
   - Install k3s on the target host (`curl -sfL https://get.k3s.io | sh -`).
   - Copy `/etc/rancher/k3s/k3s.yaml` to your workstation as `~/.kube/config` and update the server address to the cluster's public IP or domain.
3. **Prepare Kubernetes secrets**
   - Copy `infra/k8s/01-secrets.example.yaml` to `infra/k8s/01-secrets.yaml` and fill in values for:
     - `MLFLOW_TRACKING_USERNAME` / `MLFLOW_TRACKING_PASSWORD`
     - `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY`
     - `PROMETHEUS_BEARER_TOKEN` (Grafana Alloy scrape)
     - any other environment variables referenced by `app/config.py`.
   - Copy `infra/k8s/minio/02-secret.example.yaml` to `infra/k8s/minio/02-secret.yaml` and populate matching MinIO credentials.
   - Store TLS certificates or Cloudflare tokens as required by your ingress controller.
4. **Create namespace and persistent volumes**

   ```bash
   kubectl apply -f infra/k8s/00-namespace.yaml
   kubectl apply -f infra/k8s/02-pvc.yaml
   kubectl apply -f infra/k8s/05-model-storage-pvc.yaml
   kubectl apply -f infra/k8s/minio/01-pvc.yaml
   ```
5. **Deploy MinIO object storage**

   ```bash
   kubectl apply -f infra/k8s/minio/02-secret.yaml
   kubectl apply -f infra/k8s/minio/03-deployment.yaml
   kubectl apply -f infra/k8s/minio/04-service.yaml
   kubectl apply -f infra/k8s/minio/05-setup-job.yaml
   kubectl -n mlops-poc logs job/minio-setup --tail=40
   ```

   The setup job provisions the `mlflow-artifacts` bucket and applies lifecycle rules for alias markers.
6. **Build and push container images**

   ```bash
   ./scripts/build_push_deploy.sh --tag latest --push --no-deploy
   ```

   This builds `Dockerfile.app`, `Dockerfile.ui`, and `Dockerfile.trainer` and publishes them to GHCR (`latest` tag by default). Supply `--registry` or `--tag` overrides as needed.
7. **Deploy the gameplay API and UI**

   ```bash
   kubectl apply -f infra/k8s/10-rps-app.yaml
   kubectl apply -f infra/k8s/11-app-service.yaml
   kubectl apply -f infra/k8s/12-app-ingress.yaml
   kubectl apply -f infra/k8s/30-rps-ui.yaml
   ```

   Update `ASSET_VERSION` in the deployment manifest whenever you change files under `app/static/js/`.
8. **Enable scheduled training and telemetry**

   ```bash
   kubectl apply -f infra/k8s/20-trainer-cronjob.yaml
   kubectl apply -f infra/k8s/21-individual-trainers.yaml
   kubectl apply -f infra/k8s/22-legacy-gameplay-cronjob.yaml
   ```

   The CronJobs retrain aliases sequentially and run legacy-vs-ML matches every 30 minutes for telemetry.
9. **Validate the deployment**

   ```bash
   kubectl get pods -n mlops-poc
   curl https://<your-host>/healthz
   curl https://<your-host>/metrics | grep rps_model_predictions_by_alias_total
   ./scripts/verify_current_state.sh
   ```

   When the app pod starts for the first time, it may take a few seconds to lazily load all 12 models from MinIO.
10. **Wire observability**
    - Run `./ops/deploy_clean_dashboard.sh` to publish `ops/grafana-dashboard.json` to Grafana Cloud. The script reads the API key from the locations listed in "Prerequisites".
    - Confirm the JSON data source `rps-promotion-ledger` points to `https://<your-host>` without path suffixes so relative queries resolve correctly.
At this point the platform mirrors production: gameplay endpoints, model training pipelines, MinIO cache, MLflow tracking, and dashboards are operational.
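Because the first pod start lazily loads all 12 models, validation scripts often poll `/healthz` rather than check it once. A minimal readiness poll might look like the sketch below; the endpoint is real, but the helper itself is illustrative, with the HTTP call injected so you can plug in `requests`, `httpx`, or anything else:

```python
# Illustrative readiness poll for the /healthz endpoint. The HTTP call
# is injected as `fetch_status` (a callable returning an HTTP status
# code) so this sketch stays transport-agnostic.
import time

def wait_for_healthy(fetch_status, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll until fetch_status() returns HTTP 200 or attempts run out."""
    for _ in range(attempts):
        if fetch_status() == 200:
            return True
        sleep(delay)  # give the pod time to lazily load the 12 models
    return False
```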
- Activate the Conda environment (`conda activate rps`).
- Run the fast safety net: `python tests/test_essential.py` (~5 s).
- Start the API locally if you need interactive debugging: `uvicorn app.main:app --host 0.0.0.0 --port 8080 --reload`.
- Populate metrics or seed dashboards with the validator: `python scripts/validate.py --games 500 --batch 50`.
- Extended validation options:
  - `python tests/test_phased_validation_v2.py --all --verbose`
  - `python scripts/capture_cal_forrest_dataset.py --games 10 --strict-parity --sleep 0.1`
  - `python scripts/verify_legacy_policies.py`
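The tests above exercise the real game engine; as a self-contained reference, the "full games to 10 points" rule can be expressed in a few lines of pure Python. This is an illustrative re-statement of the scoring rule, not the actual `app/` engine code:

```python
# Illustrative scoring loop (not the actual app/ game engine): play a
# match until one side reaches 10 points, per the "full games to 10
# points" rule. Ties score no points for either side.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def play_match(rounds, target=10):
    """rounds: iterable of (player_move, bot_move) pairs.
    Returns the final (player_score, bot_score)."""
    player = bot = 0
    for player_move, bot_move in rounds:
        if BEATS[player_move] == bot_move:
            player += 1
        elif BEATS[bot_move] == player_move:
            bot += 1
        if player >= target or bot >= target:
            break
    return player, bot
```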
- Update code and commit changes.
- Build/push new images: `./scripts/build_push_deploy.sh --tag <tag> --push --deploy` (defaults to `latest`).
- If you touched `app/static/js/`, increment `ASSET_VERSION` in `infra/k8s/10-rps-app.yaml` (or run `kubectl -n mlops-poc set env deployment/rps-app ASSET_VERSION=<ts>`).
- Verify rollout: `kubectl -n mlops-poc rollout status deployment/rps-app` and run `./scripts/verify_current_state.sh`.
- Verify that no ConfigMap overlays remain from previous workflows (`./scripts/clear_configmap_overlays.sh --dry-run`); always ship fixes in a rebuilt image.
- Trigger the trainer CronJob manually: `kubectl create job train-manual-$(date +%s) --from=cronjob/rps-trainer -n mlops-poc`
- Run the JSON-driven sweep orchestrator locally: `python trainer/train_all_models.py --model-type feedforward`.
- Refresh existing aliases with continuation training: `python trainer/train_all_aliases.py --aliases Production B`.
- Resync artifacts to MinIO (cleans stale runs): `python scripts/sync_promoted_models_to_minio.py --clean`.
- Reload models in the app pod after external changes: `curl -X POST https://<your-host>/models/reload`.
- The training dataset uses a single cutoff date exposed as `TRAINING_DATA_SINCE_DATE` (default `2025-10-01T00:00:00Z`); there is no rolling seven-day limit.
- `scripts/auto_promote_models.py` evaluates Production vs B win rates (minimum three games per alias) and reorders challengers by action accuracy. Promotions are persisted through `/internal/promotion/report` and MinIO alias markers.
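The promotion rule described above can be sketched as a pair of pure functions. This is a simplified illustration of the policy, not the actual `scripts/auto_promote_models.py` implementation:

```python
# Simplified sketch of the auto-promotion policy (not the real
# scripts/auto_promote_models.py): swap B into Production only when both
# aliases have enough games and B's win rate is strictly higher.
def should_promote_b(prod_wins, prod_games, b_wins, b_games, min_games=3):
    """Return True when the B alias should replace Production."""
    if prod_games < min_games or b_games < min_games:
        return False  # not enough evidence for either alias
    return (b_wins / b_games) > (prod_wins / prod_games)

def rank_challengers(challengers):
    """Order shadow candidates by action accuracy, best first."""
    return sorted(challengers, key=lambda c: c["action_accuracy"], reverse=True)
```

The minimum-games guard is what keeps a lucky two-game streak from flipping Production.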
- Health: `curl https://<your-host>/healthz`
- Metrics: `curl https://<your-host>/metrics | grep rps_`
- Pod logs: `kubectl -n mlops-poc logs -f deploy/rps-app --tail=100`
- Training counters: `curl -s https://<your-host>/metrics | grep rps_training_completed_total`
- Promotion ledger: `curl -s https://<your-host>/internal/promotion/history?limit=5`
- Enable inference capture for parity checks:

  ```bash
  kubectl set env deployment/rps-app -n mlops-poc \
    RPS_CAPTURE_INFERENCE_DIR=/tmp/rps_inference \
    RPS_CAPTURE_INFERENCE_SAMPLE_RATE=0.05
  kubectl rollout status deployment/rps-app -n mlops-poc
  ```

  Disable by removing the environment variables when finished.
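When grepping `/metrics` isn't enough, the Prometheus text exposition format is simple to parse for spot checks. The helper below is an illustrative sketch (the metric name is from this project's catalogue, but the function is not part of the repo):

```python
# Illustrative parser for the Prometheus text exposition format, for
# spot-checking a scraped /metrics payload. Not part of the repository.
def parse_metric(metrics_text, name):
    """Return {series_string: value} for every sample of metric `name`."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # skip HELP/TYPE comments and unrelated series
        if line.startswith("#") or not line.startswith(name):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples
```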
- Gameplay data lives in SQLite on `data-pvc` (`/data/rps.db`).
- Only 12 models (3 families × 4 aliases) should be present in MinIO; alias markers `.alias_<name>` encode version metadata.
- DagsHub MLflow retains the full experiment history and artifacts.
- The feature contract is immutable without retraining all models; ensure any schema changes flow through both `app/features.py` and the trainer modules.
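The 3-families × 4-aliases invariant can be audited mechanically against a MinIO key listing. The sketch below is illustrative: the key layout and the family names are assumptions, not the bucket's documented structure (only the alias names come from this README):

```python
# Illustrative audit of the MinIO cache invariant (3 families x 4
# aliases = 12 models). Key layout and family names are assumptions.
FAMILIES = {"feedforward", "xgboost", "logistic"}
ALIASES = {"Production", "B", "shadow1", "shadow2"}

def audit_alias_markers(marker_keys):
    """Given marker keys like 'feedforward/.alias_Production', return
    the set of (family, alias) pairs with no marker present."""
    expected = {(f, a) for f in FAMILIES for a in ALIASES}
    present = set()
    for key in marker_keys:
        family, _, marker = key.partition("/")
        if marker.startswith(".alias_"):
            present.add((family, marker[len(".alias_"):]))
    return expected - present
```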
| Path | Description |
|---|---|
| `app/` | FastAPI service, game engine, feature pipeline, metrics helpers, promotion ledger routes |
| `trainer/` | Base model class, model-specific trainers, orchestrators, validation utilities |
| `scripts/` | Deployment helpers, model sync, validation harnesses, promotion tooling |
| `infra/k8s/` | Kubernetes manifests for namespace, PVCs, MinIO, app, UI, CronJobs |
| `docs/` | Architecture, operations, API reference, metrics catalogue, audit logs |
| `ui/` | Debug UI (served behind `/ui-lite` and `/ui-lite-debug`) |
| `tests/` | Smoke tests, endpoint contracts, end-to-end validation suites |
| `ops/` | Grafana dashboard JSON and deployment helpers |
- `docs/architecture.md` - component topology, data flow, storage strategy.
- `docs/operations.md` - deployment, training, troubleshooting, and Grafana workflows.
- `docs/metrics.md` - Prometheus series and dashboard mapping (includes disable instructions if you don't need telemetry).
- `docs/API_REFERENCE.md` - endpoint catalogue.
- `.github/copilot-instructions.md` - agent onboarding, sacred invariants, and coding conventions.
Legacy documentation resides in `docs/archive/`. Session-specific notes live under `session_logs/`.