ArtPulse is an end-to-end MLOps portfolio project that demonstrates how to move AI/ML models from training to real operations.
Core capabilities:
- Real image feature extraction + model training
- MLflow experiment tracking and Model Registry alias management
- Secure FastAPI inference API (API key/JWT + rate limiting)
- Controlled model rollout (`single`, `canary`, `blue_green`)
- Prometheus-compatible metrics + Grafana dashboards
- Drift monitoring + retraining automation
- OIDC-protected admin surface (`/admin`) + audit logging hooks
- Docker + Kubernetes deployment
- Frontend operations panel (`/admin`) for live status
Docs & live links:
- Turkish README: `docs/README_TR.md`
- License: `LICENSE` (Proprietary, All Rights Reserved)
- Production API: https://api.wearetheartmakers.com
- Swagger: https://api.wearetheartmakers.com/docs
- Ops Panel: https://api.wearetheartmakers.com/admin
Representative KPI targets for client-facing demos:
| KPI | Manual Baseline | ArtPulse Pipeline | Measurement Source |
|---|---|---|---|
| P95 inference latency | 180-250 ms | 45-90 ms | /metrics + Grafana latency panel |
| Accuracy | 0.86-0.90 | 0.93-0.95 | MLflow metrics.accuracy |
| Macro F1 | 0.83-0.88 | 0.91-0.93 | MLflow metrics.f1_macro |
| Deploy lead time | 45-90 min | 8-15 min | CI/CD workflow duration |
| Rollback time | 15-30 min | 1-3 min | Alias rollback + rollout restart |
These are benchmark targets for this reference architecture. Final numbers depend on infra scale and workload profile.
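The latency KPI can be read straight from Prometheus rather than eyeballed in Grafana. A minimal sketch, assuming Prometheus at `http://localhost:9090` and a request-duration histogram named `http_request_duration_seconds` (the metric name is an assumption; check `/metrics` for the name the API actually exports):

```python
import requests

# Hypothetical metric name -- check /metrics for the real one.
QUERY = (
    'histogram_quantile(0.95, '
    'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
)

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # Prometheus HTTP API
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    p95_seconds = float(result[0]["value"][1])
    print(f"P95 latency: {p95_seconds * 1000:.1f} ms")
```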
```bash
make install
make demo-image
export ARTPULSE_API_KEY="replace-with-strong-key"
make serve ARTPULSE_API_KEY="$ARTPULSE_API_KEY"
```

Train with an image dataset and enforced pre-training checks:
```bash
make train-images DATASET_DIR=data/images
```

Quality controls (Great Expectations-style gate):
- feature schema validation
- finite value check (`NaN`/`inf` rejection)
- normalized feature range check (`0..1`)
- minimum sample + minimum per-label checks
- class imbalance warning
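A minimal sketch of what such a gate can look like (illustrative only; the real checks live in `src/data_quality.py`, and the column names and thresholds below are assumptions):

```python
import numpy as np

EXPECTED_COLUMNS = ["hue_mean", "sat_mean", "val_mean", "contrast", "edges"]
MIN_SAMPLES, MIN_PER_LABEL, IMBALANCE_RATIO = 100, 10, 10.0  # assumed thresholds

def quality_gate(X: np.ndarray, y: np.ndarray, columns: list[str]) -> list[str]:
    """Return a list of gate failures; an empty list means the gate passes."""
    failures = []
    if columns != EXPECTED_COLUMNS:                     # feature schema validation
        failures.append(f"schema mismatch: {columns}")
    if not np.isfinite(X).all():                        # NaN/inf rejection
        failures.append("non-finite values in features")
    if X.min() < 0.0 or X.max() > 1.0:                  # normalized 0..1 range check
        failures.append("features outside [0, 1]")
    if len(X) < MIN_SAMPLES:                            # minimum sample check
        failures.append(f"too few samples: {len(X)}")
    labels, counts = np.unique(y, return_counts=True)
    if counts.min() < MIN_PER_LABEL:                    # minimum per-label check
        failures.append("a label is below the minimum sample count")
    if counts.max() / counts.min() > IMBALANCE_RATIO:   # class imbalance warning
        print("warning: class imbalance", dict(zip(labels.tolist(), counts.tolist())))
    return failures
```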
Artifacts:
- `artifacts/training_summary.json`
- `artifacts/data_quality_report.json`
- `artifacts/baseline_feature_stats.json`
To bypass quality checks (not recommended in production):

```bash
.venv/bin/python -m src.train --skip-quality-checks
```

Runtime traffic routing is controlled by environment variables:
- `ROLLOUT_MODE=single|canary|blue_green`
- `CANARY_TRAFFIC_PERCENT` (0-100)
- `CANDIDATE_MODEL_ALIAS` (default `challenger`)
- `ACTIVE_COLOR=blue|green`
- `BLUE_GREEN_TRAFFIC_PERCENT` (0-100)
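For intuition, the routing decision reduces to a weighted coin flip per request. A minimal sketch, assuming two loaded model variants; the actual logic lives in `src/serve.py` and may differ (for example, hashing a request ID instead of `random` would give sticky routing):

```python
import os
import random

ROLLOUT_MODE = os.getenv("ROLLOUT_MODE", "single")
CANARY_PCT = float(os.getenv("CANARY_TRAFFIC_PERCENT", "0"))
BLUE_GREEN_PCT = float(os.getenv("BLUE_GREEN_TRAFFIC_PERCENT", "0"))

def pick_model() -> str:
    """Return which model variant should serve the current request."""
    if ROLLOUT_MODE == "canary" and random.random() * 100 < CANARY_PCT:
        return "candidate"   # the alias named by CANDIDATE_MODEL_ALIAS
    if ROLLOUT_MODE == "blue_green" and random.random() * 100 < BLUE_GREEN_PCT:
        return "secondary"   # the color opposite ACTIVE_COLOR
    return "primary"
```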
Example canary run (10% traffic to challenger):

```bash
make serve \
  ARTPULSE_API_KEY="$ARTPULSE_API_KEY" \
  ROLLOUT_MODE=canary \
  CANARY_TRAFFIC_PERCENT=10 \
  CANDIDATE_MODEL_ALIAS=challenger
```

Kubernetes rollout shift helper:

```bash
./scripts/rollout_shift.sh artpulse artpulse canary 25 blue
```

Start the API:

```bash
make serve ARTPULSE_API_KEY="$ARTPULSE_API_KEY"
```

Endpoints:
- `GET /` (live welcome + browser-based "Try it" demo flow)
- `GET /health`
- `GET /ready`
- `GET /metrics`
- `GET /admin` (frontend ops panel)
- `POST /demo/token` (auth, mints a short-lived demo token)
- `GET /demo/status` (public demo, bearer token)
- `POST /demo/predict` (public demo, bearer token)
- `POST /reload-model` (auth)
- `GET /metadata` (auth)
- `GET /ops/summary` (auth)
- `POST /predict` (auth)
- `POST /predict-image` (auth)
API key request example:

```bash
curl -sS -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -H "x-api-key: ${ARTPULSE_API_KEY}" \
  -d '{"rows":[{"hue_mean":0.62,"sat_mean":0.55,"val_mean":0.70,"contrast":0.40,"edges":0.12}]}'
```

JWT mode (HS256) is also supported by setting `JWT_SECRET`.
Latency histogram precision can be tuned without code changes:
```bash
HTTP_LATENCY_BUCKETS_SECONDS=0.001,0.0025,0.005,0.01,0.02,0.03,0.05,0.075,0.1,0.15,0.2,0.3,0.5,1.0,2.0
```
Public demo mode is controlled separately:
- `DEMO_ENABLED=true|false`
- `DEMO_JWK_CURRENT_KID=<active-key-id>`
- `DEMO_JWK_KEYS_JSON=<json-map-of-kid-to-secret>`
- `DEMO_TOKEN_ISSUER=<token-issuer>`
- `DEMO_TOKEN_AUDIENCE=<token-audience>`
- `DEMO_TOKEN_TTL_SECONDS=<seconds>`
- `DEMO_RATE_LIMIT_REQUESTS`
- `DEMO_RATE_LIMIT_WINDOW_SECONDS`
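To see how the `kid`-to-secret map is meant to work, here is a hedged verification sketch with `PyJWT`. HS256 and the exact claim checks are assumptions; the real logic is in `src/serve.py`:

```python
import json
import os

import jwt  # pip install PyJWT

keys = json.loads(os.environ["DEMO_JWK_KEYS_JSON"])  # e.g. {"kid-1": "secret-1"}

def verify_demo_token(token: str) -> dict:
    """Resolve the signing secret via the token's kid header, then verify."""
    kid = jwt.get_unverified_header(token)["kid"]
    return jwt.decode(
        token,
        keys[kid],
        algorithms=["HS256"],  # assumed algorithm
        issuer=os.environ["DEMO_TOKEN_ISSUER"],
        audience=os.environ["DEMO_TOKEN_AUDIENCE"],
    )
```

Rotating keys then means adding a new `kid` to the map, pointing `DEMO_JWK_CURRENT_KID` at it, and dropping the old entry once outstanding tokens expire.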
Corporate login mode for ops endpoints (`/ops/summary`) is available via trusted OIDC headers:

- `OPS_OIDC_TRUST_HEADERS=true|false`
- `OPS_OIDC_ALLOWED_EMAIL_DOMAINS=wearetheartmakers.com,partner.com` (optional allowlist)

This mode is designed for ingress + oauth2-proxy (`x-auth-request-email` / `x-auth-request-user`).
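A minimal FastAPI dependency sketch of the trusted-header check (illustrative; the shipped implementation in `src/serve.py` may differ):

```python
import os

from fastapi import Header, HTTPException

ALLOWED = {
    d.strip().lower()
    for d in os.getenv("OPS_OIDC_ALLOWED_EMAIL_DOMAINS", "").split(",")
    if d.strip()
}

def require_ops_user(x_auth_request_email: str | None = Header(default=None)) -> str:
    """Trust the identity header injected by oauth2-proxy at the ingress."""
    if os.getenv("OPS_OIDC_TRUST_HEADERS") != "true" or not x_auth_request_email:
        raise HTTPException(status_code=401, detail="missing trusted identity")
    domain = x_auth_request_email.rsplit("@", 1)[-1].lower()
    if ALLOWED and domain not in ALLOWED:
        raise HTTPException(status_code=403, detail="domain not allowed")
    return x_auth_request_email
```

This only makes sense when the ingress strips any client-supplied `x-auth-request-*` headers; otherwise the header can be spoofed.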
Ops dashboard URLs:
- Local: http://localhost:8000/admin
- Production: https://api.wearetheartmakers.com/admin
Panel shows:
- primary/secondary model URIs
- alias versions (champion/challenger + blue/green when enabled)
- rollout mode and traffic percentage
- drift score and event count
- request count, error rate, global p95 latency, inference p95 latency
- drift trend direction and delta (from `artifacts/drift_history.jsonl`)
- last retrain status + quality gate status
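The drift trend is derived from the history file. A minimal sketch, assuming each JSONL record carries a `drift_score` field (the field name is an assumption):

```python
import json
from pathlib import Path

records = [
    json.loads(line)
    for line in Path("artifacts/drift_history.jsonl").read_text().splitlines()
    if line.strip()
]
if len(records) >= 2:
    delta = records[-1]["drift_score"] - records[-2]["drift_score"]
    direction = "up" if delta > 0 else "down" if delta < 0 else "flat"
    print(f"drift trend: {direction} (delta={delta:+.4f})")
```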
Non-technical demo flow:
- Open https://api.wearetheartmakers.com/
- Use the "Try it (No CLI)" card
- Mint a token -> check status -> predict directly in the browser
- Landing page base URL is proxy-aware (`x-forwarded-proto`/`x-forwarded-host`) to avoid `http` mismatches behind ingress.
Start the stack:

```bash
make monitoring-up
```

Services:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin/admin)
- API metrics: http://localhost:8000/metrics
Dashboard file: `observability/grafana/dashboards/artpulse-api-dashboard.json`
Stop the stack:

```bash
make monitoring-down
```

Run the throughput/latency benchmark:

```bash
make loadtest K6_API_URL="http://host.docker.internal:8000" ARTPULSE_API_KEY="$ARTPULSE_API_KEY"
make loadtest-report
```

Outputs:
- `loadtest/k6/k6-summary.json`
- `artifacts/loadtest_report.md`
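A hedged sketch of pulling the headline latency out of that summary (the exact key layout depends on how k6 exports the summary, so treat the paths below as assumptions; `src/loadtest_report.py` holds the real parsing):

```python
import json
from pathlib import Path

summary = json.loads(Path("loadtest/k6/k6-summary.json").read_text())
duration = summary["metrics"]["http_req_duration"]
# Older --summary-export files keep percentiles at the top level;
# handleSummary output nests them under a "values" key.
values = duration.get("values", duration)
print(f"p95: {values['p(95)']:.1f} ms")
```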
Generate a one-page business impact proof:

```bash
make business-impact-report \
  PUBLIC_API_URL="https://api.wearetheartmakers.com" \
  DEPLOY_LEAD_TIME_MIN="12" \
  ROLLBACK_TIME_MIN="2"
```

Output:
- `artifacts/business_impact_onepager.md`
- Details: `docs/BUSINESS_IMPACT_REPORT.md`
```bash
make monitor-drift
make retrain
```

Outputs:
- `artifacts/drift_report.json`
- `artifacts/drift_history.jsonl`
- new MLflow runs + optional challenger registration
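For a concrete picture of what a drift score can be, here is a hedged sketch using a per-feature population stability index (PSI) against saved baseline stats. PSI is one common choice, not necessarily what `src/monitor_drift.py` implements, and the synthetic data below is purely illustrative:

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two samples of one 0..1 feature."""
    edges = np.histogram_bin_edges(baseline, bins=bins, range=(0.0, 1.0))
    b, _ = np.histogram(baseline, bins=edges)
    c, _ = np.histogram(current, bins=edges)
    b = np.clip(b / b.sum(), 1e-6, None)  # avoid log(0)
    c = np.clip(c / c.sum(), 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))

# Drift score as the mean PSI across features; > 0.2 is a common alert heuristic.
rng = np.random.default_rng(0)
baseline = rng.uniform(0, 1, size=(500, 5))
current = np.clip(baseline + rng.normal(0.05, 0.02, baseline.shape), 0, 1)
score = float(np.mean([psi(baseline[:, i], current[:, i]) for i in range(5)]))
print(f"drift score: {score:.3f}")
```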
Workflows in `.github/workflows/`:
- `ci.yml`
- `build-image.yml`
- `retrain.yml`
- `deploy-promotion.yml`

`deploy-promotion.yml` now supports rollout inputs:
- `rollout_strategy` (`single`, `canary`, `blue_green`)
- `rollout_secondary_percent`
- `candidate_alias`
- `active_color`
- `run_quality_gate` + gate thresholds (`gate_p95_ms`, `gate_error_rate_max`, `gate_drift_score_max`)
When the gate fails, the workflow automatically executes `kubectl rollout undo` and marks the deployment as failed.
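The gate itself reduces to a threshold comparison. A Python sketch of the shape of that check (metric values and their source are assumptions; the real gate runs inside the workflow):

```python
import sys

def quality_gate(observed: dict, gate_p95_ms: float, gate_error_rate_max: float,
                 gate_drift_score_max: float) -> bool:
    """Return True when every observed metric is within its gate threshold."""
    checks = {
        "p95_ms": observed["p95_ms"] <= gate_p95_ms,
        "error_rate": observed["error_rate"] <= gate_error_rate_max,
        "drift_score": observed["drift_score"] <= gate_drift_score_max,
    }
    for name, ok in checks.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'} ({observed[name]})")
    return all(checks.values())

observed = {"p95_ms": 72.0, "error_rate": 0.002, "drift_score": 0.08}  # example values
if not quality_gate(observed, gate_p95_ms=90, gate_error_rate_max=0.01,
                    gate_drift_score_max=0.2):
    sys.exit(1)  # non-zero exit lets the workflow trigger `kubectl rollout undo`
```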
Manual promotion + rollback drill runbook: `docs/PROMOTION_ROLLBACK_TEST.md`

Security setup guide: `docs/GITHUB_SECURE_SETUP.md`
Docker:

```bash
make docker-build
make docker-run
```

Kubernetes:

```bash
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secret.example.yaml  # edit values first
make k8s-apply
make k8s-apply-public-staging
make k8s-apply-public-production
```

Public URL + TLS blueprint:
- `docs/PUBLIC_URL_TLS_BLUEPRINT.md`
- `k8s/cert-manager-clusterissuer.yaml`
- `k8s/ingress-staging.yaml`
- `k8s/ingress-production.yaml`
- `k8s/ingress-demo-staging.yaml`
- `k8s/ingress-demo-production.yaml`
- `k8s/ingress-oauth2-staging.yaml`
- `k8s/ingress-oauth2-production.yaml`
- `k8s/network-policy.yaml`
Local staging baseline:
- API: http://localhost:8000
- Ops panel: http://localhost:8000/admin

Demo assets:
- `docs/STAGING_DEMO.md`
- `docs/LIVE_DEMO_SCRIPT.md`
- `examples/staging_predict_request.json`
- `examples/staging_predict_response.json`
- `examples/public_demo_request.json`
- `examples/public_demo_response.json`
Run the staging demo request:

```bash
make staging-demo API_URL="http://localhost:8000" ARTPULSE_API_KEY="$ARTPULSE_API_KEY"
```

Run the public demo endpoint locally:

```bash
make serve ARTPULSE_API_KEY="$ARTPULSE_API_KEY" DEMO_ENABLED=true ARTPULSE_DEMO_KEY="replace-signing-secret"
make demo-token API_URL="http://localhost:8000" ARTPULSE_API_KEY="$ARTPULSE_API_KEY" ARTPULSE_DEMO_SUBJECT="portfolio-client"
make demo-status API_URL="http://localhost:8000" ARTPULSE_API_KEY="$ARTPULSE_API_KEY" ARTPULSE_DEMO_SUBJECT="portfolio-client"
make demo-predict API_URL="http://localhost:8000" ARTPULSE_API_KEY="$ARTPULSE_API_KEY" ARTPULSE_DEMO_SUBJECT="portfolio-client"
```

TLS/public smoke check helper:

```bash
make check-public-tls PUBLIC_API_URL="https://api.staging.<your-domain>" ARTPULSE_API_KEY="$ARTPULSE_API_KEY" ARTPULSE_DEMO_SUBJECT="portfolio-client"
```

Use separate hostnames and a manual promotion gate:
- Staging API: https://api.staging.wearetheartmakers.com
- Production API: https://api.wearetheartmakers.com
Recommended rollout flow:
- Deploy the latest build to `staging` (api.staging.wearetheartmakers.com).
- Run staging smoke tests (`/health`, `/ready`, `/demo/token`, `/demo/predict`, `/admin`).
- Run the load test + quality gate check in staging (latency/error/drift thresholds).
- Trigger `.github/workflows/deploy-promotion.yml` with `environment=production` and `run_quality_gate=true`.
- Promote only when the gate passes; otherwise the workflow auto-rolls back.
Operator checklist (what you do manually):
- Ensure staging custom domain is mapped in Northflank Networking.
- Keep staging/prod API keys and demo JWK secrets different.
- Run one full demo flow in staging before production promotion.
- Approve production promotion manually from workflow dispatch.
Set an external uptime monitor for production:
- Create a monitor with URL https://api.wearetheartmakers.com/health.
- Set the check interval to 1 minute.
- Set the incident threshold to at least 2 consecutive failures.
- Enable alert channels:
  - Email (ops inbox)
  - Slack webhook/channel
- Add a second monitor for https://api.wearetheartmakers.com/ready (optional but recommended).
Any provider works (Better Stack, UptimeRobot, Freshping, Pingdom). Keep checks HTTPS-only.
Recommended alert policy:
- Warning: 2 failed checks (about 2 min)
- Critical: 5 failed checks (about 5 min)
- Notification channels: Email + Slack webhook + on-call backup email
Detailed setup runbook:
`docs/UPTIME_ALERTS_SETUP.md`
- `docs/runbooks/INCIDENT_RUNBOOK.md`
- `docs/runbooks/ROLLBACK_RUNBOOK.md`
- `docs/runbooks/ONCALL_RUNBOOK.md`
- `docs/runbooks/ESCALATION_MATRIX.md`
| Package | Scope | SLA |
|---|---|---|
| Starter | Single model pipeline, secure API, baseline monitoring, monthly support | 99.0% uptime, business-hours support |
| Growth | Staging+prod, canary/blue-green, drift automation, governance | 99.5% uptime, 8x5 support, P1 <= 1h |
| Enterprise | 24/7 NOC/SRE, compliance controls, incident drills, custom ops | 99.9% uptime, 24/7 support, P1 <= 15m |
Detailed package documents:
- `docs/COMMERCIAL_PACKAGES.md`
- `docs/COMMERCIAL_OFFERING.md`
Use this checklist as the next implementation pack.
Goal: move from registry-disabled to real champion/challenger version management.
Set in staging and production runtime env:
```bash
MLFLOW_TRACKING_URI=https://mlflow.<your-domain>
MLFLOW_REGISTRY_URI=https://mlflow.<your-domain>
USE_MODEL_REGISTRY_ALIAS=true
MODEL_NAME=artpulse-classifier
MODEL_ALIAS=champion
CANDIDATE_MODEL_ALIAS=challenger
```

Acceptance criteria:
- `/health` shows `model_registry.enabled=true`.
- `/ops/summary` shows alias versions as numeric versions (not `registry-disabled`).
- `champion`/`challenger` can be promoted without changing app code.
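A minimal sketch of what alias-based loading and promotion look like with the MLflow client (the `<your-domain>` placeholder is kept from above, and the model name follows `MODEL_NAME`):

```python
import mlflow
from mlflow import MlflowClient

mlflow.set_tracking_uri("https://mlflow.<your-domain>")  # placeholder, as above

# Serving side: resolve the model through its alias, never a hard-coded version.
model = mlflow.pyfunc.load_model("models:/artpulse-classifier@champion")

# Promotion side: point the champion alias at the validated challenger version.
client = MlflowClient()
challenger = client.get_model_version_by_alias("artpulse-classifier", "challenger")
client.set_registered_model_alias("artpulse-classifier", "champion", challenger.version)
```

Because serving only ever sees the alias, promotion and rollback are single registry calls with no app redeploy.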
Goal: corporate login for `/admin` and `/ops/summary` (instead of a shared API key only).
Set in runtime env (staging first):
```bash
OPS_OIDC_TRUST_HEADERS=true
OPS_OIDC_ALLOWED_EMAIL_DOMAINS=wearetheartmakers.com
AUTH_REQUIRED=true
```

Ingress/auth gateway requirements:
- Add `oauth2-proxy` or an equivalent auth layer in front of `/admin` and `/ops/*`.
- Forward the trusted identity header (`X-Auth-Request-Email`) to ArtPulse.
- Block direct public bypass of protected routes.
Acceptance criteria:
- Allowed corporate emails can open `/admin` without manually pasting an API key.
- Non-allowed emails are rejected.
- Access events appear in SIEM audit logs.
Goal: detect incidents before clients report them.
Minimum monitoring baseline:
- Monitor https://api.wearetheartmakers.com/health every 1 minute.
- Monitor https://api.wearetheartmakers.com/ready every 1 minute.
- Warning at 2 consecutive failures, critical at 5.
- Alerts to Email + Slack (on-call channel).
- Repeat for staging with a lower-severity policy.
Reference runbook:
`docs/UPTIME_ALERTS_SETUP.md`
Goal: improve p95 precision (avoid over-coarse p95 values).
Update the latency buckets in `src/serve.py` (`HTTP_LATENCY_BUCKETS`) to denser values around 5-300 ms, for example:
```python
HTTP_LATENCY_BUCKETS = (
    0.001, 0.0025, 0.005, 0.01, 0.02, 0.03, 0.05, 0.075,
    0.1, 0.15, 0.2, 0.3, 0.5, 1.0, 2.0,
)
```

Optional hardening:
- Read buckets from env (e.g. `HTTP_LATENCY_BUCKETS_MS`) with safe defaults; see the sketch below.
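A sketch of that env-driven override with safe defaults, reading the already-documented `HTTP_LATENCY_BUCKETS_SECONDS` variable (the parsing choices here are assumptions):

```python
import os

DEFAULT_BUCKETS = (
    0.001, 0.0025, 0.005, 0.01, 0.02, 0.03, 0.05, 0.075,
    0.1, 0.15, 0.2, 0.3, 0.5, 1.0, 2.0,
)

def latency_buckets() -> tuple[float, ...]:
    """Parse HTTP_LATENCY_BUCKETS_SECONDS, falling back to the defaults."""
    raw = os.getenv("HTTP_LATENCY_BUCKETS_SECONDS", "")
    try:
        buckets = tuple(sorted(float(v) for v in raw.split(",") if v.strip()))
        return buckets or DEFAULT_BUCKETS
    except ValueError:
        return DEFAULT_BUCKETS  # malformed env value -> safe defaults
```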
Acceptance criteria:
- `/ops/summary` p95 values are stable and realistic under load.
- `/metrics` histogram buckets reflect the new granularity.
- Quality gate thresholds (`gate_p95_ms`) map correctly to the observed p95.
- Replace handcrafted image features with embedding-based features (e.g. CLIP/ViT) and compare ROI.
- Add online feature store integration to keep train/serve feature parity auditable.
- Add shadow deployment mode for zero-risk candidate evaluation before canary traffic.
- Add policy-based promotion gates (quality + latency + drift) before alias updates.
- Add customer-level usage analytics in the ops panel for business reporting.
```
src/train.py                    # training + MLflow tracking + quality gate
src/data_quality.py             # schema/range/distribution quality checks
src/serve.py                    # API auth, rollout routing, metrics, admin panel
src/monitor_drift.py            # drift score computation
src/retrain_job.py              # retraining automation
src/loadtest_report.py          # k6 summary -> markdown report
src/business_impact_report.py   # one-page KPI proof package
loadtest/k6/predict.js          # API performance test scenario
observability/docker-compose.yml
k8s/deployment.yaml
k8s/ingress-staging.yaml
k8s/ingress-production.yaml
k8s/ingress-demo-staging.yaml
k8s/ingress-demo-production.yaml
k8s/ingress-oauth2-staging.yaml
k8s/ingress-oauth2-production.yaml
k8s/cert-manager-clusterissuer.yaml
scripts/rollout_shift.sh
```
Run tests:

```bash
make test
```

Built with <3 by WeAreTheArtMakers