ENSAE Paris — Mise en production course
End-to-end MLOps pipeline for predicting loan approval:
- data processing
- hyperparameter tuning across three model families
- MLflow experiment tracking, FastAPI deployment
- Kubernetes orchestration on SSPCloud
- GitOps automation via ArgoCD
- Prometheus/Grafana monitoring
- SHAP explanations
- drift-triggered automatic retraining.
| Service | URL pattern |
|---|---|
| Web UI | https://loan-api-user-kbourbon.user.lab.sspcloud.fr |
| Swagger UI | https://loan-api-user-kbourbon.user.lab.sspcloud.fr/docs |
| Prometheus metrics | https://loan-api-user-kbourbon.user.lab.sspcloud.fr/metrics |
| Grafana dashboard | https://grafana-loan-user-kbourbon.user.lab.sspcloud.fr (admin / admin) |
Your username is your SSPCloud namespace without the user- prefix (e.g. namespace user-johndoe → username johndoe).
This section is for anyone who wants to clone the repo and get the exact same results.
| Tool | Version | Install |
|---|---|---|
| Python | 3.13 | https://www.python.org/downloads/ |
| uv | latest | `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| Docker + Docker Compose | any recent | https://docs.docker.com/get-docker/ |
| git | any | — |
| Access to an MLflow service on SSPCloud | — | — |
| Access to a MinIO S3 bucket on SSPCloud to store the data | — | — |
Optional (Kubernetes deployment only):
- kubectl configured against an SSPCloud cluster
Open an MLflow service (on SSPCloud, for example) and note the following variables, which are shown when the service is created:
- MLFLOW_TRACKING_USERNAME
- MLFLOW_TRACKING_PASSWORD
- MLFLOW_TRACKING_URI
MLFLOW_TRACKING_URI is the HTTP link shown when the service is created.
git clone https://github.com/kellybourbon2/Loan-prediction-approval.git
cd Loan-prediction-approval
uv sync # installs exact locked dependencies (uv.lock)
uv run pre-commit install # enables ruff lint+format on every commit
uv sync reads uv.lock: every dependency is pinned, so you get the identical environment.
The model trains on the Kaggle Playground Series S4E10 — Loan Approval Prediction dataset.
- Download train.csv from https://www.kaggle.com/competitions/playground-series-s4e10/data
- Upload it to your S3 bucket at the root: s3://username/<your-bucket>/train.csv
The data loader reads it directly from S3 at training time — no local copy needed.
cp .env.example .env
Edit .env with your credentials:
#S3 setting
AWS_ACCESS_KEY_ID=<your_key>
AWS_SECRET_ACCESS_KEY=<your_secret>
AWS_SESSION_TOKEN=<your_token> # leave empty if not using SSPCloud temp tokens
AWS_S3_ENDPOINT=minio.lab.sspcloud.fr
AWS_BUCKET_NAME=<your_data_bucket> # the path to the bucket where train.csv is stored
#mlflow setting
MLFLOW_TRACKING_USERNAME=<your_mlflow_username>
MLFLOW_TRACKING_URI=<your_mlflow_tracking_uri>
MLFLOW_TRACKING_PASSWORD=<your_mlflow_password>
For the MLflow variables, use the values you copied in step 0.
These variables are loaded automatically by data_load.py via python-dotenv.
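Before launching training, it can help to verify that the variables above are actually set. The following is a minimal sketch using only the standard library. The helper name `missing_env_vars` is hypothetical and not part of the repository, and it checks an environment mapping rather than parsing the .env file itself. AWS_SESSION_TOKEN is left out because it may legitimately be empty.

```python
import os

# Variables listed in the .env template above (AWS_SESSION_TOKEN is optional).
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ENDPOINT",
    "AWS_BUCKET_NAME",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper, not part of the repository.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no argument inspects the current process environment; an empty list means every required variable is set.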
uv run python src/main.py
What happens:
| Step | Detail |
|---|---|
| Load | train.csv downloaded from S3 |
| Preprocess | DataPreprocessor: drop unused columns, bin age, binary-encode credit default, StandardScaler + OneHotEncoder |
| Split | 3-way: 80% training / 10% calibration / 10% evaluation — no leakage between the three |
| Tune | Hyperopt TPE search (MAX_EVALS=10) over XGBoost, CatBoost, RandomForest simultaneously — best model wins |
| Train | Best model retrained on full training set |
| Calibrate | CalibratedClassifierCV(method='isotonic', cv='prefit') fitted on the calibration split — well-calibrated probabilities |
| Evaluate | Accuracy, F1, Recall, Precision + confusion matrix on the eval split (never seen before) |
| Log | All metrics, params, confusion matrix PNG artifact → MLflow experiment Loan Prediction Approval Experiments |
| Register | Model registered in MLflow Registry as @challenger |
| Promote | Promoted to @champion only if F1 ≥ 0.5 and F1 > current champion (regression guard) |
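The promotion rule in the last row can be sketched as a small pure function. This is an illustration of the stated rule only, not the code in registry.py. The function name `should_promote` is hypothetical, and the behaviour when no champion exists yet is an assumption.

```python
from typing import Optional

F1_PROMOTION_THRESHOLD = 0.5  # same default as listed for config.py


def should_promote(challenger_f1: float, champion_f1: Optional[float]) -> bool:
    """Decide whether a challenger replaces the champion (hypothetical helper)."""
    # Absolute floor: a challenger below the threshold is never promoted.
    if challenger_f1 < F1_PROMOTION_THRESHOLD:
        return False
    # Assumption: with no champion registered yet, the threshold alone decides.
    if champion_f1 is None:
        return True
    # Regression guard: the challenger must strictly beat the current champion.
    return challenger_f1 > champion_f1
```

For example, `should_promote(0.70, 0.60)` returns `True`, while `should_promote(0.60, 0.60)` returns `False` because a tie does not beat the champion.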
To inspect runs after training, open the URL stored in your MLFLOW_TRACKING_URI variable in a browser.
All metrics appear under Model Training > "Loan Prediction Approval Experiments".
The API loads the @champion model from MLflow at startup.
uv run uvicorn src.api.app:app
By default, the API is served on port 8000 of your local machine. You can view the app by opening: http://127.0.0.1:8000
You can also query the model directly. To do so, open a new bash terminal (without closing the first one) and paste:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"person_age": 30,
"person_income": 60000,
"person_home_ownership": "RENT",
"person_emp_length": 5.0,
"loan_intent": "PERSONAL",
"loan_amnt": 10000,
"loan_percent_income": 0.17,
"cb_person_default_on_file": "N",
"cb_person_cred_hist_length": 4
}'
# → {"loan_status":1,"approved":true,"probability":0.9733}
Once you have queried the API, you can stop the application by pressing Ctrl + C in the terminal where uvicorn is running.
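The same request can also be sent from Python while the API is running. The sketch below uses only the standard library. The function names are hypothetical, and it assumes the API listens on localhost:8000 as in the curl example.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumes the API is running locally on port 8000

# Same application as in the curl example.
payload = {
    "person_age": 30,
    "person_income": 60000,
    "person_home_ownership": "RENT",
    "person_emp_length": 5.0,
    "loan_intent": "PERSONAL",
    "loan_amnt": 10000,
    "loan_percent_income": 0.17,
    "cb_person_default_on_file": "N",
    "cb_person_cred_hist_length": 4,
}


def build_predict_request(base_url: str, application: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(application).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, application: dict) -> dict:
    """Send the request and decode the JSON response (requires a running API)."""
    request = build_predict_request(base_url, application)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `predict(API_URL, payload)` should return a dictionary with the same keys as the curl output.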
The docker-compose.yaml manifest can be used to run the full stack (API + Prometheus + Grafana) locally. It pulls the Prometheus and Grafana images and builds the local API image, creating three containers in which the API, Grafana and Prometheus run independently.
Since Docker is not available on SSPCloud, run the following on a local machine with Docker installed (for example from a local VSCode terminal):
docker compose up
Open the following links to view each service:
| Service | URL | Credentials |
|---|---|---|
| API | http://localhost:8000 | — |
| Prometheus | http://localhost:9090 | — |
| Grafana | http://localhost:3000 | admin / admin |
Once the API is running, you can run the integration tests from another terminal (while the API is still running):
INTEGRATION_API_URL=http://localhost:8000 \
uv run pytest unit_tests/test_integration.py -v

| File | Tests | What is covered |
|---|---|---|
| test_preprocessing.py | 6 | DataPreprocessor: clean, feature engineering, split, encoding |
| test_api.py | 14 | /predict, /predict/batch, /explain (mocked model) |
| test_integration.py | 8 | Real HTTP calls: health, predict, batch, explain, metrics, no traceback leak |
├── src/
│ ├── api/
│ │ ├── app.py # FastAPI app (predict, batch, explain, health, metrics)
│ │ ├── schemas.py # Pydantic input/output schemas
│ │ ├── metrics.py # Prometheus metrics definitions
│ │ └── logger.py # Structured prediction logger → S3 sync
│ ├── model/
│ │ ├── train.py # Model training wrapper
│ │ ├── tune.py # Hyperopt objective + model builder (with early stopping)
│ │ ├── evaluate.py # Metrics + confusion matrix → MLflow
│ │ ├── registry.py # MLflow registry: register, promote, load champion
│ │ └── search_space.py # Hyperopt search space (XGBoost, CatBoost, RF)
│ ├── data_processing/
│ │ ├── preprocessing.py # DataPreprocessor (clean → engineer → scale → encode)
│ │ └── data_load.py # S3 data loading via s3fs
│ ├── main.py # Full training entrypoint
│ └── drift_analysis.py # KS test + PSI drift detection
├── .github/workflows/
│ ├── ci.yml # Ruff + unit tests + integration tests + build and push api image
│ ├── retrain.yml # Manual/scheduled retraining (every Monday 2am UTC) --> triggers ci
│ └── drift_check.yml # Daily drift check → triggers retrain if drift detected
│
│
├── monitoring/
│ └── grafana/
│ ├── dashboards/ # Dashboard JSON (auto-provisioned)
│ └── provisioning/ # Datasources, dashboards, alerting rules
├── unit_tests/
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml # Python project + ruff config
├── uv.lock # Pinned dependency lockfile
└── config.py # All training constants (CV_FOLDS, MAX_EVALS, thresholds…)
| Method | Path | Description |
|---|---|---|
| GET | / or /ui/ | Web UI: loan assessment form |
| GET | /health | Returns 200 if model loaded, 503 otherwise |
| POST | /predict | Single loan prediction |
| POST | /predict/batch | Batch prediction (max 500 per request) |
| POST | /explain | SHAP feature contributions for one application |
| GET | /metrics | Prometheus metrics endpoint |
| GET | /docs | Swagger UI |
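Because /predict/batch accepts at most 500 applications per request, a client holding a larger list has to split it first. A minimal sketch of such a splitter follows. The helper name `chunk_applications` is hypothetical, and the request body format of the batch endpoint is not assumed here.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 500  # limit documented for /predict/batch


def chunk_applications(
    applications: Sequence[dict], size: int = MAX_BATCH_SIZE
) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` applications (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(applications), size):
        yield list(applications[start:start + size])
```

Each yielded chunk can then be sent as one batch request, so 1200 applications become three requests of 500, 500 and 200.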
SHAP values are computed without the external shap library (incompatible with Python 3.13):
- XGBoost: get_booster().predict(dmat, pred_contribs=True)
- CatBoost: get_feature_importance(type="ShapValues", data=pool)
- RandomForest: global feature importances weighted by prediction deviation
Positive SHAP values push toward approval, negative toward rejection.
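To read an explanation, it is usually enough to rank features by the magnitude of their contribution and look at the sign. The sketch below assumes the contributions are available as a feature-to-value mapping. The exact response format of /explain is not described here, the helper name is hypothetical, and the numbers are made up for illustration.

```python
def rank_contributions(contributions, top_k=3):
    """Return the top_k features by absolute contribution, with their direction."""
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)

    def direction(value):
        # Sign convention from the text: positive pushes toward approval.
        if value > 0:
            return "approval"
        if value < 0:
            return "rejection"
        return "neutral"

    return [(name, value, direction(value)) for name, value in ranked[:top_k]]


# Illustrative values only, not real model output.
example = {
    "loan_percent_income": -0.8,
    "person_income": 0.5,
    "person_emp_length": 0.1,
    "cb_person_cred_hist_length": 0.0,
}
top = rank_contributions(example, top_k=2)
```

Here `top` lists loan_percent_income first (pushing toward rejection), followed by person_income (pushing toward approval).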
| Workflow | Trigger | What it does |
|---|---|---|
| ci.yml | Push touching src/, Dockerfile, pyproject.toml, uv.lock on main branch | Run unit tests + build Docker image + push to Docker Hub |
| retrain.yml | Manual or every Monday 2am UTC | Full retraining of the model (run src/main.py) + MLflow registry update |
| drift_check.yml | Daily 8am UTC | Download predictions.jsonl from S3 → KS + PSI analysis → trigger retrain.yml if drift detected |
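The drift check relies on a KS test and the Population Stability Index (PSI). As an illustration of the second one, here is a minimal pure-Python PSI sketch. It is not the code in drift_analysis.py: the binning (quantiles of the reference sample), the number of bins and the epsilon smoothing are all assumptions.

```python
import math
from bisect import bisect_right


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples (illustrative sketch)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_sorted = sorted(reference)
    # Interior bin edges: quantiles of the reference sample.
    edges = [ref_sorted[int(len(ref_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[bisect_right(edges, x)] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples give a PSI of 0, and the value grows as the current distribution moves away from the reference. A common rule of thumb treats values above roughly 0.25 as a significant shift; the threshold actually used by the workflow is not stated here.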
NB: the CD workflows (cd, retrain and drift_check) are run by a GitHub Actions bot using the automatically generated GITHUB_TOKEN. We chose this approach to keep deployments durable: even if a user account is removed from GitHub, the bot still handles deployments.
Go to Settings → Secrets and variables → Actions and add:
| Name | Type | Value |
|---|---|---|
| DOCKERHUB_TOKEN | Secret | Docker Hub access token |
| AWS_ACCESS_KEY_ID | Secret | S3 credentials (for retrain + drift check) |
| AWS_SECRET_ACCESS_KEY | Secret | — |
| AWS_SESSION_TOKEN | Secret | — |
| AWS_S3_ENDPOINT | Secret | e.g. minio.lab.sspcloud.fr |
| AWS_BUCKET_NAME | Secret | — |
| DOCKERHUB_USERNAME | Variable | Docker Hub username |
| API_URL | Variable | Deployed API base URL; enables integration tests and post-deploy healthcheck |
Make sure you have created a DOCKERHUB_TOKEN with "Read" scope and that your Docker image is public.
The deployment of the application is handled by a separate GitOps repository: https://github.com/kellybourbon2/Loan-prediction-approval-deployment
If you want to recreate the Kubernetes cluster from scratch, you can download the deployment/ folder of that repository and follow the steps below.
Warning: you cannot orchestrate the Kubernetes cluster unless you chose the "Admin" role when creating your SSPCloud VSCode service.
The goal here is to create three Kubernetes pods so that the API can be reached from any machine:
- one pod running the official Prometheus image
- one pod running the official Grafana image
- one pod running our loan-api image, built and pushed to Docker Hub earlier.
1. Create a secret YAML manifest at the root of the project:
cp secret.example.yaml secret.yaml
Edit secret.yaml with your credentials. These are the same credentials you entered in your .env file earlier. The secret will be named "loan-api-secret".
2. Apply this secret to your Kubernetes cluster:
kubectl apply -f ./secret.yaml
3. Adapt the Kubernetes manifests in the deployment folder by replacing every occurrence of user-kbourbon with your own Kubernetes username. In deployment.yaml, also replace kellybrbn/loan-api with your own Docker image path.
> Note that you can find your Kubernetes username in your environment variables by running:
```bash
env | grep ^KUBERNETES_NAMESPACE
```
4. Apply the YAML manifests to the Kubernetes cluster:
```bash
kubectl apply -f deployment/
```
You can monitor the pods by running:
kubectl get pods -w
If everything goes well, you should see three pods: one for loan-api, one for Prometheus and one for Grafana. Once all three show the status "Running" with 1/1 ready, the application should be exposed at: https://loan-api-kubernetes-username.user.lab.sspcloud.fr
flowchart LR
A[src/ change] -->|push to main| B[CI workflow]
B --> C[lint + tests]
C --> D[build Docker image]
D -->|push sha tag| E[Docker Hub]
D -->|PAT push| F[GitOps repo\ndeployment.yaml]
F -->|detects change| G[ArgoCD]
G -->|sync| H[Kubernetes cluster\nSSP Cloud]
H -->|prediction logs| I[(S3 + MLflow)]
I -->|daily| J[Drift check]
J -->|drift detected| K[Retrain]
K -->|triggers| B
| Metric | Type | Description |
|---|---|---|
| loan_predictions_total{result} | Counter | Approved / rejected counts |
| loan_prediction_probability | Histogram | Distribution of approval probabilities |
| loan_prediction_errors_total | Counter | Prediction errors |
| loan_approval_rate | Gauge | Rolling approval rate (last 100 predictions) |
| loan_request_income | Histogram | Applicant income (drift monitoring) |
| loan_request_amount | Histogram | Loan amount (drift monitoring) |
| loan_request_lti_ratio | Histogram | Loan-to-income ratio (drift monitoring) |
| loan_batch_size | Histogram | Batch request sizes |
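The rolling approval rate over the last 100 predictions can be maintained with a bounded deque. The class below is an illustrative sketch, not the code in metrics.py; the class name and the choice to report 0.0 before any prediction are assumptions.

```python
from collections import deque


class RollingApprovalRate:
    """Share of approvals over the most recent `window` predictions (sketch)."""

    def __init__(self, window: int = 100):
        # A deque with maxlen silently drops the oldest outcome once full.
        self._outcomes = deque(maxlen=window)

    def record(self, approved: bool) -> float:
        """Store one outcome and return the updated rate."""
        self._outcomes.append(1 if approved else 0)
        return self.value

    @property
    def value(self) -> float:
        # Assumption: report 0.0 until the first prediction has been recorded.
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)
```

The value returned by `record` is what would be written to a gauge such as loan_approval_rate after each prediction.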
| Alert | Condition | Severity |
|---|---|---|
| High Prediction Error Rate | > 5 errors in 5 min | Critical |
| Abnormally Low Approval Rate | < 10% for 5 min | Warning |
| High API Latency | p95 > 2s for 3 min | Warning |
All constants are in config.py:
| Variable | Default | Description |
|---|---|---|
| CV_FOLDS | 5 | Stratified K-Fold folds during hyperparameter search |
| MAX_EVALS | 10 | Hyperopt iterations (increase for better results, slower training) |
| RANDOM_STATE | 42 | Seed for all random operations; guarantees reproducibility |
| TEST_SIZE | 0.2 | Holdout fraction (split into calibration + eval) |
| F1_PROMOTION_THRESHOLD | 0.5 | Minimum F1 required to promote a challenger to @champion |
| MLFLOW_MODEL_NAME | loan-approval-model | Model name in the MLflow Registry |
| MLFLOW_MODEL_NAME | Loan Prediction Approval Experiments | Name of the MLflow experiment |
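To make the interaction between TEST_SIZE and RANDOM_STATE concrete, here is a minimal sketch of a seeded three-way split on row indices. It only illustrates how a 0.2 holdout halved into calibration and evaluation yields 80/10/10. The function name is hypothetical, and unlike the real pipeline may do, this sketch does not stratify by target.

```python
import random

RANDOM_STATE = 42  # defaults taken from the table above
TEST_SIZE = 0.2


def three_way_split(n_rows, test_size=TEST_SIZE, seed=RANDOM_STATE):
    """Return disjoint (train, calibration, evaluation) index lists (sketch)."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_rows * test_size))
    n_calibration = n_holdout // 2
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    return train, holdout[:n_calibration], holdout[n_calibration:]


train_idx, calib_idx, eval_idx = three_way_split(1000)
```

With 1000 rows this gives 800 training, 100 calibration and 100 evaluation indices, and the same seed always reproduces the same partition.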