A production-ready MLOps implementation of a credit card default prediction model, demonstrating modern ML engineering practices including automated training, prediction pipelines, model versioning, and business-focused metrics.
This project implements an end-to-end MLOps pipeline for credit card default prediction using the UCI ML Credit Card Default Dataset, featuring:
- Automated model retraining with hyperparameter optimization
- Business-focused metrics and cost-sensitive evaluation
- Model versioning and experiment tracking with MLflow
- Champion/Challenger model deployment strategy
- Comprehensive data validation and preprocessing
- Batch prediction capabilities
The system implements a complete MLOps architecture with automated pipelines for data processing, model training, and predictions. This design ensures reproducibility, scalability, and production-readiness.
```
┌─────────────────┐ ┌───────────────────┐ ┌──────────────────┐
│ Data Pipeline │ │ Training Pipeline │ │ Prediction │
│ ──────────── │ │ ──────────────── │ │ Pipeline │
│ │ │ │ │ ──────────────── │
│ ┌─────────┐ │ │ ┌─────────────┐ │ │ ┌────────────┐ │
│ │ Kaggle │ │ │ │ Parameter │ │ │ │ Model │ │
│ │ Dataset │───┐│ │ │ Optimization│ │ │ │ Resolution │ │
│ └─────────┘ ││ │ └──────┬──────┘ │ │ └─────┬──────┘ │
│ ││ │ │ │ │ │ │
│ ┌─────────┐ ││ │ ┌──────▼──────┐ │ │ ┌─────▼──────┐ │
│ │ Validate │◄──┘│ │ │ Cross- │ │ │ │ Prediction │ │
│ │ Data │ │ │ │ Validation │ │ │ │ Generation │ │
│ └─────┬───┘ │ │ └──────┬──────┘ │ │ └─────┬──────┘ │
│ │ │ │ │ │ │ │ │
│ ┌─────▼─────┐ │ │ ┌──────▼──────┐ │ │ ┌─────▼──────┐ │
│ │ Preprocess│ │ │ │ Model │ │ │ │ Feature │ │
│ │ Features │ │ │ │ Training │ │ │ │ Importance │ │
│ └─────┬─────┘ │ │ └──────┬──────┘ │ │ └─────┬──────┘ │
│ │ │ │ │ │ │ │ │
│ ┌─────▼─────┐ │ │ ┌──────▼──────┐ │ │ ┌─────▼──────┐ │
│ │ Processed │ │ │ │ SHAP │ │ │ │ Default │ │
│ │ Dataset │──┼─────┼─►│ Explanations│ │ │ │ Probability│ │
│ └───────────┘ │ │ └──────┬──────┘ │ │ └─────┬──────┘ │
└─────────────────┘ │ │ │ │ │ │
 │ ┌──────▼──────┐ │ │ ┌─────▼──────┐ │
┌──────────────────┐ │ │ MLflow │ │ │ │ Prediction │ │
│ Visualization │ │ │ Logging │◄─┼─────┼──┤ Explanations │
│ ─────────────── │ │ └─────────────┘ │ │ └────────────┘ │
│ ┌────────────┐ │ └───────────────────┘ └──────────────────┘
│ │ Performance│◄─┼──────────────────────────┐
│ │ Charts │ │ │
│ └────────────┘ │ ┌───────────────────┐ │
│ │ │ CI/CD Automation │ │
│ ┌────────────┐ │ │ ──────────────── │ │
│ │ Feature │◄─┼────┤ │ │
│ │ Importance │ │ │ ┌─────────────┐ │ │
│ └────────────┘ │ │ │ Github │ │ │
│ │ │ │ Actions │──┼─┘
│ ┌────────────┐ │ │ └─────────────┘ │
│ │ SHAP │◄─┼────┤ │
│ │ Plots │ │ │ ┌─────────────┐ │
│ └────────────┘ │ │ │ Automatic │ │
└──────────────────┘ │ │ Retraining │ │
 │ └─────────────┘ │
 └───────────────────┘
```
The system consists of three main pipelines, plus supporting MLOps infrastructure:
- **Data Pipeline**:
  - Validates and preprocesses credit card data with domain-specific checks
  - Handles missing values and outliers with appropriate business rules
  - Performs feature engineering tailored to financial default prediction
  - Integrates with the Kaggle API for automated dataset retrieval
- **Training Pipeline**:
  - Automatically retrains the model with cost-sensitive optimization
  - Implements hyperparameter optimization using Optuna
  - Performs rigorous cross-validation with stratification
  - Generates SHAP explanations for model interpretability
  - Logs all training metrics and artifacts to MLflow
- **Prediction Pipeline**:
  - Generates probability-based default predictions with explanations
  - Resolves the champion model from the MLflow registry
  - Provides detailed SHAP-based explanations for each prediction
  - Calculates business-specific metrics (approval rate, cost per decision)
  - Optimizes the probability threshold based on a custom cost matrix
- **MLOps Infrastructure**:
  - Automated CI/CD with GitHub Actions
  - Integrated experiment tracking with MLflow
  - Automated visualization generation
  - Model versioning and a champion/challenger approach
  - Comprehensive documentation and model cards
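The cost-sensitive threshold optimization mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project's actual code; the function names and cost values are assumptions (a false positive is a wrongly denied creditworthy customer, a false negative is an undetected default):

```python
def expected_cost(y_true, y_prob, threshold, fp_cost=1.0, fn_cost=5.0):
    """Average cost per decision at a given probability threshold.

    Cost values are illustrative placeholders, not the project's real matrix.
    """
    cost = 0.0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0  # 1 = predicted default -> deny credit
        if pred == 1 and y == 0:
            cost += fp_cost  # wrongly denied credit
        elif pred == 0 and y == 1:
            cost += fn_cost  # missed default
    return cost / len(y_true)

def best_threshold(y_true, y_prob, candidates=None, **costs):
    """Pick the candidate threshold with the lowest expected cost."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda t: expected_cost(y_true, y_prob, t, **costs))
```

Because a missed default is usually costlier than a wrongly denied application, the optimal threshold typically lands below the naive 0.5 cutoff.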
```
├── creditrisk/     # Main package directory
│   ├── config.py       # Configuration and constants
│   ├── validation.py   # Data validation
│   ├── preproc.py      # Data preprocessing
│   ├── metrics.py      # Business metrics
│   ├── train.py        # Training pipeline
│   ├── predict.py      # Prediction pipeline
│   └── resolve.py      # Model resolution logic
├── data/           # Data directory
├── docs/           # Documentation and model cards
├── .mlflow/        # MLflow tracking
├── models/         # Model artifacts
├── notebooks/      # Development notebooks
└── reports/        # Generated analysis
```
- Python 3.11.9 or higher
- Kaggle account and API key
- UV package manager
- MLflow (for experiment tracking and model management)
- Clone the repository:

```bash
git clone <your-repo-url>
cd ARISA-MLOps
```

- Create and activate a virtual environment:

```bash
python -m venv .venv
# Windows
.\.venv\Scripts\activate
# Mac/Linux
source .venv/bin/activate
```

- Install UV and dependencies:

```bash
pip install uv
uv pip install -e .
```

- Initialize MLflow:

```bash
# Create MLflow directories with correct permissions
mkdir -p .mlflow/db
mkdir -p .mlflow/artifacts
chmod -R 775 .mlflow

# Set the MLflow tracking URI to use the local directory
export MLFLOW_TRACKING_URI="file://${PWD}/.mlflow"

# Create a new MLflow experiment
mlflow experiments create -n "credit-default-prediction"

# Set it as the active experiment
export MLFLOW_EXPERIMENT_NAME="credit-default-prediction"

# Start the MLflow UI server (run in background)
mlflow ui --backend-store-uri sqlite:///.mlflow/db/mlflow.db --default-artifact-root .mlflow/artifacts
# Access the MLflow UI at http://localhost:5000
```

- Set up Kaggle authentication:
To create and set up your Kaggle API key:

- Log in to your Kaggle account at kaggle.com
- Go to "Account" by clicking your profile picture in the top-right corner
- Scroll down to the "API" section
- Click "Create New API Token"; this downloads a `kaggle.json` file
- Place your `kaggle.json` in:
  - Windows: `C:\Users\USERNAME\.kaggle`
  - Mac/Linux: `/home/username/.config/kaggle`
- Ensure the permissions are secure (Linux/Mac):

```bash
chmod 600 ~/.kaggle/kaggle.json
```
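Once the file is in place, a quick sanity check can catch the most common mistakes (missing file, missing keys, overly open permissions). The helper below is an illustrative sketch, not part of this project:

```python
import json
import os
import stat
from pathlib import Path

def check_kaggle_credentials(path=None):
    """Sanity-check a kaggle.json file: it exists, parses as JSON, contains
    the expected keys, and (on Unix) is not readable by other users."""
    path = Path(path or Path.home() / ".kaggle" / "kaggle.json")
    if not path.exists():
        return False, f"missing: {path}"
    creds = json.loads(path.read_text())
    if not {"username", "key"} <= creds.keys():
        return False, "kaggle.json must contain 'username' and 'key'"
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:  # any group/other bits set -> too permissive
            return False, f"permissions too open: {oct(mode)} (want 600)"
    return True, "ok"
```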
- Set up a GitHub Personal Access Token (for workflows):
To create a GitHub PAT for workflow automation:
- Log in to your GitHub account
- Go to "Settings" → "Developer settings" → "Personal access tokens"
- Click "Generate new token" (choose "Fine-grained tokens" for better security)
- Give your token a descriptive name
- Set an appropriate expiration date
- Select required permissions (typically "repo" for full repository access)
- Click "Generate token"
- IMPORTANT: Copy and save your token securely - GitHub will only show it once!
- Store it as a repository secret:
  - Go to your repository → "Settings" → "Secrets and variables" → "Actions"
  - Click "New repository secret"
  - Name it `WORKFLOW_PAT` (or your preferred name)
  - Paste your token and click "Add secret"
Run the entire pipeline (preprocessing, training, and prediction) with a single command:

```bash
make preprocess && make train && make predict
```

You can also run each component separately if needed.

Process and validate new credit card data:

```bash
# Download and preprocess data
make preprocess

# Validate a specific dataset
python -m creditrisk.models.validation --input path/to/data.csv
```

Train a new model with optimized hyperparameters:

```bash
# Full training pipeline with hyperparameter optimization
make train

# Cross-validation only
python -m creditrisk.models.train --cv-only

# Quick training with default parameters
python -m creditrisk.models.train --quick
```

Generate default predictions for new customers:

```bash
# Run predictions using the latest model
make predict

# Single prediction with explanation
python -m creditrisk.models.predict --explain customer_data.json
```

This project implements comprehensive continuous integration and continuous delivery through GitHub Actions workflows that automate the entire MLOps pipeline.
The system automatically retrains the model whenever relevant changes are detected, using the `retrain_on_change.yml` workflow. Triggers:

- Push to the main branch affecting:
  - Data files (`data/raw/UCI_Credit_Card.csv`, `data/processed/train.csv`)
  - Python code in the `creditrisk` package
  - Model hyperparameter files (`models/best_params.pkl`)
  - The workflow file itself
- Manual trigger via the GitHub Actions UI

Required secrets:

- `KAGGLE_USERNAME`: Your Kaggle username
- `KAGGLE_KEY`: Your Kaggle API key
- `WORKFLOW_PAT`: GitHub Personal Access Token with repo permissions (see the Setup section)
- **Preprocessing Job**:
  - Sets up the Python environment with the UV package manager
  - Configures Kaggle authentication for dataset access
  - Creates the necessary directory structure for artifacts
  - Runs linting on key files with pylint
  - Runs data preprocessing with automatic retries (3 attempts)
  - Implements comprehensive error handling and debugging
  - Verifies that processed data exists and uploads it as an artifact
  - Provides detailed logging for troubleshooting
- **Training Job**:
  - Downloads the processed-data artifact from the previous job
  - Sets up the Python and MLflow environment
  - Creates and configures MLflow directories with correct permissions
  - Runs the full training pipeline with automatic retries:
    - Hyperparameter optimization with Optuna
    - Cross-validation training
    - Final model training with SHAP explanations
  - Temporarily disables branch protection using the PAT
  - Commits and pushes model artifacts, MLflow data, and reports
  - Restores branch protection rules
  - Triggers the downstream prediction workflow via repository dispatch
  - Implements detailed logging and error handling
The `predict_on_model_change.yml` workflow automatically runs predictions when a new model is registered. Triggers:

- Repository dispatch from the training workflow
- Push to the main branch affecting:
  - MLflow artifacts and database
  - Prediction or model-resolution code
  - The workflow file itself
- Manual trigger via the GitHub Actions UI
- **Environment Setup**:
  - Configures the Python environment with dependencies
  - Sets up Kaggle authentication
  - Prepares the test dataset for prediction
- **MLflow Configuration**:
  - Creates and configures MLflow directories
  - Sets the tracking URI and artifact root
  - Verifies the connection to the MLflow registry
- **Model Resolution and Prediction**:
  - Resolves the "champion" model from the MLflow registry
  - Runs predictions on the test dataset
  - Generates SHAP explanations for the predictions
  - Creates visualization artifacts
  - Uploads prediction results as artifacts
The `fix_mlflow_artifacts.yml` workflow provides a maintenance utility for MLflow artifact handling:

- Manual workflow to fix or update MLflow artifacts
- Ensures visualization artifacts are correctly linked to MLflow runs
- Creates or updates `metrics.json` files for MLflow tracking
- **Artifact Management**:
  - Copies visualizations from the reports directory to MLflow artifacts
  - Creates or updates `metrics.json` with key model statistics
  - Commits changes to keep artifacts under version control
For the workflows to function properly, you must add the following secrets to your GitHub repository:
- `KAGGLE_USERNAME`: Your Kaggle username
- `KAGGLE_KEY`: Your Kaggle API key
- `WORKFLOW_PAT`: A GitHub Personal Access Token with `repo` scope
- Add the secrets in GitHub:
  - Go to your repository → "Settings" → "Secrets and variables" → "Actions"
  - Click "New repository secret"
  - Add each secret with its name and value
  - Click "Add secret"
- Kaggle credentials format:
  - `KAGGLE_KEY` should be the API key from your Kaggle account
- GitHub PAT permissions:
  - Ensure `WORKFLOW_PAT` has the following permissions:
    - `repo` (Full control of private repositories)
    - `workflow` (Update GitHub Action workflows)
See the Setup section above for detailed instructions on creating these secrets.
The model is optimized for business impact using:
- Precision-Recall AUC for imbalanced classification
- Custom cost matrix (FP: wrongly denied credit, FN: default loss)
- Business metrics (approval rate, default rate, avg cost per decision)
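As an illustration of how such business metrics come together, the hypothetical helper below (with placeholder cost values, not the project's actual code) computes them from a batch of decisions:

```python
def business_metrics(y_true, y_pred, fp_cost=1.0, fn_cost=5.0):
    """Approval rate, observed default rate among approved, avg cost per decision.

    y_pred: 1 = predicted default (application denied), 0 = approved.
    Cost values are illustrative placeholders.
    """
    n = len(y_true)
    approved = [y for y, p in zip(y_true, y_pred) if p == 0]
    fp = sum(1 for y, p in zip(y_true, y_pred) if p == 1 and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_pred) if p == 0 and y == 1)
    return {
        "approval_rate": len(approved) / n,
        "default_rate_approved": (sum(approved) / len(approved)) if approved else 0.0,
        "avg_cost_per_decision": (fp * fp_cost + fn * fn_cost) / n,
    }
```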
Current performance metrics:
- F1 Score (CV Mean): 0.477
- Feature Group Importance (SHAP values):
  - Bill Amounts: 0.031
  - Payment History: 0.030
  - Payment Amounts: 0.028
  - Demographics: 0.019
These SHAP values indicate that financial behavior (bill and payment patterns) has the strongest influence on default prediction, while demographic factors have relatively less impact.
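The group scores above aggregate per-feature SHAP values. A simplified, hypothetical version of that aggregation, using a few real UCI column names but an assumed grouping scheme and averaging rule:

```python
from collections import defaultdict

def group_importance(mean_abs_shap, groups):
    """Aggregate per-feature mean |SHAP| values into feature-group scores.

    mean_abs_shap: {feature_name: mean absolute SHAP value}
    groups: {feature_name: group_name}; unmapped features fall into "other".
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feat, val in mean_abs_shap.items():
        g = groups.get(feat, "other")
        totals[g] += val
        counts[g] += 1
    # Average within each group so groups of different sizes stay comparable
    return {g: totals[g] / counts[g] for g in totals}
```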
The project uses MLflow for experiment tracking and model versioning. To view the MLflow UI:
- Ensure the MLflow environment is properly configured:

```bash
# Set the MLflow tracking URI if not already set
export MLFLOW_TRACKING_URI="file://${PWD}/.mlflow"

# Verify the MLflow directory structure
ls -la .mlflow/db
ls -la .mlflow/artifacts
```

- Start the MLflow UI server:

```bash
# Start the MLflow UI on port 5000
mlflow ui --backend-store-uri sqlite:///.mlflow/db/mlflow.db --default-artifact-root .mlflow/artifacts --port 5000
```

- Access the UI in your browser at http://127.0.0.1:5000

Note: If you encounter permission issues or missing `meta.yaml` files:

- Check directory permissions:

```bash
# Fix MLflow directory permissions
chmod -R 775 .mlflow
```

- Reinitialize the experiment if needed:

```bash
mlflow experiments create -n "credit-default-prediction"
export MLFLOW_EXPERIMENT_NAME="credit-default-prediction"
```

Key MLflow features in this project:
- Experiment tracking with metrics, parameters, and artifacts
- Model versioning and deployment management
- Performance visualization and comparison
- Model registry for production deployment
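The champion/challenger strategy boils down to a promotion rule: a newly trained challenger replaces the current champion only if it clearly outperforms it. A minimal sketch, assuming a primary metric of PR-AUC and an arbitrary promotion margin (the real decision would go through the MLflow model registry, not a bare function):

```python
def should_promote(champion_metrics, challenger_metrics,
                   metric="pr_auc", min_gain=0.01):
    """Promote the challenger only if it beats the champion by a margin.

    The margin guards against promoting on noise-level improvements.
    Metric name and margin are illustrative assumptions.
    """
    champ = champion_metrics.get(metric)
    chall = challenger_metrics.get(metric)
    if champ is None:          # no champion yet: the first model wins by default
        return chall is not None
    if chall is None:
        return False
    return chall - champ >= min_gain
```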
If you encounter permission errors when logging artifacts:
- Ensure your user has write permissions to the MLflow directory:

```bash
sudo chown -R $USER:$USER .mlflow
chmod -R u+w .mlflow
```

- For CI/CD environments, verify the MLflow directory structure:

```bash
# Verify that the MLflow directories exist
ls -la .mlflow/db
ls -la .mlflow/artifacts

# Recreate them if needed
mkdir -p .mlflow/db
mkdir -p .mlflow/artifacts
chmod -R 775 .mlflow
```

See `docs/model_card.md` for detailed information about:
- Model characteristics and architecture
- Training data and preprocessing
- Performance benchmarks and metrics
- Intended use cases and limitations
- Fairness considerations and bias analysis
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
MIT
Piotr Gryko
- UCI ML Credit Card Default Dataset
- MLOps architecture inspired by ml-ops.org