# DriftGuard: Autonomous Model Drift Detection & MLOps Recovery System
DriftGuard is an enterprise-grade MLOps dashboard designed to monitor, detect, and remediate machine learning model degradation in production using advanced statistical methods like Population Stability Index (PSI), Kolmogorov-Smirnov testing, and Kullback-Leibler Divergence. It replaces opaque model failure with quantifiable health metrics and automated retraining strategies.
## ❗ The Problem

In production environments, ML models don't fail with an error stack trace; they fail silently as data distributions shift (Data Drift) or the relationships between inputs and targets change (Concept Drift).

DriftGuard solves this by providing a continuous monitoring layer that:
- Quantifies Drift: Uses statistical methods like Population Stability Index (PSI) to measure distribution shifts.
- Visualizes Impact: Correlates drift scores with estimated accuracy drops.
- Prescribes Action: Automates the cost-benefit analysis of retraining models versus letting them run.
## 🧭 Design Principles

- Observability First: Dashboard-centric view of model health (Health Score).
- Statistical Rigor: Proven metrics (PSI, Kolmogorov-Smirnov test, KL Divergence) rather than simple distinct counts.
- Actionable Insights: Recommendations are linked to business value (Revenue at Risk vs. Retraining Cost).
## ✨ Key Features

- Real-time Health Score: A composite metric (0-100) derived from drift severity across all features.
- Dynamic Metrics: Tracks "Total Predictions", "Average Drift Score", and "Estimated Accuracy" live.
- Feature-Level Diagnostics: Identifies exactly which features (e.g., `Income`, `Age`, `Debt Ratio`) are causing the model to degrade.
- Deep Dive Analysis: Drill down into specific features (click "Analyze") to view histograms, PSI/KS/KL metrics, and descriptive statistics comparing training vs. production data.
- Authentication: Secure Login and Signup flow with JWT-ready structure (currently demo mode).
- Role-Based Access: Foundations for Admin vs. Viewer roles.
- Cost-Benefit Engine: Automatically calculates whether it is profitable to retrain the model based on current revenue loss vs. compute costs.
- Automated Scheduling: One-click scheduling for retraining jobs when thresholds are breached.
- Historical Analysis: View drift trends over 30/60/90 days to identify slow-burning degradation.
- Interactive Reports: Export comprehensive drift reports (`.csv`, `.pdf`) for compliance and auditing.
- Configurable Rules: Set conditional alerts (e.g., "If Income PSI > 0.2 for 6 hours").
- Multi-Channel Notification: Integration logic for Slack, Email, and PagerDuty (simulated).
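This README doesn't spell out how the composite Health Score is computed. As a rough illustration only, here is a minimal sketch that folds per-feature PSI values into a 0-100 score using the conventional PSI >= 0.2 significant-drift threshold; the `health_score` helper, the linear weighting scheme, and the feature names are assumptions, not DriftGuard's actual formula:

```python
def health_score(feature_psi: dict[str, float]) -> float:
    """Map per-feature PSI values to a composite 0-100 health score.

    Illustrative scheme: each feature gets an equal share of the 100-point
    budget; a feature's PSI at or above 0.2 (significant drift) costs its
    full share, and smaller PSI values cost a linearly scaled fraction.
    """
    if not feature_psi:
        return 100.0
    per_feature = 100.0 / len(feature_psi)
    penalty = sum(min(psi / 0.2, 1.0) * per_feature for psi in feature_psi.values())
    return round(max(0.0, 100.0 - penalty), 1)

print(health_score({"Income": 0.02, "Age": 0.01, "Debt Ratio": 0.03}))  # → 90.0
print(health_score({"Income": 0.45, "Age": 0.30, "Debt Ratio": 0.25}))  # → 0.0
```

A real scoring function would likely weight features by importance and blend in the KS and KL signals as well; this sketch only shows the shape of the mapping.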
## 🔬 Drift Detection

DriftGuard uses multiple statistical methods to detect distributional shifts: Population Stability Index (PSI), the Kolmogorov-Smirnov (KS) test, and Kullback-Leibler (KL) Divergence. Here is a snippet of the detection logic from the backend:
```python
# backend/drift_detection.py
import numpy as np

def calculate_psi(expected_array, actual_array, buckets=10, bucket_type='quantiles'):
    """Population Stability Index (PSI) to measure data drift.

    PSI < 0.1  -> no significant drift
    PSI < 0.2  -> moderate drift
    PSI >= 0.2 -> significant drift
    """
    if bucket_type == 'quantiles':
        breakpoints = np.percentile(expected_array, np.arange(0, buckets + 1) / buckets * 100)
    else:  # fixed-width buckets
        breakpoints = np.linspace(np.min(expected_array), np.max(expected_array), buckets + 1)
    expected_prop = np.histogram(expected_array, bins=breakpoints)[0] / len(expected_array)
    actual_prop = np.histogram(actual_array, bins=breakpoints)[0] / len(actual_array)
    # Small epsilon keeps empty buckets from causing division by zero in the log
    expected_prop = expected_prop + 1e-6
    actual_prop = actual_prop + 1e-6
    psi_value = np.sum((actual_prop - expected_prop) * np.log(actual_prop / expected_prop))
    return psi_value

def calculate_ks(expected_array, actual_array):
    """Kolmogorov-Smirnov statistic for distribution comparison."""
    # ... (full logic in backend/drift_detection.py) ...

def calculate_kl(expected_array, actual_array, buckets=10, bucket_type='quantiles'):
    """Kullback-Leibler Divergence for distribution shift measurement."""
    # ... (full logic in backend/drift_detection.py) ...
```
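To sanity-check the PSI thresholds, here is a self-contained sketch. It re-implements a minimal quantile-bucket PSI rather than importing the backend module, so the `psi` helper and the synthetic income distributions below are illustrative assumptions, not DriftGuard code:

```python
import numpy as np

def psi(expected, actual, buckets=10):
    # Quantile breakpoints come from the expected (training) distribution
    edges = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    e = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(42)
baseline = rng.normal(50_000, 10_000, 10_000)  # training-time income distribution
stable = rng.normal(50_000, 10_000, 10_000)    # production sample, no drift
shifted = rng.normal(60_000, 10_000, 10_000)   # production sample, mean shifted +20%

print(f"stable:  {psi(baseline, stable):.4f}")   # well under 0.1 (no drift)
print(f"shifted: {psi(baseline, shifted):.4f}")  # well over 0.2 (significant drift)
```

A one-standard-deviation mean shift like the one above lands far past the 0.2 threshold, which is why PSI is a good first-line detector for sudden drift.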
---
## 🏗️ Demo Scenarios
The project includes a robust simulation engine `demo_scenarios.py` to demonstrate various production states:
| Scenario | Description | Effect |
| :--- | :--- | :--- |
| **1. Baseline (Healthy)** | Normal distribution matching training data. | Health Score: ~98/100 |
| **2. Sudden Drift (Attack)** | Simulates a sudden shift in high-importance features. | Health Score: ~45/100 |
| **3. Gradual Decay** | Slowly introduces noise over time. | Health Score: ~80/100 |
Run a scenario using:
```bash
python demo_scenarios.py --scenario 2
```
## 📁 Project Structure

```
DriftGuard/
├── backend/                 # Flask API & ML Logic
│   ├── app.py               # Main application entry point
│   ├── drift_detection.py   # Core math for PSI/Drift
│   ├── data_generator.py    # Synthetic data generation
│   └── service.py           # Business logic layer
├── frontend/                # React Dashboard
│   ├── src/
│   │   ├── components/      # Dashboard, Trends, Alerts, etc.
│   │   ├── App.tsx          # Main routing & layout
│   │   └── types.ts         # TypeScript definitions
│   └── tailwind.config.js
├── data/                    # Local CSV storage for demo
└── demo_scenarios.py        # CLI tool for drift simulation
```

## ⚙️ Getting Started

### Prerequisites

- Python 3.9+
- Node.js 16+
### Backend Setup

```bash
cd backend

# Create virtual environment (optional)
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install flask flask-cors pandas numpy

# Run the API
python app.py
```

The API server runs on http://localhost:5000.
### Frontend Setup

```bash
cd frontend

# Install dependencies
npm install

# Start the development server
npm run dev
```

The dashboard runs on http://localhost:5173.
## 🤝 Contributing

Contributions to improve drift detection algorithms or add new visualization widgets are welcome.

- Fork the Project
- Create your Feature Branch (`git checkout -b feature/NewMetric`)
- Commit your Changes (`git commit -m 'Add KL Divergence metric'`)
- Push to the Branch (`git push origin feature/NewMetric`)
- Open a Pull Request
## 📄 License

Distributed under the MIT License. See `LICENSE` for more information.

## 👤 Author

Shafayat Saad - MLOps Engineer