A comprehensive machine learning automation pipeline for homeowner loss history prediction, featuring real-time monitoring, data quality checks, model explainability, and A/B testing.
This project implements a production-ready ML pipeline with the following key features:
- Data Ingestion & Preprocessing: Automated data ingestion from S3, preprocessing, and schema validation
- Data Quality Monitoring: Continuous monitoring of data quality with anomaly detection
- Model Explainability: SHAP values and feature importance tracking
- A/B Testing: Robust A/B testing framework for model promotion
- Real-time Monitoring: WebSocket-based real-time dashboard updates
- Automated Retraining: Drift detection and automated model retraining
- Human-in-the-Loop: Manual override capabilities for critical decisions
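The drift detection feature needs a concrete statistic to trigger on. As an illustrative sketch (not necessarily this project's implementation), a Population Stability Index (PSI) check compares a feature's current distribution to a training-time baseline; the `population_stability_index` helper below is hypothetical:

```python
import math
from collections import Counter

def population_stability_index(baseline, current, bins=10):
    """Compute PSI between two numeric samples using baseline-derived bins.

    Common rule-of-thumb thresholds: PSI < 0.1 means no drift,
    0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant baselines

    def bucket(x):
        # Clamp values outside the baseline range into the edge bins.
        return min(bins - 1, max(0, int((x - lo) / width)))

    b_counts = Counter(bucket(x) for x in baseline)
    c_counts = Counter(bucket(x) for x in current)

    psi = 0.0
    for i in range(bins):
        # A small floor avoids log(0) and division by zero for empty bins.
        p = max(b_counts[i] / len(baseline), 1e-6)
        q = max(c_counts[i] / len(current), 1e-6)
        psi += (q - p) * math.log(q / p)
    return psi
```

A retraining DAG could trigger whenever the PSI of any monitored feature exceeds 0.25.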
The pipeline is built from the following components:

- Airflow DAGs: Orchestrate the entire ML pipeline
- Data Quality Monitor: Tracks data quality metrics and detects anomalies
- Model Explainability Tracker: Monitors model interpretability
- A/B Testing Pipeline: Manages model comparison and promotion
- WebSocket Server: Provides real-time updates to the dashboard
- MLflow Integration: Tracks experiments and model metrics
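The Data Quality Monitor's anomaly detection can be as simple as a z-score test on batch-level statistics. A minimal sketch, assuming a hypothetical `is_anomalous` helper fed with a rolling history of some metric (e.g. a column's daily null rate):

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=3.0):
    """Flag a batch-level metric as anomalous when it sits more than
    `z_threshold` standard deviations away from its recent history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

A flagged batch would then be held back from training and surfaced to the human-in-the-loop review step rather than silently ingested.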
The pipeline runs on the following AWS services:

- Amazon MWAA: Managed Apache Airflow service
  - DAGs stored in S3
  - Automatic high availability
  - IAM integration

- SageMaker Model Registry: Model versioning and deployment
  - Automated model promotion
  - Version control
  - Secure deployment hooks

- AWS Amplify: Dashboard hosting
  - Automatic builds on the main branch
  - HTTPS and custom domain support
  - Environment variable management

- API Gateway + Lambda: Real-time updates
  - WebSocket API
  - Serverless architecture
  - Auto-scaling

- AWS Secrets Manager: Secure configuration
  - Centralized secrets storage
  - IAM-based access control
  - Encryption at rest

- CloudWatch + SNS: Monitoring and alerts
  - Centralized logging
  - Custom metrics
  - Slack integration
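The Secrets Manager integration might look like the sketch below. The secret id and JSON layout are assumptions, and the boto3 client is passed in so the function can be exercised with a stub:

```python
import json

def load_config(secrets_client, secret_id):
    """Fetch and parse a JSON secret from AWS Secrets Manager.

    `secrets_client` follows the boto3 Secrets Manager client interface
    (get_secret_value); injecting it keeps the function testable without
    AWS credentials.
    """
    resp = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])
```

In production this would be called as `load_config(boto3.client("secretsmanager"), "ml-automation/config")`, with the secret name being a project-specific choice.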
Prerequisites:

- Python 3.12+
- Node.js 18+
- AWS CLI configured
- GitHub account
- AWS account with appropriate permissions
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/ml_automation.git
  cd ml_automation
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp .env.template .env
  # Edit .env with your configuration
  ```

- Infrastructure Setup:

  ```bash
  # Install AWS CDK
  npm install -g aws-cdk

  # Deploy infrastructure
  cd infrastructure
  cdk deploy
  ```
- MWAA Environment:
  - Create an S3 bucket for DAGs
  - Configure the MWAA environment
  - Set up IAM roles
- Model Registry:
  - Configure the SageMaker Model Registry
  - Set up the model promotion workflow
  - Configure deployment hooks
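The promotion workflow needs a statistical gate deciding when an A/B challenger replaces the champion. One common choice (an assumption here, not necessarily what this project uses) is a one-sided two-proportion z-test on a binary success metric, e.g. whether a loss prediction fell within tolerance of the actual value:

```python
from math import erf, sqrt

def should_promote(champ_successes, champ_n, chall_successes, chall_n, alpha=0.05):
    """Promote the challenger only when its success rate is significantly
    higher than the champion's (one-sided two-proportion z-test)."""
    p1 = champ_successes / champ_n
    p2 = chall_successes / chall_n
    pooled = (champ_successes + chall_successes) / (champ_n + chall_n)
    se = sqrt(pooled * (1 - pooled) * (1 / champ_n + 1 / chall_n))
    if se == 0:
        return False
    z = (p2 - p1) / se
    # One-sided p-value via the standard normal CDF.
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return p_value < alpha
```

A passing gate would then call the SageMaker Model Registry to flip the challenger's approval status; a failing one keeps the champion deployed.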
- Dashboard Deployment:
  - Connect the GitHub repository to Amplify
  - Configure build settings
  - Set up environment variables
- WebSocket API:
  - Deploy the API Gateway WebSocket API
  - Configure Lambda functions
  - Set up connections
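A minimal Lambda handler for the WebSocket routes might look like this sketch. It keeps connection ids in process memory purely for illustration; a real deployment would persist them (e.g. in DynamoDB), since Lambda instances are stateless:

```python
import json

# Illustration only: in production, store connection ids in DynamoDB.
CONNECTIONS = set()

def handler(event, context):
    """Handle the $connect, $disconnect, and default WebSocket routes
    as delivered by API Gateway."""
    route = event["requestContext"]["routeKey"]
    conn_id = event["requestContext"]["connectionId"]

    if route == "$connect":
        CONNECTIONS.add(conn_id)
    elif route == "$disconnect":
        CONNECTIONS.discard(conn_id)
    else:
        # Default route: echo the payload back. A real handler would push
        # pipeline events to every connection via the
        # ApiGatewayManagementApi client instead.
        body = json.loads(event.get("body") or "{}")
        return {"statusCode": 200, "body": json.dumps({"echo": body})}

    return {"statusCode": 200}
```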
Project structure:

```
ml_automation/
├── dags/                    # Airflow DAGs
│   ├── tasks/               # Task implementations
│   └── utils/               # Utility functions
├── loss-history-dashboard/  # React dashboard
├── infrastructure/          # AWS CDK code
├── tests/                   # Test files
├── requirements.txt         # Python dependencies
└── .env.template            # Environment template
```
Continuous integration and deployment are automated:

- GitHub Actions Workflow:
  - Run tests on pull requests
  - Deploy infrastructure
  - Update the MWAA environment
  - Deploy the dashboard

- Infrastructure as Code:
  - AWS CDK for infrastructure
  - Automated deployments
  - Environment management

- Monitoring:
  - CloudWatch metrics
  - SNS notifications
  - Slack integration
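Custom CloudWatch metrics are published as `PutMetricData` entries. A sketch of a payload builder; the namespace and dimension names below are assumptions for illustration:

```python
from datetime import datetime, timezone

def build_metric(name, value, model_version, unit="None"):
    """Build one CloudWatch PutMetricData entry, dimensioned by model
    version so dashboards can compare deployments side by side."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
    }

# Usage (requires AWS credentials; namespace is a project-specific choice):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="MLAutomation",
#     MetricData=[build_metric("DataDriftPSI", 0.12, "v3")],
# )
```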
Run the test suite:

```bash
# Backend tests
pytest tests/

# Frontend tests
cd loss-history-dashboard
npm test
```

To contribute:

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.