A comprehensive machine learning automation pipeline for homeowner loss history prediction, featuring real-time monitoring, data quality checks, model explainability, and A/B testing.
This project implements a production-ready ML pipeline with the following key features:
- Data Ingestion & Preprocessing: Automated data ingestion from S3, preprocessing, and schema validation
- Data Quality Monitoring: Continuous monitoring of data quality with anomaly detection
- Model Explainability: SHAP values and feature importance tracking
- A/B Testing: Robust A/B testing framework for model promotion
- Real-time Monitoring: WebSocket-based real-time dashboard updates
- Automated Retraining: Drift detection and automated model retraining
- Human-in-the-Loop: Manual override capabilities for critical decisions
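The drift detection feature needs a concrete statistic to trigger on. As an illustrative sketch (not necessarily this project's implementation), a Population Stability Index (PSI) check compares a feature's current distribution to a training-time baseline; the `population_stability_index` helper below is hypothetical:

```python
import math
from collections import Counter

def population_stability_index(baseline, current, bins=10):
    """Compute PSI between two numeric samples using baseline-derived bins.

    Common rule-of-thumb thresholds: PSI < 0.1 means no drift,
    0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant baselines

    def bucket(x):
        # Clamp values outside the baseline range into the edge bins.
        return min(bins - 1, max(0, int((x - lo) / width)))

    b_counts = Counter(bucket(x) for x in baseline)
    c_counts = Counter(bucket(x) for x in current)

    psi = 0.0
    for i in range(bins):
        # A small floor avoids log(0) and division by zero for empty bins.
        p = max(b_counts[i] / len(baseline), 1e-6)
        q = max(c_counts[i] / len(current), 1e-6)
        psi += (q - p) * math.log(q / p)
    return psi
```

A retraining DAG could trigger whenever the PSI of any monitored feature exceeds 0.25.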
The pipeline is built from the following components:

- Airflow DAGs: Orchestrate the entire ML pipeline
- Data Quality Monitor: Tracks data quality metrics and detects anomalies
- Model Explainability Tracker: Monitors model interpretability
- A/B Testing Pipeline: Manages model comparison and promotion
- WebSocket Server: Provides real-time updates to the dashboard
- MLflow Integration: Tracks experiments and model metrics
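The Data Quality Monitor's anomaly detection can be as simple as a z-score test on batch-level statistics. A minimal sketch, assuming a hypothetical `is_anomalous` helper fed with a rolling history of some metric (e.g. a column's daily null rate):

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=3.0):
    """Flag a batch-level metric as anomalous when it sits more than
    `z_threshold` standard deviations away from its recent history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

A flagged batch would then be held back from training and surfaced to the human-in-the-loop review step rather than silently ingested.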
The pipeline runs on the following AWS services:

- Amazon MWAA: Managed Apache Airflow service
  - DAGs stored in S3
  - Automatic high availability
  - IAM integration

- SageMaker Model Registry: Model versioning and deployment
  - Automated model promotion
  - Version control
  - Secure deployment hooks

- AWS Amplify: Dashboard hosting
  - Automatic builds on the main branch
  - HTTPS and custom domain support
  - Environment variable management

- API Gateway + Lambda: Real-time updates
  - WebSocket API
  - Serverless architecture
  - Auto-scaling

- AWS Secrets Manager: Secure configuration
  - Centralized secrets storage
  - IAM-based access control
  - Encryption at rest

- CloudWatch + SNS: Monitoring and alerts
  - Centralized logging
  - Custom metrics
  - Slack integration
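The Secrets Manager integration might look like the sketch below. The secret id and JSON layout are assumptions, and the boto3 client is passed in so the function can be exercised with a stub:

```python
import json

def load_config(secrets_client, secret_id):
    """Fetch and parse a JSON secret from AWS Secrets Manager.

    `secrets_client` follows the boto3 Secrets Manager client interface
    (get_secret_value); injecting it keeps the function testable without
    AWS credentials.
    """
    resp = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])
```

In production this would be called as `load_config(boto3.client("secretsmanager"), "ml-automation/config")`, with the secret name being a project-specific choice.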
Prerequisites:

- Python 3.12+
- Node.js 18+
- AWS CLI configured
- GitHub account
- AWS account with appropriate permissions
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/ml_automation.git
  cd ml_automation
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp .env.template .env
  # Edit .env with your configuration
  ```

- Infrastructure Setup:

  ```bash
  # Install AWS CDK
  npm install -g aws-cdk

  # Deploy infrastructure
  cd infrastructure
  cdk deploy
  ```
- MWAA Environment:
  - Create an S3 bucket for DAGs
  - Configure the MWAA environment
  - Set up IAM roles
- Model Registry:
  - Configure the SageMaker Model Registry
  - Set up the model promotion workflow
  - Configure deployment hooks
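The promotion workflow needs a statistical gate deciding when an A/B challenger replaces the champion. One common choice (an assumption here, not necessarily what this project uses) is a one-sided two-proportion z-test on a binary success metric, e.g. whether a loss prediction fell within tolerance of the actual value:

```python
from math import erf, sqrt

def should_promote(champ_successes, champ_n, chall_successes, chall_n, alpha=0.05):
    """Promote the challenger only when its success rate is significantly
    higher than the champion's (one-sided two-proportion z-test)."""
    p1 = champ_successes / champ_n
    p2 = chall_successes / chall_n
    pooled = (champ_successes + chall_successes) / (champ_n + chall_n)
    se = sqrt(pooled * (1 - pooled) * (1 / champ_n + 1 / chall_n))
    if se == 0:
        return False
    z = (p2 - p1) / se
    # One-sided p-value via the standard normal CDF.
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return p_value < alpha
```

A passing gate would then call the SageMaker Model Registry to flip the challenger's approval status; a failing one keeps the champion deployed.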
- Dashboard Deployment:
  - Connect the GitHub repository to Amplify
  - Configure build settings
  - Set up environment variables
- WebSocket API:
  - Deploy the API Gateway WebSocket API
  - Configure Lambda functions
  - Set up connections
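A minimal Lambda handler for the WebSocket routes might look like this sketch. It keeps connection ids in process memory purely for illustration; a real deployment would persist them (e.g. in DynamoDB), since Lambda instances are stateless:

```python
import json

# Illustration only: in production, store connection ids in DynamoDB.
CONNECTIONS = set()

def handler(event, context):
    """Handle the $connect, $disconnect, and default WebSocket routes
    as delivered by API Gateway."""
    route = event["requestContext"]["routeKey"]
    conn_id = event["requestContext"]["connectionId"]

    if route == "$connect":
        CONNECTIONS.add(conn_id)
    elif route == "$disconnect":
        CONNECTIONS.discard(conn_id)
    else:
        # Default route: echo the payload back. A real handler would push
        # pipeline events to every connection via the
        # ApiGatewayManagementApi client instead.
        body = json.loads(event.get("body") or "{}")
        return {"statusCode": 200, "body": json.dumps({"echo": body})}

    return {"statusCode": 200}
```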
Project structure:

```
ml_automation/
├── dags/                    # Airflow DAGs
│   ├── tasks/               # Task implementations
│   └── utils/               # Utility functions
├── loss-history-dashboard/  # React dashboard
├── infrastructure/          # AWS CDK code
├── tests/                   # Test files
├── requirements.txt         # Python dependencies
└── .env.template            # Environment template
```
Continuous integration and deployment are automated:

- GitHub Actions Workflow:
  - Run tests on pull requests
  - Deploy infrastructure
  - Update the MWAA environment
  - Deploy the dashboard

- Infrastructure as Code:
  - AWS CDK for infrastructure
  - Automated deployments
  - Environment management

- Monitoring:
  - CloudWatch metrics
  - SNS notifications
  - Slack integration
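Custom CloudWatch metrics are published as `PutMetricData` entries. A sketch of a payload builder; the namespace and dimension names below are assumptions for illustration:

```python
from datetime import datetime, timezone

def build_metric(name, value, model_version, unit="None"):
    """Build one CloudWatch PutMetricData entry, dimensioned by model
    version so dashboards can compare deployments side by side."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
    }

# Usage (requires AWS credentials; namespace is a project-specific choice):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="MLAutomation",
#     MetricData=[build_metric("DataDriftPSI", 0.12, "v3")],
# )
```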
Run the test suite:

```bash
# Backend tests
pytest tests/

# Frontend tests
cd loss-history-dashboard
npm test
```

To contribute:

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.