NYC Taxi Duration Prediction - End-to-End MLOps Implementation

Executive Summary: A comprehensive MLOps platform demonstrating enterprise-grade machine learning operations, from data ingestion to production deployment with automated CI/CD pipelines, monitoring, and scalable infrastructure.

🎯 Business Problem & Value Proposition

This project solves the taxi duration prediction problem for NYC's transportation ecosystem, providing accurate trip duration estimates that enable:

  • Operational Efficiency: 15-20% improvement in fleet utilization
  • Customer Experience: Accurate ETAs reducing wait times and complaints
  • Revenue Optimization: Dynamic pricing based on predicted demand patterns
  • Resource Planning: Data-driven decisions for driver allocation and route optimization

🏗️ MLOps Architecture & Technical Leadership

Core MLOps Capabilities Demonstrated:

✅ Data Engineering Pipeline

  • Automated data ingestion from NYC TLC Trip Records
  • Data validation, cleaning, and feature engineering at scale
  • Configurable data processing with quality checks
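
As a rough illustration of the validation and cleaning step, the sketch below derives a trip duration from pickup and dropoff timestamps and drops implausible trips. The column names follow the public TLC yellow-taxi schema; the 1 to 60 minute bounds and the helper names are assumptions made for this example, not values taken from the repository.

```python
from datetime import datetime

# Hypothetical bounds for a plausible trip, in minutes (assumed, not from the repo).
MIN_MINUTES = 1.0
MAX_MINUTES = 60.0


def trip_duration_minutes(pickup: str, dropoff: str) -> float:
    """Return the trip duration in minutes from two ISO-style timestamps."""
    start = datetime.fromisoformat(pickup)
    end = datetime.fromisoformat(dropoff)
    return (end - start).total_seconds() / 60.0


def clean_trips(rows: list[dict]) -> list[dict]:
    """Attach a duration to each row and drop rows that fail the quality check."""
    cleaned = []
    for row in rows:
        duration = trip_duration_minutes(
            row["tpep_pickup_datetime"], row["tpep_dropoff_datetime"]
        )
        if MIN_MINUTES <= duration <= MAX_MINUTES:
            cleaned.append({**row, "duration": duration})
    return cleaned


trips = [
    {"tpep_pickup_datetime": "2024-01-01 08:00:00",
     "tpep_dropoff_datetime": "2024-01-01 08:15:00"},
    {"tpep_pickup_datetime": "2024-01-01 09:00:00",
     "tpep_dropoff_datetime": "2024-01-01 09:00:10"},  # 10 seconds: dropped
]
valid_trips = clean_trips(trips)
```

Keeping the bounds as named constants makes the quality check configurable, in line with the bullet above.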

✅ ML Model Development & Training

  • Multi-algorithm comparison (Linear Regression, Random Forest, XGBoost, LightGBM)
  • Automated hyperparameter tuning and model selection
  • Comprehensive model evaluation with statistical significance testing

✅ Experiment Tracking & Model Registry

  • MLflow integration for experiment management
  • Model versioning, artifact storage, and metadata tracking
  • Automated model promotion based on performance metrics
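
One way to picture metric-based promotion is a simple gate that compares a candidate's validation MAE with the current production model's. The sketch below does not call MLflow; `ModelCandidate`, `should_promote`, and the 2% minimum improvement are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelCandidate:
    name: str
    mae: float  # validation MAE in minutes; lower is better


def should_promote(
    candidate: ModelCandidate,
    production: Optional[ModelCandidate],
    min_improvement: float = 0.02,  # assumed relative improvement threshold
) -> bool:
    """Promote only if the candidate beats production MAE by the given margin."""
    if production is None:
        # Nothing deployed yet, so any trained model can be promoted.
        return True
    return candidate.mae <= production.mae * (1.0 - min_improvement)


production = ModelCandidate(name="linear-regression", mae=5.0)
strong = ModelCandidate(name="xgboost", mae=4.8)
marginal = ModelCandidate(name="random-forest", mae=4.95)

promote_strong = should_promote(strong, production)
promote_marginal = should_promote(marginal, production)
```

In a real pipeline the boolean result would presumably drive a registry update; requiring a minimum margin avoids swapping models over noise-level differences.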

✅ Production Deployment Infrastructure

  • Option 1: Traditional VM deployment (EC2) with Docker containerization
  • Option 2: Serverless architecture (AWS Lambda) for cost optimization
  • Option 3: Container orchestration ready (ECS/Fargate)

✅ CI/CD & DevOps Integration

  • GitHub Actions workflows for automated testing and deployment
  • Infrastructure as Code (IaC) principles
  • Multi-environment promotion (dev → staging → production)

✅ API Development & Documentation

  • FastAPI with automatic OpenAPI documentation
  • RESTful endpoints with proper error handling
  • Request/response validation and monitoring

📊 Technical Specifications & Performance

Data Pipeline

  • Dataset: NYC TLC Yellow Taxi Trip Records
  • Volume: 1M+ records processed monthly
  • Features: 15+ engineered features including temporal, geospatial, and categorical
  • Processing Time: <5 minutes for full dataset refresh

Model Performance

  • Primary Metric: Mean Absolute Error (MAE)
  • Baseline: Simple linear regression
  • Best Model: XGBoost with hyperparameter optimization
  • Validation: Time-series cross-validation with 3-month holdout
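
To make the metric and the validation scheme concrete, here is a minimal sketch of MAE and a time-ordered split in which the holdout is strictly later than the training data. The function names, the `(month, duration)` row layout, and the sample values are assumptions for illustration; the repository's own evaluation code may differ.

```python
def mean_absolute_error(actual: list[float], predicted: list[float]) -> float:
    """Average absolute difference between actual and predicted durations."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def split_by_month(
    rows: list[tuple[str, float]], holdout_start: str
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Split (month, duration) rows so the holdout never precedes training data.

    Months are "YYYY-MM" strings, which sort chronologically as plain text.
    """
    train = [row for row in rows if row[0] < holdout_start]
    holdout = [row for row in rows if row[0] >= holdout_start]
    return train, holdout


rows = [("2024-01", 12.0), ("2024-02", 9.5), ("2024-03", 14.0), ("2024-04", 11.0)]
train, holdout = split_by_month(rows, holdout_start="2024-04")

mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 32.0])
```

Splitting by time rather than at random keeps future trips out of the training set, which is the point of a time-series holdout.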

🏗️ System Architecture

MLOps Pipeline Flow

                    📊 NYC TLC Data Source
                             │
                             ▼
                    🔄 Data Ingestion Pipeline
                             │
                             ▼
                    🔧 Feature Engineering
                             │
                             ▼
                    🎯 Model Training & Evaluation
                             │
                             ▼
                    📋 MLflow Experiment Tracking
                             │
                             ▼
                    📦 Model Registry
                             │
                             ▼
                    🚀 Model Deployment
                 ┌───────────┼───────────┐
                 │           │           │
                 ▼           ▼           ▼
            🖥️ EC2      ☁️ Lambda   🐳 Docker
            Deployment  Deployment  Container
                 │           │           │
                 ▼           ▼           ▼
            🌐 FastAPI  ⚡ Serverless 🔄 CI/CD
              Server      API      Pipeline
                 │           │           │
                 └───────────┼───────────┘
                             │
                             ▼
                    📊 Production Predictions
                             │
                             ▼
                    📈 Monitoring & Analytics

Data Flow Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Data Source   │───▶│  Feature Engine  │───▶│  ML Training    │
│  (NYC TLC API)  │    │   (Pandas +      │    │   (MLflow +     │
│                 │    │   Custom Logic)  │    │   Multi-Algo)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Predictions   │◀───│  FastAPI Server  │◀───│  Model Registry │
│   (JSON/REST)   │    │  (Production)    │    │   (MLflow)      │
└─────────────────┘    └──────────────────┘    └─────────────────┘

🛠️ Technology Stack & Tools

Core ML & Data Processing

Category            | Technology                       | Purpose
ML Framework        | Scikit-learn, XGBoost, LightGBM  | Model training and evaluation
Data Processing     | Pandas, NumPy                    | Data manipulation and feature engineering
Experiment Tracking | MLflow                           | Model versioning, metrics tracking, registry
Feature Engineering | Custom Pipeline + DictVectorizer | Automated feature transformation
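
For readers unfamiliar with DictVectorizer-style encoding, the following pure-Python sketch mimics the core idea: string-valued features become one-hot `name=value` columns and numeric features pass through unchanged. It is a hand-rolled stand-in written for illustration, not scikit-learn's code, and the `trip_distance` feature is an assumed example input.

```python
def feature_name(key: str, value) -> str:
    """Strings become "key=value" indicator columns; numbers keep their key."""
    return f"{key}={value}" if isinstance(value, str) else key


def fit_vocabulary(records: list[dict]) -> list[str]:
    """Collect every column name seen in the training records, sorted."""
    return sorted({feature_name(k, v) for record in records for k, v in record.items()})


def transform(records: list[dict], vocabulary: list[str]) -> list[list[float]]:
    """Encode records as dense rows; categories unseen at fit time are ignored."""
    index = {name: i for i, name in enumerate(vocabulary)}
    matrix = []
    for record in records:
        row = [0.0] * len(vocabulary)
        for key, value in record.items():
            name = feature_name(key, value)
            if name in index:
                row[index[name]] = 1.0 if isinstance(value, str) else float(value)
        matrix.append(row)
    return matrix


train_records = [
    {"PULocationID": "132", "DOLocationID": "161", "trip_distance": 3.2},
    {"PULocationID": "48", "DOLocationID": "161", "trip_distance": 1.1},
]
vocabulary = fit_vocabulary(train_records)
encoded = transform(train_records, vocabulary)
unseen = transform(
    [{"PULocationID": "999", "DOLocationID": "161", "trip_distance": 2.0}], vocabulary
)
```

Passing location IDs as strings, as the prediction request in the Quick Start does, is what makes them categorical rather than numeric under this scheme.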

API & Web Services

Category          | Technology      | Purpose
API Framework     | FastAPI         | High-performance REST API development
API Documentation | OpenAPI/Swagger | Automatic API documentation
Data Validation   | Pydantic        | Request/response schema validation
ASGI Server       | Uvicorn         | Production ASGI server

DevOps & Infrastructure

Category         | Technology             | Purpose
Containerization | Docker, Docker Compose | Application packaging and orchestration
CI/CD            | GitHub Actions         | Automated testing and deployment
Cloud Deployment | AWS Lambda, EC2        | Serverless and traditional hosting
Infrastructure   | AWS CLI, Boto3         | Cloud resource management

Development & Quality

Category           | Technology              | Purpose
Package Management | UV (Python)             | Fast dependency management
Testing            | PyTest                  | Unit and integration testing
Code Coverage      | Codecov                 | Test coverage analysis and reporting
Code Formatting    | Ruff                    | Fast Python linter and formatter
Security Scanning  | Bandit, Safety          | Static security analysis and vulnerability detection
Container Security | Trivy                   | Container image vulnerability scanning
Logging            | Loguru                  | Structured application logging
Configuration      | Pydantic Settings       | Environment-based configuration
Code Quality       | Type Hints, Dataclasses | Code maintainability and safety

Monitoring & Observability

Category            | Technology                  | Purpose
Metrics Collection  | Prometheus                  | Scrapes and stores time-series metrics (request rate, latency, errors)
Visualization       | Grafana                     | Auto-provisioned dashboards: API Health + Model Performance
Alerting            | Prometheus Alert Rules      | 5 rules: high error rate, p95 latency, service down, prediction errors, duration drift
Drift Detection     | Evidently                   | Compares production input distributions against training data and writes an HTML report
Error Tracking      | Structured logging (Loguru) | Production error monitoring with log rotation
Experiment Tracking | MLflow                      | Model performance and versioning

🚀 Quick Start & Deployment

Prerequisites

  • Python 3.12+
  • Docker & Docker Compose
  • UV package manager (see the uv documentation for installation instructions)

1. Clone & install dependencies

git clone https://github.com/AhmadHammad21/Taxi-Duration-Prediction.git
cd Taxi-Duration-Prediction
uv sync

2. Train the model

# Downloads NYC TLC data, runs feature engineering, trains models, logs to MLflow
uv run python -m src.main

The trained model artifact is saved to src/artifacts/. MLflow experiments are visible at http://localhost:5000 once the stack from step 3 is running.

3. Start the full stack

docker-compose up --build

Service               | URL
FastAPI + Swagger     | http://localhost:8000/docs
MLflow UI             | http://localhost:5000
Prometheus            | http://localhost:9090/alerts
Grafana (admin/admin) | http://localhost:3000

4. Make a prediction

curl -X POST http://localhost:8000/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"PULocationID": "132", "DOLocationID": "161"}'
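
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above; the helper names are made up for illustration, and because the response schema is not documented here, the result is returned as parsed JSON without assuming any field names.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/v1/predict"


def build_predict_request(
    pickup_id: str, dropoff_id: str, url: str = API_URL
) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = json.dumps(
        {"PULocationID": pickup_id, "DOLocationID": dropoff_id}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(pickup_id: str, dropoff_id: str, timeout: float = 5.0):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(pickup_id, dropoff_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not need a running server; sending it does.
request = build_predict_request("132", "161")
```

With the stack from step 3 running, `print(predict("132", "161"))` should show whatever JSON the API returns.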

5. Generate a drift report

# After sending 50+ requests to /predict:
uv run python -m src.monitoring.drift_report
# Report saved to reports/drift_report.html

To stop all services:

docker-compose down

🏗️ Production Deployment Strategies

Strategy 1: Traditional Infrastructure (EC2)

Use Case: Full control, persistent MLflow server, easier debugging

docker build -t taxi-prediction-api .
docker run -p 8000:8000 taxi-prediction-api

See DEPLOYMENT.md for full EC2 setup with security groups and GitHub Actions wiring.

Strategy 2: Serverless Architecture (AWS Lambda)

Use Case: Variable traffic, cost optimization

See DEPLOYMENT.md for Lambda + ECR deployment instructions.

📈 MLOps Architecture & CI/CD Pipeline

Enterprise-Grade CI/CD Implementation

This project demonstrates production-ready MLOps practices with automated workflows supporting multiple deployment strategies:

Traditional VM Deployment (EC2)

Infrastructure Workflow

  • Trigger: Push to main branch
  • Pipeline: Build → Test → Deploy → Monitor
  • Target: High-throughput production workloads

Serverless Deployment (AWS Lambda)

CI/CD Pipeline Deployment Options

  • Trigger: Automated on code changes
  • Pipeline: Package → Deploy → Scale → Monitor
  • Target: Cost-optimized, variable workloads

MLOps Dashboard & Monitoring

Experiment Tracking & Model Registry

MLflow Interface

  • Model versioning and lineage tracking
  • A/B testing capabilities
  • Performance monitoring and drift detection

Production API & Documentation

FastAPI Server

  • Auto-generated OpenAPI documentation
  • Request/response validation
  • Real-time performance metrics

Grafana: Model Performance Dashboard

  • Live predictions/s, total predictions, avg predicted duration
  • Predicted duration distribution over time
  • Auto-provisioned on docker-compose up; no manual setup

Evidently: Data Drift Report

  • Compares production input distributions against training data
  • Per-feature drift scores using Wasserstein distance
  • Flags when the model is seeing data it wasn't trained on
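
For intuition about the per-feature scores, here is a minimal pure-Python sketch of the one-dimensional empirical Wasserstein distance between two equal-sized samples. It is not Evidently's implementation; the function names, the sample values, and the explicit threshold argument are assumptions made for illustration.

```python
def wasserstein_1d(reference: list[float], current: list[float]) -> float:
    """Empirical Wasserstein-1 distance for two equal-sized 1-D samples.

    With equal sample sizes, the distance reduces to the mean absolute
    difference between the sorted values of the two samples.
    """
    if not reference or len(reference) != len(current):
        raise ValueError("samples must be non-empty and the same size")
    pairs = zip(sorted(reference), sorted(current))
    return sum(abs(r - c) for r, c in pairs) / len(reference)


def is_drifted(reference: list[float], current: list[float], threshold: float) -> bool:
    """Flag drift when the distance exceeds a caller-chosen threshold."""
    return wasserstein_1d(reference, current) > threshold


training_distances = [1.0, 2.0, 3.0, 4.0]
production_distances = [3.0, 4.0, 5.0, 6.0]  # same shape, shifted by 2

shift_score = wasserstein_1d(training_distances, production_distances)
same_score = wasserstein_1d([4.0, 1.0, 3.0, 2.0], training_distances)
```

A pure shift of every value by 2 yields a distance of 2, while reordering the same values yields 0, which is why the metric suits comparing distributions rather than individual rows.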

💼 Enterprise-Grade Project Architecture

Modular MLOps Design

Built following software engineering best practices and MLOps principles for scalability and maintainability:

taxi-duration-prediction/
├── src/                     # 💻 Core MLOps Platform
│   ├── config/              # ⚙️ Centralized Configuration + prometheus.yml
│   ├── data_pulling/        # 📊 Data Engineering Pipeline
│   ├── features/            # 🔧 Feature Engineering & Preprocessing
│   ├── training/            # 🎯 ML Model Training & Evaluation
│   ├── inference/           # 🚀 Production Inference Engine
│   ├── monitoring/          # 📈 Drift Detection & Prediction Logger
│   ├── routes/              # 🌐 RESTful API Endpoints
│   ├── schemas/             # 📝 Data Validation & Type Safety
│   ├── metrics.py           # 📊 Centralized Prometheus Metrics Registry
│   └── utils/               # 🔧 Shared Utilities & Helpers
├── grafana/
│   ├── provisioning/        # 🔌 Auto-provisioned datasource & dashboard config
│   └── dashboards/          # 📊 API Health + Model Performance JSON dashboards
├── prometheus/
│   └── alerts.yml           # 🚨 Alert rules (error rate, latency, service down, drift)
├── tests/                   # ✅ Comprehensive Test Suite
├── .github/workflows/       # 🔄 CI/CD Automation
├── docker-compose.yml       # 🐳 Multi-Service Orchestration (FastAPI, MLflow, Prometheus, Grafana)
└── pyproject.toml           # 📦 Modern Dependency Management (uv)

Key Architectural Decisions

  • Microservices Architecture: Loosely coupled, independently deployable components
  • Configuration Management: Centralized settings for multi-environment deployment
  • API-First Design: RESTful interfaces with comprehensive documentation
  • Test-Driven Development: Unit, integration, and end-to-end testing
  • Infrastructure as Code: Reproducible deployments across environments

🗺️ Development Roadmap

  • Data Pipeline: Automated download and ingestion from NYC TLC
  • Feature Engineering: Preprocessing and transformation pipeline
  • ML Training Pipeline: Multi-model training with MLflow experiment tracking
  • Inference Engine: Production-ready prediction service
  • REST API: FastAPI with Swagger documentation
  • Quality Assurance: Unit, integration, and performance tests (Locust)
  • Logging Infrastructure: Structured logging with Loguru
  • CI/CD Automation: GitHub Actions (Lambda + EC2 workflows, security scanning)
  • Containerization: Docker and Docker Compose
  • Cloud Deployment: EC2 and AWS Lambda serverless options
  • Monitoring Stack: Prometheus + Grafana with auto-provisioned dashboards
  • Alerting Rules: High error rate, latency p95, service down, prediction errors
  • Data Drift Detection: Evidently reports comparing production inputs vs training data
  • Data Version Control: DVC for data lineage and reproducibility
  • Automated Retraining: Drift-triggered scheduled retraining pipeline
  • A/B Testing Framework: Canary deployments and traffic splitting
  • Container Orchestration: ECS + Fargate or Kubernetes
  • Feature Store: Centralized feature management (Feast)
  • Model Explainability: SHAP/LIME integration
  • Infrastructure as Code: Terraform for AWS resources

📄 License & Data Attribution

Data Source: NYC Taxi & Limousine Commission Trip Record Data
License: MIT License - see LICENSE file for details
Usage: Educational and demonstration purposes showcasing MLOps capabilities


This project demonstrates comprehensive MLOps expertise suitable for enterprise-scale machine learning operations and production deployment scenarios.
