End-to-end machine learning pipeline for trip duration prediction with feature engineering, regression models, and automated evaluation.

Z33xD/GoPredict

GoPredict - Machine Learning Pipeline for Trip Duration Prediction

A comprehensive machine learning pipeline for predicting trip durations using various regression models, feature engineering, and hyperparameter optimization.

Medium post - https://medium.com/@hphadtare02/how-machine-learning-predicts-trip-duration-just-like-uber-zomato-91f7db6e9ce9

📁 Project Structure

GoPredict/
├── main.py                          # Main runner script
├── config.py                        # Project configuration
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
│
├── data/                            # Data directory
│   ├── raw/                         # Raw data files
│   │   ├── train.csv                # Training data
│   │   └── test.csv                 # Test data
│   ├── processed/                   # Processed data files
│   │   ├── feature_engineered_train.csv
│   │   ├── feature_engineered_test.csv
│   │   └── gmapsdata/               # Google Maps data
│   └── external/                    # External data sources
│       └── precipitation.csv        # Weather data
│
├── src/                             # Source code
│   ├── model/                       # Model-related modules
│   │   ├── models.py                # All ML models and pipeline
│   │   ├── evaluation.py            # Model evaluation functions
│   │   └── save_models.py           # Model persistence
│   ├── features/                    # Feature engineering modules
│   │   ├── distance.py              # Distance calculations
│   │   ├── geolocation.py           # Geographic features
│   │   ├── gmaps.py                 # Google Maps integration
│   │   ├── precipitation.py         # Weather features
│   │   └── time.py                  # Time-based features
│   ├── feature_pipe.py              # Feature engineering pipeline
│   ├── data_preprocessing.py        # Data preprocessing
│   └── complete_pipeline_example.py # Usage examples
│
├── notebooks/                       # Jupyter notebooks
│   ├── 01_EDA.ipynb                 # Exploratory Data Analysis
│   ├── 02_Feature_Engineering.ipynb # Feature engineering
│   ├── 03_Model_Training.ipynb      # Model training
│   ├── figures/                     # Generated plots
│   └── gmaps/                       # Interactive maps
│
├── saved_models/                    # Trained models (auto-created)
├── output/                          # Predictions and submissions (auto-created)
└── logs/                            # Log files (auto-created)

🚀 Quick Start

1. Installation

# Clone the repository
git clone <your-repo-url>
cd GoPredict

# Install dependencies
pip install -r requirements.txt

# Create necessary directories
mkdir -p logs output saved_models

2. Data Preparation

Ensure you have the following data files in place:

  • data/raw/train.csv - Training data
  • data/raw/test.csv - Test data
  • data/external/precipitation.csv - Weather data
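A quick pre-flight check can confirm the three files listed above are present before a long run starts. The following is a minimal sketch, not part of the repository: the helper name `find_missing_files` is hypothetical, and the paths are assumed to be relative to the project root.

```python
from pathlib import Path

# File locations copied from the list above, relative to the project root.
REQUIRED_FILES = [
    "data/raw/train.csv",
    "data/raw/test.csv",
    "data/external/precipitation.csv",
]


def find_missing_files(project_root="."):
    """Return the required data files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

Calling `find_missing_files()` from the GoPredict directory returns an empty list when everything is in place, and otherwise the relative paths that still need to be supplied.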

3. Run the Pipeline

# Run COMPLETE end-to-end pipeline (RECOMMENDED)
python main.py --mode complete

# Run the full model pipeline with all models (assumes feature engineering is already done)
python main.py --mode full

# Train specific models only (assumes feature engineering is done)
python main.py --mode train --models LINREG,RIDGE,XGB

# Make predictions only (assumes feature engineering is done)
python main.py --mode predict --models XGB

# Hyperparameter tuning only (assumes feature engineering is done)
python main.py --mode tune

# Enable XGBoost hyperparameter tuning
python main.py --mode complete --tune-xgb
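To make the flag combinations above easier to reason about, here is a sketch of how such a command line could be parsed with the standard `argparse` module. It is an illustration only: the actual parser in main.py is not shown in this README, `build_parser` and `parse_model_codes` are hypothetical names, and the default mode is an assumption based on the description of running `python main.py` with no arguments.

```python
import argparse

# Mode names as they appear in the commands above.
MODES = ["complete", "full", "train", "predict", "tune"]


def parse_model_codes(value):
    """Split a comma-separated string such as "LINREG,RIDGE,XGB" into codes."""
    return [code.strip().upper() for code in value.split(",") if code.strip()]


def build_parser():
    """Build a parser for the flags shown in this README (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Assumption: "complete" is the default mode.
    parser.add_argument("--mode", choices=MODES, default="complete")
    # None means "no explicit model selection was given".
    parser.add_argument("--models", type=parse_model_codes, default=None)
    parser.add_argument("--tune-xgb", action="store_true")
    return parser


args = build_parser().parse_args(["--mode", "train", "--models", "LINREG,RIDGE,XGB"])
print(args.mode, args.models, args.tune_xgb)  # train ['LINREG', 'RIDGE', 'XGB'] False
```

Restricting `--mode` with `choices` means a mistyped mode fails immediately with a usage message instead of partway through a run.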

📊 Available Models

| Model | Code | Description |
|---|---|---|
| Linear Regression | LINREG | Baseline linear model |
| Ridge Regression | RIDGE | Linear with L2 regularization |
| Lasso Regression | LASSO | Linear with L1 regularization |
| Support Vector Regression | SVR | Support vector machine |
| XGBoost | XGB | Gradient boosting (best performer) |
| Random Forest | RF | Ensemble of decision trees |
| Neural Network | NN | Deep learning model |

🎯 Usage

Simple Pipeline (Default)

python main.py

Runs the complete end-to-end pipeline:

  • Data preprocessing - Loads and cleans raw data
  • Feature engineering - Adds distance, time, cluster, and weather features
  • Model training - Trains all specified models
  • Model evaluation - Compares model performance
  • Prediction generation - Creates submission files
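As an illustration of the distance part of the feature engineering step, one common choice for trip data is the great-circle (haversine) distance between pickup and drop-off coordinates. This README does not say which formula src/features/distance.py uses, so the function below is a sketch under that assumption, and `haversine_km` is a hypothetical name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

One degree of latitude comes out at roughly 111 km, which is a convenient sanity check when applying the function to real coordinates.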

Custom Models

python main.py --models XGB,RF

Train only specific models.

With Hyperparameter Tuning

python main.py --tune-xgb

Enable XGBoost hyperparameter tuning.

📈 Output Files

Predictions

  • output/[model_name]/test_prediction_YYYYMMDD_HHMMSS.csv
  • Ready-to-submit prediction files with timestamps

Models

  • saved_models/[model_name]_YYYYMMDD_HHMMSS.pkl
  • Trained models with metadata
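The two naming patterns above can be reproduced with `datetime.strftime`. This is a sketch of the convention only, not the project's own code: the helper names are hypothetical, and using the model code "XGB" as `[model_name]` is an assumption.

```python
from datetime import datetime
from pathlib import Path


def timestamp(now=None):
    """Format a datetime as YYYYMMDD_HHMMSS (current time if none is given)."""
    return (now or datetime.now()).strftime("%Y%m%d_%H%M%S")


def prediction_path(model_name, now=None):
    """Path following output/[model_name]/test_prediction_YYYYMMDD_HHMMSS.csv."""
    return Path("output") / model_name / f"test_prediction_{timestamp(now)}.csv"


def model_path(model_name, now=None):
    """Path following saved_models/[model_name]_YYYYMMDD_HHMMSS.pkl."""
    return Path("saved_models") / f"{model_name}_{timestamp(now)}.pkl"


fixed = datetime(2024, 1, 31, 9, 5, 7)
print(prediction_path("XGB", fixed).as_posix())  # output/XGB/test_prediction_20240131_090507.csv
print(model_path("XGB", fixed).as_posix())       # saved_models/XGB_20240131_090507.pkl
```

Because the timestamp is zero-padded and ordered from year down to second, sorting the file names alphabetically also sorts them chronologically.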

Logs

  • logs/main.log - Complete pipeline execution log
  • Detailed progress tracking and metrics

Visualizations

  • output/prediction_comparison_YYYYMMDD_HHMMSS.png
  • Model comparison plots
  • Feature importance plots

🔧 Configuration

Edit config.py to customize:

  • Model parameters
  • Data paths
  • Output directories
  • Hyperparameter tuning ranges
  • Logging settings

📝 Usage Examples

Basic Usage

from src.model.models import run_complete_pipeline
import pandas as pd

# Load data
train_df = pd.read_csv('data/processed/feature_engineered_train.csv')
test_df = pd.read_csv('data/processed/feature_engineered_test.csv')

# Run complete pipeline
results = run_complete_pipeline(
    train_df=train_df,
    test_df=test_df,
    models_to_run=['LINREG', 'RIDGE', 'XGB'],
    tune_xgb=True,
    create_submission=True
)

Individual Components

from src.model.models import run_regression_models, predict_duration, to_submission

# Train models
models = run_regression_models(train_df, ['XGB', 'RF'])

# Make predictions
predictions = predict_duration(models['XGBoost'], test_df)

# Create submission
submission_file = to_submission(predictions)

Hyperparameter Tuning

from src.model.models import hyperparameter_tuning_xgb

# Tune XGBoost
best_model, best_params, best_rmse = hyperparameter_tuning_xgb(train_df)
print(f"Best RMSE: {best_rmse}")
print(f"Best parameters: {best_params}")

🎨 Features

Data Processing

  • Feature Engineering: Distance calculations, time features, weather data
  • Normalization: Custom normalization for different feature types
  • Data Validation: Automatic data quality checks

Model Training

  • Multiple Algorithms: 7 different regression models
  • Hyperparameter Tuning: Automated XGBoost optimization
  • Cross-Validation: Built-in validation splits
  • Progress Tracking: Detailed logging with sandwich format

Evaluation

  • Comprehensive Metrics: RMSE, MAE, R², MAPE
  • Visual Comparisons: Histogram comparisons, feature importance
  • Model Persistence: Save and load trained models
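For reference, the four metrics named above follow standard definitions, which can be written out in plain Python. This sketch is independent of src/model/evaluation.py, whose implementation is not shown here; `regression_metrics` is a hypothetical name, and MAPE is reported in percent.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE (percent) for two equal-length lists."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R² is undefined when every true value is identical.
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        # MAPE assumes no true value is zero, which holds for positive durations.
        "mape": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }
```

RMSE and MAE are in the same units as the target, while R² and MAPE are unitless, so the latter two are easier to compare across datasets.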

Output

  • Submission Files: Ready-to-submit CSV files
  • Visualizations: Plots and charts for analysis
  • Logging: Complete audit trail

๐Ÿ› Troubleshooting

Common Issues

  1. Missing Data Files

    FileNotFoundError: Data file not found
    

    Solution: Ensure all required data files are in the correct directories

  2. Import Errors

    ModuleNotFoundError: No module named 'xgboost'
    

    Solution: Install missing dependencies: pip install -r requirements.txt

  3. Memory Issues

    MemoryError: Unable to allocate array
    

    Solution: Reduce batch size or use fewer models
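For the import error case above, it can help to check which packages are missing before starting a run. The following sketch uses only the standard library; `missing_modules` is a hypothetical helper, and the list of names to check is up to the caller (requirements.txt remains the authoritative list).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example: xgboost appears in the error message above; pandas is imported
# in the usage examples.
print(missing_modules(["xgboost", "pandas"]))
```

An empty list means both packages are importable; any names printed can be installed with `pip install -r requirements.txt`.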

Getting Help

  • Check logs in logs/main.log for detailed error messages
  • Verify data files are in correct format and location
  • Ensure all dependencies are installed correctly

📊 Performance

Typical model performance on validation set:

  • XGBoost: ~400-450 RMSE (best performer)
  • Random Forest: ~420-470 RMSE
  • Linear Models: ~450-500 RMSE
  • Neural Network: ~430-480 RMSE

🔮 Future Enhancements

  • Automated feature selection

  • Real-time prediction API

  • Model monitoring dashboard

  • A/B testing framework

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Please read CONTRIBUTING.md. By participating, you agree to abide by our CODE_OF_CONDUCT.md and report vulnerabilities per SECURITY.md.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📞 Support

For questions or issues, please:

  1. Check the logs first
  2. Review this documentation
