A comprehensive machine learning pipeline for predicting trip durations using various regression models, feature engineering, and hyperparameter optimization.
Medium post - https://medium.com/@hphadtare02/how-machine-learning-predicts-trip-duration-just-like-uber-zomato-91f7db6e9ce9
GoPredict/
├── main.py                          # Main runner script
├── config.py                        # Project configuration
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
│
├── data/                            # Data directory
│   ├── raw/                         # Raw data files
│   │   ├── train.csv                # Training data
│   │   └── test.csv                 # Test data
│   ├── processed/                   # Processed data files
│   │   ├── feature_engineered_train.csv
│   │   ├── feature_engineered_test.csv
│   │   └── gmapsdata/               # Google Maps data
│   └── external/                    # External data sources
│       └── precipitation.csv        # Weather data
│
├── src/                             # Source code
│   ├── model/                       # Model-related modules
│   │   ├── models.py                # All ML models and pipeline
│   │   ├── evaluation.py            # Model evaluation functions
│   │   └── save_models.py           # Model persistence
│   ├── features/                    # Feature engineering modules
│   │   ├── distance.py              # Distance calculations
│   │   ├── geolocation.py           # Geographic features
│   │   ├── gmaps.py                 # Google Maps integration
│   │   ├── precipitation.py         # Weather features
│   │   └── time.py                  # Time-based features
│   ├── feature_pipe.py              # Feature engineering pipeline
│   ├── data_preprocessing.py        # Data preprocessing
│   └── complete_pipeline_example.py # Usage examples
│
├── notebooks/                       # Jupyter notebooks
│   ├── 01_EDA.ipynb                 # Exploratory Data Analysis
│   ├── 02_Feature_Engineering.ipynb # Feature engineering
│   ├── 03_Model_Training.ipynb      # Model training
│   ├── figures/                     # Generated plots
│   └── gmaps/                       # Interactive maps
│
├── saved_models/                    # Trained models (auto-created)
├── output/                          # Predictions and submissions (auto-created)
└── logs/                            # Log files (auto-created)
# Clone the repository
git clone <your-repo-url>
cd GoPredict
# Install dependencies
pip install -r requirements.txt
# Create necessary directories
mkdir -p logs output saved_models

Ensure you have the following data files in place:
- data/raw/train.csv - Training data
- data/raw/test.csv - Test data
- data/external/precipitation.csv - Weather data
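
Before running the pipeline, it can help to confirm that the three data files listed above are present. The following is a minimal sketch using only the standard library. The helper name `find_missing_files` is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Paths copied from the list of required data files above.
REQUIRED_FILES = [
    "data/raw/train.csv",
    "data/raw/test.csv",
    "data/external/precipitation.csv",
]


def find_missing_files(root=".", required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under `root`.

    Hypothetical helper for illustration; not part of the GoPredict code.
    """
    root_path = Path(root)
    return [rel for rel in required if not (root_path / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_files()
    if missing:
        print("Missing data files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required data files are present.")
```

Run it from the repository root so the relative paths resolve against the GoPredict directory.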
# Run COMPLETE end-to-end pipeline (RECOMMENDED)
python main.py --mode complete
# Run complete pipeline with all models (assumes feature engineering is done)
python main.py --mode full
# Train specific models only (assumes feature engineering is done)
python main.py --mode train --models LINREG,RIDGE,XGB
# Make predictions only (assumes feature engineering is done)
python main.py --mode predict --models XGB
# Hyperparameter tuning only (assumes feature engineering is done)
python main.py --mode tune
# Enable XGBoost hyperparameter tuning
python main.py --mode complete --tune-xgb

| Model | Code | Description |
|---|---|---|
| Linear Regression | LINREG | Baseline linear model |
| Ridge Regression | RIDGE | Linear with L2 regularization |
| Lasso Regression | LASSO | Linear with L1 regularization |
| Support Vector Regression | SVR | Support vector machine |
| XGBoost | XGB | Gradient boosting (best performer) |
| Random Forest | RF | Ensemble of decision trees |
| Neural Network | NN | Deep learning model |
python main.py

Runs the complete end-to-end pipeline:
- Data preprocessing - Loads and cleans raw data
- Feature engineering - Adds distance, time, cluster, and weather features
- Model training - Trains all specified models
- Model evaluation - Compares model performance
- Prediction generation - Creates submission files
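
As an illustration of the distance feature mentioned in the feature engineering step, here is a sketch of a great-circle (haversine) distance between pickup and drop-off coordinates. This README does not state which formula src/features/distance.py uses, so the choice of haversine, the function name, and the argument names are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees.

    Assumed example of a distance feature; the project's actual
    implementation may differ.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

The result could be stored as one extra numeric column per trip, alongside the time, cluster, and weather features.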
python main.py --models XGB,RF

Train only specific models.

python main.py --tune-xgb

Enable XGBoost hyperparameter tuning.
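
The command-line options shown above (`--mode`, `--models`, `--tune-xgb`) could be parsed along the following lines. This is a sketch built on `argparse`, not the actual parser in main.py. The defaults are assumptions, chosen so that running with no arguments selects the complete pipeline with all seven model codes.

```python
import argparse

# Mode names and model codes as they appear in this README.
MODES = ["complete", "full", "train", "predict", "tune"]
MODEL_CODES = ["LINREG", "RIDGE", "LASSO", "SVR", "XGB", "RF", "NN"]


def parse_models(value):
    """Turn 'XGB,RF' into ['XGB', 'RF'], rejecting unknown codes."""
    codes = [c.strip().upper() for c in value.split(",") if c.strip()]
    unknown = [c for c in codes if c not in MODEL_CODES]
    if unknown:
        raise argparse.ArgumentTypeError(
            f"unknown model code(s): {', '.join(unknown)}"
        )
    return codes


def build_parser():
    """Hypothetical parser; main.py may define its options differently."""
    parser = argparse.ArgumentParser(description="Trip duration pipeline (sketch)")
    parser.add_argument("--mode", choices=MODES, default="complete")
    parser.add_argument("--models", type=parse_models, default=MODEL_CODES)
    parser.add_argument("--tune-xgb", action="store_true")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--mode", "train", "--models", "XGB,RF"])
print(args.mode, args.models, args.tune_xgb)
```

Validating the model codes in the parser means a typo such as `XBG` fails immediately instead of partway through training.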
- output/[model_name]/test_prediction_YYYYMMDD_HHMMSS.csv - Ready-to-submit prediction files with timestamps
- saved_models/[model_name]_YYYYMMDD_HHMMSS.pkl - Trained models with metadata
- logs/main.log - Complete pipeline execution log
- Detailed progress tracking and metrics
- output/prediction_comparison_YYYYMMDD_HHMMSS.png - Model comparison plots
- Feature importance plots
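
The timestamped file names above follow a `YYYYMMDD_HHMMSS` pattern. The sketch below shows one way to build such a name and to save and reload a model with `pickle`. The function names are hypothetical, and the actual logic in src/model/save_models.py may differ, for example in how it stores metadata.

```python
import pickle
from datetime import datetime
from pathlib import Path


def timestamped_path(directory, stem, suffix, now=None):
    """Build e.g. saved_models/XGB_20240102_030405.pkl (hypothetical helper)."""
    now = now or datetime.now()
    return Path(directory) / f"{stem}_{now.strftime('%Y%m%d_%H%M%S')}{suffix}"


def save_model(model, model_name, directory="saved_models", now=None):
    """Pickle `model` to a timestamped file and return the path written."""
    path = timestamped_path(directory, model_name, ".pkl", now)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)
```

Because the timestamp is part of the file name, repeated runs do not overwrite earlier models or predictions.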
Edit config.py to customize:
- Model parameters
- Data paths
- Output directories
- Hyperparameter tuning ranges
- Logging settings
from src.model.models import run_complete_pipeline
import pandas as pd
# Load data
train_df = pd.read_csv('data/processed/feature_engineered_train.csv')
test_df = pd.read_csv('data/processed/feature_engineered_test.csv')
# Run complete pipeline
results = run_complete_pipeline(
    train_df=train_df,
    test_df=test_df,
    models_to_run=['LINREG', 'RIDGE', 'XGB'],
    tune_xgb=True,
    create_submission=True
)

from src.model.models import run_regression_models, predict_duration, to_submission
# Train models
models = run_regression_models(train_df, ['XGB', 'RF'])
# Make predictions
predictions = predict_duration(models['XGBoost'], test_df)
# Create submission
submission_file = to_submission(predictions)

from src.model.models import hyperparameter_tuning_xgb
# Tune XGBoost
best_model, best_params, best_rmse = hyperparameter_tuning_xgb(train_df)
print(f"Best RMSE: {best_rmse}")
print(f"Best parameters: {best_params}")

- Feature Engineering: Distance calculations, time features, weather data
- Normalization: Custom normalization for different feature types
- Data Validation: Automatic data quality checks
- Multiple Algorithms: 7 different regression models
- Hyperparameter Tuning: Automated XGBoost optimization
- Cross-Validation: Built-in validation splits
- Progress Tracking: Detailed logging with sandwich format
- Comprehensive Metrics: RMSE, MAE, R², MAPE
- Visual Comparisons: Histogram comparisons, feature importance
- Model Persistence: Save and load trained models
- Submission Files: Ready-to-submit CSV files
- Visualizations: Plots and charts for analysis
- Logging: Complete audit trail
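
For reference, the four metrics named above (RMSE, MAE, R², MAPE) can be computed as in the following standard-library sketch. The function name is hypothetical and this is not the code in src/model/evaluation.py. Skipping zero-valued targets in MAPE is an assumption made here to avoid division by zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R^2 and MAPE (in percent) for two equal-length lists."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    # R^2 is undefined when all targets are identical.
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # Assumption: targets equal to zero are skipped in MAPE.
    ratios = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = 100 * sum(ratios) / len(ratios) if ratios else float("nan")

    return {"rmse": rmse, "mae": mae, "r2": r2, "mape": mape}
```

RMSE and MAE are in the same units as the target, while R² and MAPE are unitless, which makes them easier to compare across datasets.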
- Missing Data Files
  FileNotFoundError: Data file not found
  Solution: Ensure all required data files are in the correct directories.
- Import Errors
  ModuleNotFoundError: No module named 'xgboost'
  Solution: Install missing dependencies:
  pip install -r requirements.txt
- Memory Issues
  MemoryError: Unable to allocate array
  Solution: Reduce batch size or use fewer models.
- Check logs in logs/main.log for detailed error messages
- Verify data files are in the correct format and location
- Ensure all dependencies are installed correctly
Typical model performance on the validation set:
- XGBoost: ~400-450 RMSE (best performer)
- Random Forest: ~420-470 RMSE
- Linear Models: ~450-500 RMSE
- Neural Network: ~430-480 RMSE
- Automated feature selection
- Real-time prediction API
- Model monitoring dashboard
- A/B testing framework
This project is licensed under the MIT License - see the LICENSE file for details.
Please read CONTRIBUTING.md. By participating, you agree to abide by our CODE_OF_CONDUCT.md and report vulnerabilities per SECURITY.md.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For questions or issues, please:
- Check the logs first
- Review this documentation