This README explains the code for the hackathon project. The project predicts waiting times at a venue from historical waiting time data, weather data, and additional temporal features. The prediction model is based on the XGBoost algorithm.
The project comprises several components:
- Python Code: The main code file contains the Python code for data preprocessing, feature engineering, model training, and prediction generation.
- Data Files: The project uses several CSV files:
  - `waiting_times_train.csv`: Contains historical waiting time data for training.
  - `waiting_times_X_test_final.csv`: Contains waiting time data for generating predictions.
  - `weather_data.csv`: Contains historical weather data.
- Output Files: Running the code generates one output file:
  - `predictions_final.csv`: Contains the predictions generated by the model.
- Loading Data: The code loads the training waiting time data (`waiting_times_train.csv`), testing waiting time data (`waiting_times_X_test_final.csv`), and weather data (`weather_data.csv`) using Pandas.
- Feature Engineering: The `add_time_features` function adds various temporal features to the datasets, such as day of week, month, hour, season, etc. This function is applied to all three datasets.
- Merging Data: The code merges the waiting time data with the weather data based on the `DATETIME` column.
- Column Cleanup: After merging, columns with the suffix `_drop` are removed from the datasets.
- Data Preparation: The code prepares the training and testing datasets by selecting relevant features and handling missing values.
- Model Definition: An XGBoost regressor is defined with hyperparameters tuned for the task.
- Pipeline Creation: A pipeline is created to streamline the data preprocessing and modeling steps.
- Model Training: The pipeline is fitted to the training data.
- Prediction Generation: The model generates predictions for the testing data.
- Output Generation: Predictions are saved to a CSV file (`predictions_final.csv`) along with relevant information such as `DATETIME` and `ENTITY_DESCRIPTION_SHORT`.
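
The feature engineering, merging, and column cleanup steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the body of `add_time_features`, the season encoding, the `merge_weather` helper, the use of `suffixes=("", "_drop")` in the merge, and the `WAIT_TIME` and `temp` columns are all assumptions made for the example.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, column: str = "DATETIME") -> pd.DataFrame:
    """Return a copy of df with temporal features derived from `column`."""
    out = df.copy()
    ts = pd.to_datetime(out[column])
    out[column] = ts
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month
    out["hour"] = ts.dt.hour
    # Assumed season encoding: 0=winter (Dec-Feb), 1=spring, 2=summer, 3=autumn.
    out["season"] = (ts.dt.month % 12) // 3
    return out


def merge_weather(waits: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: left-join weather onto waiting times, then clean up."""
    # Columns that exist in both frames keep their name on the left side and
    # receive the "_drop" suffix on the weather side (assumed mechanism).
    merged = waits.merge(weather, on="DATETIME", how="left", suffixes=("", "_drop"))
    # Column cleanup: discard every column whose name ends with "_drop".
    return merged.loc[:, ~merged.columns.str.endswith("_drop")]


# Tiny made-up frames standing in for the real CSV files.
waits = pd.DataFrame({
    "DATETIME": ["2022-07-01 10:00:00", "2022-12-24 18:00:00"],
    "ENTITY_DESCRIPTION_SHORT": ["Ride A", "Ride B"],
    "WAIT_TIME": [25, 40],  # hypothetical target column
})
weather = pd.DataFrame({
    "DATETIME": ["2022-07-01 10:00:00", "2022-12-24 18:00:00"],
    "temp": [27.5, 3.0],  # hypothetical weather column
})

train = merge_weather(add_time_features(waits), add_time_features(weather))
print(train.columns.tolist())
```

Because the time features are added to both frames before the merge, the weather-side copies are the ones that end up suffixed and removed, leaving a single set of temporal columns.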
The project requires the following Python libraries:
- Pandas
- NumPy
- XGBoost
- Scikit-learn
- Ensure all required data files (`waiting_times_train.csv`, `waiting_times_X_test_final.csv`, `weather_data.csv`) are placed in the same directory as the code file.
- Install the necessary Python dependencies.
- Run the script.
- After execution, check the `predictions_final.csv` file for the generated predictions.
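
Before running the script, it can help to confirm that the input files are in place. The check below is an optional sketch using only the standard library; the `missing_inputs` helper is hypothetical and not part of the project code.

```python
from pathlib import Path

# Input file names as listed in this README.
REQUIRED_FILES = (
    "waiting_times_train.csv",
    "waiting_times_X_test_final.csv",
    "weather_data.csv",
)


def missing_inputs(directory: str = ".") -> list[str]:
    """Return the required file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_inputs()
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```

Run it from the directory that contains the code file; an empty result means all three CSV files were found.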
- The code targets prediction accuracy, using an XGBoost regressor with tuned hyperparameters.
- Additional optimization or modification may be required based on specific requirements or new data.
- For any inquiries or issues, please contact the project contributors.