Our team, in alphabetical order:
- Karan Khubdikar
- Mo Norouzi
- Nicole Bidwell
Welcome to the repository for the Airbnb Analysis.
With over 8 million active listings across more than 100,000 cities and towns, Airbnb boasts an extensive network of accommodations, offering travelers a wide range of unique stays (Airbnb, 2024). In this project, our team uses machine learning algorithms to predict listing prices from property details such as geographical location, room type, and review activity. We apply rigorous methods to analyze the data and build the models, including exploratory data analysis, feature engineering, cross-validation, hyperparameter optimization, and feature selection. We explore several models, including Ridge, Random Forest Regression, XGBoost, and LGBM Regressor, and incorporate Recursive Feature Elimination with Cross-Validation (RFECV). Furthermore, we explore SHAP values, which provide insight into feature importance and model interpretability. Airbnb and hosts could use this project to guide future listing prices and to understand the factors that drive prices.
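As a rough illustration of the feature-selection step mentioned above, here is a minimal sketch of Recursive Feature Elimination with Cross-Validation wrapped around a Ridge model in scikit-learn. It is not the code from this repository: the synthetic data, the function name run_rfecv_demo, the alpha value, the number of folds, and the scoring metric are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def run_rfecv_demo(seed=0):
    """Fit RFECV around a Ridge model on synthetic data (not the Airbnb data)."""
    rng = np.random.default_rng(seed)
    # Six synthetic features; only the first two actually drive the target.
    X = rng.normal(size=(200, 6))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

    # 5-fold cross-validation decides how many features to keep (assumed setup).
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    selector = RFECV(
        estimator=Ridge(alpha=1.0),  # alpha is an illustrative value
        step=1,  # drop one feature per elimination round
        cv=cv,
        scoring="neg_mean_squared_error",
    )
    selector.fit(X, y)
    return selector


selector = run_rfecv_demo()
print("features kept:", selector.n_features_)
print("selection mask:", selector.support_)
```

In a real run, the fitted selector's support_ mask would be applied to the engineered listing features before training the final model. The scripts in this repository may configure the estimator and scoring differently.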
A PDF copy of the final report can be viewed here.
The dataset used in this project is the New York City Airbnb Open Data, which is located on Kaggle.
The project can be run locally using a virtual environment. All required dependencies are listed in the environment file. To set up the environment, run the pipeline, and build the report, follow the steps below.
- Clone the repository.
git clone https://github.com/MoNorouzi23/Airbnb_analysis.git
- Install the dependencies by running the following command from the root of the repository.
conda env create -f environment.yml
- Activate the virtual environment.
conda activate airbnb_analysis
- Run the relevant scripts using the Makefile.
make all
Note: the command above runs only the scripts that are part of the main pipeline. These are the scripts for generating EDA plots, performing feature engineering, preprocessing, data splitting, RFECV model training, evaluation, and creating SHAP plots. To run an additional script, or to run any script individually, use the command: python -m src.<path>.<script name>. The outputs from all scripts (except for the processed data files and the random forest hyperparameter model) are already included in the repository.
- To build the report, run the following command from the root of the repository.
jb build report
You can delete the output generated by make all by running make clean. The output from scripts not included in the pipeline will remain.
All reports contained herein are licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License. See the license file for more information.
The software code contained in this repository is licensed under the MIT license. See the license file for more information.
If you reuse or remix this content, please provide attribution and include a link to this webpage.
Airbnb. (2024). About us. URL: https://news.airbnb.com/about-us/
Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3), 90-95. URL: https://matplotlib.org/
Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems (NIPS 2017). URL: https://shap.readthedocs.io/en/latest/
McKinney, W. (2010). Data Analysis with Python and Pandas. URL: https://pandas.pydata.org/
Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830. URL: https://scikit-learn.org/
Van der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13(2), 22-30. URL: https://numpy.org/
Vega, J., & Altair Development Team. (2017). Altair: A Declarative Statistical Visualization Library for Python. URL: https://altair-viz.github.io/