Project for tabular data science course. By Liel Gutman and Yaniv Rotics.
The project aims to improve hyperparameter selection in Kernel Density Estimation (KDE), specifically the bandwidth parameter, which controls the smoothness of the estimate. The proposed solution compares several bandwidth choices: the "scott" and "silverman" rules of thumb, and Grid Search Cross-Validation, an AutoML method available in scikit-learn that searches for the best hyperparameters. Performance was evaluated using mean squared error (MSE) and total log-likelihood. The best-performing methods were then tested on known distributions and on four datasets.
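The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes a standard normal sample as the known distribution, uses SciPy's `gaussian_kde` for the two rules of thumb and scikit-learn's `KernelDensity` with `GridSearchCV` for the cross-validated bandwidth. The sample sizes, the bandwidth grid, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Assumed setup: samples from a known distribution (standard normal).
rng = np.random.default_rng(0)
train = rng.normal(size=300)
test = rng.normal(size=200)
grid = np.linspace(-4, 4, 200)
true_pdf = norm.pdf(grid)

results = {}  # method name -> (MSE against the true pdf, total log-likelihood on test)

# Rules of thumb: SciPy derives the bandwidth from the sample itself.
for rule in ("scott", "silverman"):
    kde = gaussian_kde(train, bw_method=rule)
    mse = np.mean((kde(grid) - true_pdf) ** 2)
    loglik = np.sum(kde.logpdf(test))
    results[rule] = (mse, loglik)

# Grid search with cross-validation: KernelDensity.score returns the total
# log-likelihood, which GridSearchCV maximizes over the candidate bandwidths.
param_grid = {"bandwidth": np.logspace(-1, 0.5, 20)}  # illustrative grid
search = GridSearchCV(KernelDensity(kernel="gaussian"), param_grid, cv=5)
search.fit(train[:, None])  # scikit-learn expects a 2-D array

best = search.best_estimator_
cv_density = np.exp(best.score_samples(grid[:, None]))
results["grid_search_cv"] = (
    np.mean((cv_density - true_pdf) ** 2),
    best.score(test[:, None]),
)

for name, (mse, loglik) in results.items():
    print(f"{name}: MSE={mse:.6f}, total log-likelihood={loglik:.2f}")
```

A lower MSE and a higher total log-likelihood both indicate a better fit. MSE can only be computed when the true density is known, so on real datasets the held-out log-likelihood is the usable metric.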
The final report can be found here.
The four datasets are:
- Avocado prices (Kaggle)
- Spotify Songs (Kaggle)
- Action movie ratings, IMDb (Kaggle)
- NASA - Nearest Earth Objects (Kaggle)
To run the project, install the requirements and run the Jupyter notebook.