Python notebooks that clean the MovieLens ratings data, explore the catalog, and train recommender models ranging from bias-only baselines to a LightGBM regressor.
Generated artifacts:
- `ratings_without_timestamp.csv`: ratings with the timestamp column dropped.
- `df_movies_final.csv`: movie metadata with cleaned titles, an extracted `year`, and genre one-hot columns.
- `df_movies_with_score.csv`: merges `df_movies_final` with per-movie mean/median ratings and interaction counts.
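The cleaning behind these artifacts can be sketched in pandas. The raw column layout and the trailing-`(YYYY)` title convention below are assumptions based on the standard MovieLens files, not taken from the notebooks:

```python
import pandas as pd

# Toy stand-ins for the raw MovieLens files (layout assumed, not the notebooks').
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children", "Adventure|Children|Fantasy"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.0, 5.0],
})

# df_movies_final: extract the trailing "(YYYY)" year, clean the title,
# and one-hot encode the pipe-separated genres.
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)
movies["title"] = movies["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
df_movies_final = pd.concat(
    [movies.drop(columns="genres"), movies["genres"].str.get_dummies(sep="|")], axis=1
)

# df_movies_with_score: merge in per-movie mean/median ratings and counts.
stats = ratings.groupby("movieId")["rating"].agg(
    mean_rating="mean", median_rating="median", n_ratings="count"
).reset_index()
df_movies_with_score = df_movies_final.merge(stats, on="movieId", how="left")
```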
- Preprocess metadata: run `preprocess.ipynb` to create `df_movies_final.csv` and `ratings_without_timestamp.csv`.
- Build movie-level stats: run `Rating.ipynb` to create `df_movies_with_score.csv`.
- Explore: run `analysis.ipynb` for plots on popularity, genre influence, and rating distributions.
- Train models: run `training.ipynb`. It shuffles the ratings, splits them 60/20/20 (train/valid/test), and fits:
  - a bias-only baseline (user/item biases over the global mean),
  - a latent factor model with user/item biases plus K latent dimensions,
  - an `MLPRegressor` using user/movie averages, an encoded year, and genre indicators,
  - a LightGBM regressor on the same feature set.
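The shuffle, the 60/20/20 split, and the bias-only baseline can be sketched as below. This is a minimal approximation: the notebook may fit the biases jointly or with regularization, while here they are simple train-set means, and the toy ratings frame stands in for `ratings_without_timestamp.csv`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy ratings frame; the real notebook loads ratings_without_timestamp.csv.
ratings = pd.DataFrame({
    "userId":  rng.integers(0, 50, 1000),
    "movieId": rng.integers(0, 100, 1000),
    "rating":  rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], 1000),
})

# Shuffle, then split 60/20/20 into train/valid/test.
ratings = ratings.sample(frac=1.0, random_state=0).reset_index(drop=True)
n = len(ratings)
train = ratings.iloc[: int(0.6 * n)]
valid = ratings.iloc[int(0.6 * n): int(0.8 * n)]
test = ratings.iloc[int(0.8 * n):]

# Bias-only baseline: prediction = global mean + user bias + item bias.
mu = train["rating"].mean()
user_bias = train.groupby("userId")["rating"].mean() - mu
item_bias = train.groupby("movieId")["rating"].mean() - mu

# Unseen users/movies fall back to the global mean (bias 0).
pred = (mu
        + test["userId"].map(user_bias).fillna(0.0)
        + test["movieId"].map(item_bias).fillna(0.0))
mse = ((test["rating"] - pred) ** 2).mean()
```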
- LightGBM on a 5k-user subset: ~0.685 MSE with the `year` feature; ~0.688 MSE without it.
- LightGBM on a 2k-user subset without `year`: ~0.825 MSE.

Further metrics (for the bias-only and latent factor models) are printed in `training.ipynb` during execution.
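For reference, the latent factor model (user/item biases plus K latent dimensions) can be sketched with plain-NumPy SGD. The synthetic triples, hyperparameters, and update schedule here are illustrative assumptions, not the notebook's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 30, 40, 5

# Synthetic (user, item, rating) triples standing in for the train split.
users = rng.integers(0, n_users, 2000)
items = rng.integers(0, n_items, 2000)
y = rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], 2000)

mu = y.mean()                                # global mean
bu = np.zeros(n_users)                       # user biases
bi = np.zeros(n_items)                       # item biases
P = 0.1 * rng.standard_normal((n_users, K))  # user latent factors
Q = 0.1 * rng.standard_normal((n_items, K))  # item latent factors

lr, reg = 0.01, 0.05
for _ in range(20):  # epochs
    for u, i, r in zip(users, items, y):
        err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        # RHS uses the pre-update factors for both P[u] and Q[i].
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

train_mse = np.mean([(r - (mu + bu[u] + bi[i] + P[u] @ Q[i])) ** 2
                     for u, i, r in zip(users, items, y)])
```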
For a concise narrative of the experiments and results, see the PDF report: `Exploring User-Movie Interactions and Metadata for Rating Prediction.pdf`.