A Python data science project that uses NumPy, Pandas, and Matplotlib to implement an item-item collaborative filtering algorithm on a dataset of 3,000+ items. This is a group project by myself, Kelvin Bian, and Kishan Patel.
Libraries used: numpy, matplotlib, pandas, json, sklearn, scipy, warnings, gzip, surprise, tqdm
The dataset we worked with is the Amazon Luxury Beauty Dataset, a collection of 3,000+ items such as perfume, cologne, makeup, foundation, brushes, and skincare products.
In e-commerce, items are more stable than users, whose preferences change frequently. This makes items a reliable basis for recommendations.
- Determine Item Signatures: Based on user ratings in the user-item matrix.
- Find Similar Items: Identify items rated by the user that are similar to the target item.
- Predict Ratings: Calculate the weighted sum of ratings for similar items to generate a recommendation.
This method leverages item stability to provide consistent and personalized recommendations.
The dataset is filtered to retain only relevant metrics:
- `asin`: Product ID
- `reviewerID`: User ID
- `overall`: Rating
The data is grouped by reviewerID to analyze user-specific interactions. A random seed is set to ensure the reproducibility of results across different runs of the model. This guarantees consistent outputs for evaluation and comparison.
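The filtering, grouping, and seeding described above might look roughly like the following. This is a minimal sketch, not the project's actual code: the sample rows, the extra `reviewText` column, and the seed value `42` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the raw review records.
raw = pd.DataFrame({
    "asin": ["B001", "B002", "B001"],
    "reviewerID": ["U1", "U1", "U2"],
    "overall": [5.0, 3.0, 4.0],
    "reviewText": ["great", "ok", "nice"],  # extra column that gets dropped
})

# Keep only the three columns the model needs.
ratings = raw[["asin", "reviewerID", "overall"]].copy()

# Group by reviewer to inspect how many items each user has rated.
reviews_per_user = ratings.groupby("reviewerID")["asin"].count()

# Fix a seed so that any later shuffling is reproducible (value is assumed).
SEED = 42
rng = np.random.default_rng(SEED)
```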
Split data into training data set (80%) and testing data set (20%).
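One way to produce an 80/20 split is to shuffle the row indices with a seeded generator and cut the shuffled order. The helper name `train_test_split_indices` and its defaults are hypothetical; the project may instead rely on a library splitter.

```python
import numpy as np

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) as disjoint arrays of row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)           # shuffled row positions
    n_test = int(round(n_rows * test_fraction))
    return order[n_test:], order[:n_test]     # first n_test rows go to test

train_idx, test_idx = train_test_split_indices(10)
```

Because the generator is seeded, calling the helper again with the same arguments returns the same split.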
Handle duplicate (user, item) pairs by averaging all of a user's reviews for a specific item.
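With Pandas, averaging duplicate (user, item) pairs can be expressed as a group-by followed by a mean. The sample ratings below are made up for illustration.

```python
import pandas as pd

# Hypothetical ratings: user U1 reviewed item B001 twice.
ratings = pd.DataFrame({
    "reviewerID": ["U1", "U1", "U2"],
    "asin": ["B001", "B001", "B001"],
    "overall": [5.0, 3.0, 4.0],
})

# Collapse each (user, item) pair to a single row holding the mean rating.
deduped = ratings.groupby(["reviewerID", "asin"], as_index=False)["overall"].mean()
```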
- Compute an Item-Item similarity matrix across all items using cosine similarity as the similarity metric.
- Make predictions on the test set: for each (user, item) pair in the test set, choose the 5 items most similar to the target item among those the user has rated, and compute the similarity-weighted average of the user's ratings for them.
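The two steps above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the code in the repository: the ratings are held in a dense items-by-users array where `0` means "not rated", and the function names `item_cosine_similarity` and `predict_rating` are hypothetical.

```python
import numpy as np

def item_cosine_similarity(R):
    """Cosine similarity between the rows (items) of an items x users matrix."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # avoid dividing by zero for unrated items
    unit = R / norms
    return unit @ unit.T

def predict_rating(R, sim, item, user, k=5):
    """Similarity-weighted average of the user's ratings on the k nearest items."""
    rated = np.flatnonzero(R[:, user])      # items this user has rated
    rated = rated[rated != item]            # never use the target item itself
    if rated.size == 0:
        return None                         # cold start: nothing to base a prediction on
    top = rated[np.argsort(sim[item, rated])[::-1][:k]]
    weights = sim[item, top]
    if weights.sum() <= 0:
        return None
    return float(weights @ R[top, user] / weights.sum())

# Toy matrix: 3 items (rows) x 3 users (columns), 0 = no rating.
R = np.array([
    [5.0, 4.0, 0.0],
    [5.0, 4.0, 1.0],
    [1.0, 0.0, 5.0],
])
sim = item_cosine_similarity(R)
pred = predict_rating(R, sim, item=0, user=2)
```

In the toy matrix, item 0 is rated almost identically to item 1, so the prediction for user 2 is pulled toward that user's low rating of item 1 rather than their high rating of item 2. A dense array is only practical for small examples; a sparse matrix would be the more realistic choice at dataset scale.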
Use accuracy metrics such as RMSE and MAE to gauge model accuracy.
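For reference, both error metrics are short to compute with NumPy. The function names and the sample values below are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in rating points."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Made-up true and predicted ratings; the errors are 1, 0, and 1.
true_ratings = [4.0, 3.0, 5.0]
predicted_ratings = [3.0, 3.0, 4.0]
example_rmse = rmse(true_ratings, predicted_ratings)
example_mae = mae(true_ratings, predicted_ratings)
```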
Based on the information derived from Item-Item CF, recommend 10 items to each user.
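Selecting the top items for a user amounts to ranking predicted ratings and skipping items the user has already rated. The sketch below assumes predictions are available as a dictionary from item ID to predicted rating; the helper `top_n_items` and the sample data are hypothetical.

```python
def top_n_items(predicted, already_rated, n=10):
    """Return the n unrated items with the highest predicted rating."""
    candidates = [
        (item, score)
        for item, score in predicted.items()
        if item not in already_rated
    ]
    # Sort by descending score; break ties by item ID for deterministic output.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in candidates[:n]]

# Made-up predictions for one user who has already rated B1.
predicted = {"B1": 4.5, "B2": 3.0, "B3": 4.9, "B4": 2.0}
recommended = top_n_items(predicted, already_rated={"B1"}, n=2)
```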
Use metrics such as Precision, Recall, and NDCG to assess quality of recommendations.
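A compact way to compute all three ranking metrics for one user is shown below. It assumes binary relevance (an item is either relevant or not), which may differ from how the project defines relevance; the function name `ranking_metrics_at_k` and the sample lists are hypothetical.

```python
import math

def ranking_metrics_at_k(recommended, relevant, k=10):
    """Return (precision@k, recall@k, NDCG@k) under binary relevance."""
    top = list(recommended)[:k]
    hits = [1 if item in relevant else 0 for item in top]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # DCG discounts each hit by the log of its 1-based rank plus one.
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    # Ideal DCG places every relevant item at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

# Made-up recommendation list and relevant set for one user.
precision, recall, ndcg = ranking_metrics_at_k(
    recommended=["a", "b", "c", "d"], relevant={"a", "c", "x"}, k=4
)
```

Precision and recall ignore ordering within the top k, while NDCG rewards placing relevant items nearer the top of the list.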
Compare this approach with other approaches:
- Item-Item Collaborative Filtering with a Baseline estimate derived from global mean, user deviation, and item deviation.
- Content-based filtering, incorporating TF-IDF.
- SVD (Singular Value Decomposition): factorizes the rating matrix into matrices that capture the patterns in the data, with their values tuned to fit the dataset.
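To make the baseline estimate in the first alternative concrete, a common formulation is global mean plus user deviation plus item deviation. The sketch below uses simple unregularized deviations (each user's or item's mean rating minus the global mean), which is an assumption and may not match the project's implementation; the function names and sample data are hypothetical.

```python
import pandas as pd

def fit_baseline(train):
    """Compute the global mean and per-user / per-item deviations from it."""
    mu = train["overall"].mean()
    user_dev = train.groupby("reviewerID")["overall"].mean() - mu
    item_dev = train.groupby("asin")["overall"].mean() - mu
    return mu, user_dev.to_dict(), item_dev.to_dict()

def baseline_estimate(mu, user_dev, item_dev, user, item, lo=1.0, hi=5.0):
    """Global mean + user deviation + item deviation, clipped to the rating scale."""
    # Unseen users or items contribute a deviation of zero.
    estimate = mu + user_dev.get(user, 0.0) + item_dev.get(item, 0.0)
    return min(hi, max(lo, estimate))

# Made-up training ratings.
train = pd.DataFrame({
    "reviewerID": ["U1", "U1", "U2"],
    "asin": ["B1", "B2", "B1"],
    "overall": [5.0, 3.0, 4.0],
})
mu, user_dev, item_dev = fit_baseline(train)
estimate = baseline_estimate(mu, user_dev, item_dev, user="U2", item="B2")
```

Because unseen users and items fall back to the global mean, an estimate of this kind is always defined, which is what makes it usable when the similarity-based prediction has nothing to work with.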
- `itemFilter.py`: Implements the model described in Steps 1-5 (item-item collaborative filtering)
- `itemFilter_baseline.py`: Builds on `itemFilter.py` and provides baseline estimates for items with null predictions due to the cold start problem
- `contentCF.py`: Implements content-based collaborative filtering
- `SVDFilter.py`: Implements SVD-based filtering
- `recommendations.txt`: Recommendations of the top 10 items for each user, derived from `itemFilter.py`