This repository contains an optimized recommendation system for the MovieLens dataset, with a specific focus on cold-start user scenarios using a hybrid approach combining DeepFM and Approximate Nearest Neighbors (ANN).
The recommendation system combines several state-of-the-art techniques:
- DeepFM Model: Combines a factorization machine, which captures pairwise feature interactions, with a deep neural network that captures higher-order interactions
- HNSW Approximate Nearest Neighbors: For fast candidate retrieval during recommendation
- Cold-start Optimization: Specialized handling for users with limited rating history
- Efficient Recommendations: ANN-based retrieval enables fast recommendations with minimal accuracy loss
- Cold-start Handling: Optimized for new users with limited rating history
- Comprehensive Evaluation: Leave-one-out methodology with relevant metrics (Hit Rate, MRR, NDCG)
- Memory-efficient Design: Batch processing for handling large datasets
- Model Persistence: Save/load capabilities for model weights and mappings
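To make the factorization-machine half of DeepFM more concrete, here is a minimal pure-Python sketch of the second-order interaction term. It is an illustration only and not the repository's `DeepFM` class (which is built on torch); the function name `fm_second_order` and the toy vectors are assumptions introduced for this example.

```python
def fm_second_order(embeddings):
    """Return the sum of pairwise dot products between field embeddings.

    Uses the identity
        sum_{i<j} <v_i, v_j> = 0.5 * sum_d ((sum_i v_id)^2 - sum_i v_id^2)
    so the cost is linear in the number of fields instead of quadratic.
    """
    dim = len(embeddings[0])
    total = 0.0
    for d in range(dim):
        column = [vec[d] for vec in embeddings]
        square_of_sum = sum(column) ** 2
        sum_of_squares = sum(x * x for x in column)
        total += 0.5 * (square_of_sum - sum_of_squares)
    return total


# Hypothetical 2-dimensional user and item embeddings.
user_vec = [1.0, 2.0]
item_vec = [3.0, 4.0]
score = fm_second_order([user_vec, item_vec])
```

With only two fields (user and item), the term reduces to the plain dot product of the two embeddings. In DeepFM this term is added to the output of a deep network that reads the same embeddings.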
numpy
pandas
torch
hnswlib
scikit-learn
matplotlib
tqdm
# Load and preprocess data
ratings_df, movies_df = load_movielens_data("path/to/movielens")
data = preprocess_for_recommendation(ratings_df)
# Create recommendation system
rec_system = MovieLensRecommendationSystem(data["num_users"], data["num_items"])
rec_system.set_mapping(data["reverse_user_map"], data["reverse_movie_map"], movies_df, data["ratings_df"])
# Train model
rec_system.train(data["user_ids"], data["movie_ids"], data["labels"], epochs=10)
# Build ANN index for fast recommendations
rec_system.build_ann_index()
# Get recommendations for a user
recommendations = rec_system.recommend_items_ann(user_id, top_k=10)
# Run cold-start evaluation
results = run_improved_coldstart_evaluation(
"path/to/movielens",
sample_size=None,
num_test_users=100,
test_ratio=0.2
)
The code is organized into several key components:
- Data Loading and Processing
  - load_movielens_data(): Loads and samples the MovieLens dataset
  - prepare_coldstart_evaluation(): Creates training/testing splits for cold-start evaluation
  - preprocess_for_recommendation(): Prepares data for the recommendation model
- Core Models
  - DeepFM: Neural recommendation model combining factorization machines and deep learning
  - HNSWIndex: Wrapper for the HNSW approximate nearest neighbors index
  - CollisionlessEmbeddingTable: Embedding table with an expiration mechanism
- Recommendation System
  - MovieLensRecommendationSystem: Main class that integrates the models for recommendations
  - Methods for both standard and ANN-based recommendations
- Evaluation
  - create_leave_one_out_testset(): Creates test data for leave-one-out evaluation
  - evaluate_leave_one_out(): Evaluates recommendation methods using leave-one-out methodology
  - compare_coldstart_methods_leave_one_out(): Compares different recommendation approaches
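As a rough illustration of the leave-one-out idea behind the evaluation helpers, the sketch below holds out one item per user and trains on the rest. It is not the repository's `create_leave_one_out_testset()`; the function name `leave_one_out_split`, the `(user, item, timestamp)` tuple layout, and the choice to hold out the most recent interaction are assumptions made for this example.

```python
from collections import defaultdict


def leave_one_out_split(ratings):
    """Split (user, item, timestamp) tuples into train rows and held-out items.

    Assumption: the held-out item is each user's most recent interaction.
    Users with a single interaction stay entirely in the training set.
    """
    by_user = defaultdict(list)
    for user, item, ts in ratings:
        by_user[user].append((ts, item))

    train, test = [], {}
    for user, events in by_user.items():
        events.sort()  # oldest first
        if len(events) < 2:
            train.extend((user, item, ts) for ts, item in events)
            continue
        history, (_, last_item) = events[:-1], events[-1]
        train.extend((user, item, ts) for ts, item in history)
        test[user] = last_item
    return train, test


# Toy data: u1 has three ratings, u2 has only one.
ratings = [("u1", "a", 1), ("u1", "b", 3), ("u1", "c", 2), ("u2", "d", 5)]
train_rows, held_out = leave_one_out_split(ratings)
```

Here `held_out` maps `u1` to item `b`, its latest rating, while `u2` contributes only to training because there is nothing left to train on once its single rating is removed.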
The system handles cold-start users with several strategies:
- Strategic initial rating selection based on user preferences
- Larger candidate pool during ANN retrieval for better recall
- Hybrid re-ranking approach combining popularity and similarity
- Optimized thresholds for seen/unseen item determination
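The hybrid re-ranking strategy listed above can be sketched as a weighted blend of a normalized similarity score and a normalized popularity score. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `hybrid_rerank`, the min-max normalization, and the default weight `alpha=0.7` are all hypothetical.

```python
def hybrid_rerank(candidates, popularity, alpha=0.7, top_k=10):
    """Re-rank candidates by blending similarity with popularity.

    candidates: list of (item_id, similarity) pairs, e.g. from ANN retrieval
    popularity: dict mapping item_id to a rating count
    alpha:      weight on similarity; (1 - alpha) goes to popularity (assumed)
    """
    if not candidates:
        return []
    sims = [sim for _, sim in candidates]
    lo, hi = min(sims), max(sims)
    max_pop = max(popularity.get(item, 0) for item, _ in candidates)

    scored = []
    for item, sim in candidates:
        # Min-max normalize both signals to [0, 1] so the weights are comparable.
        sim_norm = (sim - lo) / (hi - lo) if hi > lo else 1.0
        pop_norm = popularity.get(item, 0) / max_pop if max_pop > 0 else 0.0
        scored.append((alpha * sim_norm + (1 - alpha) * pop_norm, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


# Toy candidate pool, deliberately larger than the final list length.
pool = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
counts = {"a": 10, "b": 100, "c": 1}
top_two = hybrid_rerank(pool, counts, alpha=0.7, top_k=2)
```

Retrieving more candidates than `top_k` before re-ranking mirrors the larger candidate pool mentioned above: a popular item that ranks slightly lower on similarity, like `b` here, still has a chance to reach the final list.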
The system is evaluated using:
- Hit Rate (HR@k): Percentage of users for whom the held-out item is in the top-k recommendations
- Mean Reciprocal Rank (MRR): Average of reciprocal ranks of the held-out items
- Normalized Discounted Cumulative Gain (NDCG): Measures the ranking quality of recommendations
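For leave-one-out evaluation, each user has exactly one relevant item, so all three metrics depend only on the rank of that item. The following sketch shows one way to compute them; it is not the repository's `evaluate_leave_one_out()`. The names `rank_of` and `evaluate` are hypothetical, and truncating MRR at `k` is an assumption of this example.

```python
import math


def rank_of(held_out, recommended):
    """Return the 1-based rank of the held-out item, or None if absent."""
    try:
        return recommended.index(held_out) + 1
    except ValueError:
        return None


def evaluate(recs_by_user, held_out_by_user, k=10):
    """Compute HR@k, MRR@k, and NDCG@k averaged over users."""
    hits = rr = ndcg = 0.0
    users = list(held_out_by_user)
    for user in users:
        top_k = recs_by_user.get(user, [])[:k]
        rank = rank_of(held_out_by_user[user], top_k)
        if rank is not None:
            hits += 1.0
            rr += 1.0 / rank
            # With a single relevant item the ideal DCG is 1,
            # so NDCG equals the discounted gain at the hit position.
            ndcg += 1.0 / math.log2(rank + 1)
    n = len(users) or 1
    return {"hr": hits / n, "mrr": rr / n, "ndcg": ndcg / n}


# Toy example: u1 hits at rank 2, u2 misses, u3 hits at rank 1.
recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"], "u3": ["p"]}
held = {"u1": "b", "u2": "q", "u3": "p"}
metrics = evaluate(recs, held, k=3)
```

In this toy case two of three users get a hit, so HR@3 is 2/3, and MRR@3 is (1/2 + 0 + 1)/3 = 0.5.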
Comparative evaluation between standard and ANN-based recommendation methods:
- ANN method is significantly faster (typically 5-10x speedup)
- Minimal loss in recommendation quality (comparable Hit Rate, MRR, and NDCG)