Team Members: Will L., Derek P., Abdulaziz K., Mustafa H.
Instructor: Julian McAuley
This project explores a next-item recommendation task using a multimodal embedding model that combines domain-specific text representations with visual features.
We extend the BLaIR (Bridging Language and Items for Retrieval and Recommendation) framework by incorporating image embeddings from the CLIP vision encoder, creating a multimodal item representation we call BLaIR-CLIP. Because it relies on item content rather than collaborative signals, this model also naturally mitigates the cold-start problem, a major advantage over the standard baselines.
Our Motivating Question:
Does incorporating visual information (product images) improve product search and recommendation performance compared to text-only methods?
An in-depth showcase of the model definition and evaluation can be found in our Jupyter notebook.
Predict the next item a user will interact with, given their chronological interaction sequence. We treat this as a ranking task: given a user history, the model must rank the ground-truth "next item" higher than all other candidate products in the catalog.
- Recall@10 / Recall@50: Proportion of test cases where the correct item appears in the top-K results.
- NDCG@10: Normalized Discounted Cumulative Gain to measure ranking quality.
- MRR: Mean Reciprocal Rank of the first relevant item.
- AUC: Area Under the ROC Curve to evaluate separation of positives and negatives.
We compare our multimodal approach against traditional methods, testing each under two conditions to isolate the impact of visual data:
- TF-IDF + Cosine Similarity (With and Without images in the dataset); a minimal sketch follows this list.
- Matrix Factorization (MF) (With and Without images in the dataset).
- BLaIR-CLIP Fusion: Our proposed dual-encoder model combining BLaIR (RoBERTa) and CLIP (ViT).
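For reference, here is a minimal sketch of how a TF-IDF retrieval baseline of this kind can be implemented. The example texts, the mean-pooled user profile, and all variable names are illustrative assumptions, not our exact implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Represent each item by the TF-IDF vector of its unified text, then score
# every candidate by cosine similarity to the user's history profile.
item_texts = [
    "Stainless steel dishwasher with adjustable racks ...",
    "Compact washer dryer combo for small apartments ...",
]
vectorizer = TfidfVectorizer(max_features=50_000)
item_vecs = vectorizer.fit_transform(item_texts)        # (N_items, vocab), sparse

history = [0]                                           # indices of the user's past items
profile = np.asarray(item_vecs[history].mean(axis=0))   # mean TF-IDF vector of the history
scores = cosine_similarity(profile, item_vecs).ravel()  # (N_items,) candidate scores
```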
We use the Amazon Reviews 2023 dataset, specifically focusing on the Appliances category:
- Metadata: 94,327 products containing titles, descriptions, feature lists, and image links
- Reviews: 2,128,605 user interactions.
- Average Rating: 4.22/5.0.
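For reference, loading this slice of the data might look like the following sketch, assuming the Hugging Face release of the dataset (the repository and config names follow the McAuley-Lab/Amazon-Reviews-2023 conventions):

```python
from datasets import load_dataset

# Product metadata (titles, descriptions, features, image links) for Appliances
meta = load_dataset("McAuley-Lab/Amazon-Reviews-2023",
                    "raw_meta_Appliances", split="full", trust_remote_code=True)
# User reviews / interactions for the same category
reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023",
                       "raw_review_Appliances", split="full", trust_remote_code=True)

print(meta[0]["title"])   # product title
print(meta[0]["images"])  # image links used for the visual features
```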
- Text Unification: Combined product title, description, and bulleted features into a single string.
- User Filtering: Removed users with fewer than two interactions to allow for training and testing.
- Temporal Splitting: Employed a Leave-One-Out strategy—the final interaction for each user is held out for testing.
- Image Preprocessing: Images resized to 224x224 and normalized for the CLIP processor.
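A minimal sketch of the text unification and splitting steps (field names such as `title`, `description`, `features`, and `timestamp` follow the dataset schema; the helper names are our own, illustrative choices):

```python
def unify_text(item: dict) -> str:
    """Combine the product title, description, and bulleted features into one string."""
    parts = [item.get("title") or ""]
    parts += item.get("description") or []            # description is a list of strings
    parts += [f"- {f}" for f in item.get("features") or []]
    return " ".join(p for p in parts if p)

def leave_one_out_split(user_histories: dict) -> tuple[dict, dict]:
    """Drop users with fewer than two interactions, then hold out each user's
    chronologically final interaction for testing."""
    train, test = {}, {}
    for user, seq in user_histories.items():
        if len(seq) < 2:
            continue                                  # cannot both train and test
        seq = sorted(seq, key=lambda x: x["timestamp"])
        train[user] = seq[:-1]                        # everything but the last interaction
        test[user] = seq[-1]                          # the held-out "next item"
    return train, test
```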
The model is based on a Dual Encoder architecture, consisting of two separate neural networks:
- One for processing text
- One for processing images
These two towers encode their respective modalities into vectors in the same shared latent space. In this space, the goal is for matching text-image pairs to be close together, and mismatched pairs to be far apart.
- Base: RoBERTa-based transformer.
- Output: 768-dimensional CLS embedding.
- Base: OpenAI’s Contrastive Language-Image Pre-training (CLIP) ViT-B/32.
- Output: 512-dimensional image embedding.
- Projections: Linear layers map both text and image vectors into a shared 512-dimensional space.
- Fusion: Normalized dot-product (cosine) similarity in the shared space is used to combine the outputs of the two towers.
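To make this concrete, here is a minimal PyTorch sketch of the two towers and projections described above. It is a sketch under the stated dimensions, not our exact training code; in particular, the checkpoint names (`hyp1231/blair-roberta-base` for BLaIR, `openai/clip-vit-base-patch32` for CLIP) are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, CLIPModel

class BlairClipDualEncoder(nn.Module):
    """Two-tower sketch: BLaIR (RoBERTa) text encoder + CLIP ViT-B/32 image
    encoder, each projected into a shared 512-d space."""

    def __init__(self, shared_dim: int = 512):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("hyp1231/blair-roberta-base")
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text_proj = nn.Linear(768, shared_dim)   # 768-d CLS -> shared space
        self.image_proj = nn.Linear(512, shared_dim)  # 512-d CLIP features -> shared space

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]             # CLS token embedding
        return F.normalize(self.text_proj(cls), dim=-1)

    def encode_image(self, pixel_values):
        feats = self.clip.get_image_features(pixel_values=pixel_values)  # (B, 512)
        return F.normalize(self.image_proj(feats), dim=-1)

    def similarity(self, text_emb, image_emb):
        # Normalized dot product == cosine similarity in the shared space
        return text_emb @ image_emb.T
```

Because both outputs are L2-normalized, the dot product directly implements the normalized-similarity fusion described above.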
Details on how to run the BLaIR + CLIP model can be found here.
The evaluation methodology is as follows. For each user in the test set, we take:
- Their single held-out positive item
- All other items in the catalog as negatives

The model is then asked to produce a ranking. The metrics computed are:
- Recall@10: Whether the correct item appears in the top 10.
- Recall@50: Whether the correct item appears in the top 50, a more lenient cutoff.
- AUC: How well the model separates the positive item from the negatives.
- MRR: The reciprocal rank of the held-out item, averaged across users.
- NDCG: The overall quality of the ranking, with lower positions discounted logarithmically.
This evaluation setup is rigorous because the model is competing against thousands of possible negative items.
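Because each user has exactly one positive item, all of these metrics can be derived from the rank of that item. A minimal sketch (an illustrative helper of ours, not the project's code):

```python
import numpy as np

def metrics_from_rank(rank: int, n_items: int, k: int = 10) -> dict:
    """Per-user metrics from the 1-based rank of the held-out item among
    n_items candidates (one positive, the rest negatives). Averaging these
    values over all test users yields Recall@K, MRR, NDCG, and AUC."""
    return {
        f"recall@{k}": float(rank <= k),          # hit if the positive is in the top K
        "mrr": 1.0 / rank,                        # reciprocal rank of the positive
        "ndcg": 1.0 / np.log2(rank + 1),          # binary relevance, so IDCG = 1
        "auc": (n_items - rank) / (n_items - 1),  # fraction of negatives ranked below
    }
```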
The following snippet comes from the ranking loop. It retrieves the predicted scores, masks out items the user has already interacted with, and computes the rank of the single positive item; this rank determines the Recall and AUC metrics. Importantly, this evaluation code is shared across all baselines, ensuring a fair comparison.
```python
# Source: baseline_utils.py
import numpy as np

for i, (user_id, gt_item) in enumerate(test_data):
    gt_index = self.asin_to_index[gt_item]
    scores = score_func(user_id)   # Should return (N_items,)
    gt_score = scores[gt_index]    # Save the ground-truth score before masking

    # Mask items seen during training so they cannot outrank the ground truth
    train_items = self.train_interactions[user_id]
    train_indices = [self.asin_to_index[a] for a in train_items if a in self.asin_to_index]
    scores[train_indices] = -np.inf
    scores[gt_index] = gt_score    # Restore GT score in case it was masked

    # Rank = 1 + number of items scored strictly higher than the ground truth
    higher_scores = (scores > gt_score).sum()
    rank = higher_scores + 1
```

Our experiments show that neural multimodal embeddings perform better than the classical text-based and collaborative methods on the Appliances dataset:
- Neural Dominance: The BLaIR-CLIP model outperformed TF-IDF by approximately 6x and MF by 13x in Recall@10.
- Image Impact: Visual features help disambiguate products where text descriptions are vague or generic.
- Sparsity Handling: While Matrix Factorization struggled with the high sparsity of the interaction matrix (AUC ~0.48), the content-based BLaIR-CLIP model remained robust (AUC ~0.71+).
- Matrix Factorization
- Bayesian Personalized Ranking (BPR)
- First-order sequence models (last-item transitions)
- TF-IDF retrieval
- item2vec (Skip-Gram)
- Transformer text encoders
- VBPR
- DeepStyle
- CLIP-based retrieval
- BLaIR (text-only embedding model)
- First multimodal extension of BLaIR using CLIP
- Fusion of text + image for item representations
- Sequential evaluation via next-item prediction
```
project/
│
├── README.md
├── model_showcase.html        # notebook detailing the modeling process
├── baseline_utils.py          # utility file for splitting data and evaluating baseline models
│
├── baselines/
│   ├── baseline_mf.py         # MF model class definition
│   ├── baseline_tfidf.py      # TF-IDF model class definition
│   ├── run_baselines.py       # trains and evaluates the baseline models with and without images
│
├── encoders/
│   ├── clip_encoder.py        # encoder used for the CLIP model
│
├── blair/
│   ├── multimodal/
│   │   ├── blair_clip.py                # BLaIR-MM class definition
│   │   ├── sample_multimodal_data.py    # preprocesses the dataset for the MM model
```
BLaIR-MM produces multimodal item embeddings by combining text (BLaIR) and image (CLIP) signals.
When integrated into MF, these embeddings significantly outperform classical baselines in next-item recommendation, especially under cold-start conditions. Our results demonstrate that visual data is a vital signal in recommender systems, providing a clear performance boost over traditional text-only or interaction-only baselines.

