Team Members: Will L., Derek P., Abdulaziz K., Mustafa H.
Instructor: Julian McAuley
This project explores a next-item recommendation task using a multimodal embedding model that combines domain-specific text representations with visual features.
We extend the BLaIR (Bridging Language and Items for Retrieval and Recommendation) framework by incorporating image embeddings from the CLIP vision encoder, creating a multimodal item representation we call BLaIR-CLIP. Because it relies on item content rather than collaborative signals, this model also naturally mitigates the cold-start problem, a major advantage over the standard baselines.
Our Motivating Question:
Does incorporating visual information (product images) improve product search and recommendation performance compared to text-only methods?
An in-depth showcase of the model definition and evaluation can be found in our Jupyter notebook.
Predict the next item a user will interact with, given their chronological interaction sequence. We treat this as a ranking task: given a user history, the model must rank the ground-truth "next item" higher than all other candidate products in the catalog.
- Recall@10 / Recall@50: Proportion of test cases where the correct item appears in the top-K results.
- NDCG@10: Normalized Discounted Cumulative Gain to measure ranking quality.
- MRR: Mean Reciprocal Rank of the first relevant item.
- AUC: Area Under the ROC Curve to evaluate separation of positives and negatives.
We compare our multimodal approach against traditional methods, testing each under two conditions to isolate the impact of visual data:
- TF-IDF + Cosine Similarity (With and Without images in the dataset); a minimal sketch follows this list.
- Matrix Factorization (MF) (With and Without images in the dataset).
- BLaIR-CLIP Fusion: Our proposed dual-encoder model combining BLaIR (RoBERTa) and CLIP (ViT).
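For reference, here is a minimal sketch of how a TF-IDF retrieval baseline of this kind can be implemented. The example texts, the mean-pooled user profile, and all variable names are illustrative assumptions, not our exact implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Represent each item by the TF-IDF vector of its unified text, then score
# every candidate by cosine similarity to the user's history profile.
item_texts = [
    "Stainless steel dishwasher with adjustable racks ...",
    "Compact washer dryer combo for small apartments ...",
]
vectorizer = TfidfVectorizer(max_features=50_000)
item_vecs = vectorizer.fit_transform(item_texts)        # (N_items, vocab), sparse

history = [0]                                           # indices of the user's past items
profile = np.asarray(item_vecs[history].mean(axis=0))   # mean TF-IDF vector of the history
scores = cosine_similarity(profile, item_vecs).ravel()  # (N_items,) candidate scores
```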
We use the Amazon Reviews 2023 dataset, specifically focusing on the Appliances category:
- Metadata: 94,327 products containing titles, descriptions, feature lists, and image links
- Reviews: 2,128,605 user interactions.
- Average Rating: 4.22/5.0.
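For reference, loading this slice of the data might look like the following sketch, assuming the Hugging Face release of the dataset (the repository and config names follow the McAuley-Lab/Amazon-Reviews-2023 conventions):

```python
from datasets import load_dataset

# Product metadata (titles, descriptions, features, image links) for Appliances
meta = load_dataset("McAuley-Lab/Amazon-Reviews-2023",
                    "raw_meta_Appliances", split="full", trust_remote_code=True)
# User reviews / interactions for the same category
reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023",
                       "raw_review_Appliances", split="full", trust_remote_code=True)

print(meta[0]["title"])   # product title
print(meta[0]["images"])  # image links used for the visual features
```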
- Text Unification: Combined product title, description, and bulleted features into a single string.
- User Filtering: Removed users with fewer than two interactions to allow for training and testing.
- Temporal Splitting: Employed a Leave-One-Out strategy—the final interaction for each user is held out for testing.
- Image Preprocessing: Images resized to 224x224 and normalized for the CLIP processor.
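A minimal sketch of the text unification and splitting steps (field names such as `title`, `description`, `features`, and `timestamp` follow the dataset schema; the helper names are our own, illustrative choices):

```python
def unify_text(item: dict) -> str:
    """Combine the product title, description, and bulleted features into one string."""
    parts = [item.get("title") or ""]
    parts += item.get("description") or []            # description is a list of strings
    parts += [f"- {f}" for f in item.get("features") or []]
    return " ".join(p for p in parts if p)

def leave_one_out_split(user_histories: dict) -> tuple[dict, dict]:
    """Drop users with fewer than two interactions, then hold out each user's
    chronologically final interaction for testing."""
    train, test = {}, {}
    for user, seq in user_histories.items():
        if len(seq) < 2:
            continue                                  # cannot both train and test
        seq = sorted(seq, key=lambda x: x["timestamp"])
        train[user] = seq[:-1]                        # everything but the last interaction
        test[user] = seq[-1]                          # the held-out "next item"
    return train, test
```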
The model is based on a Dual Encoder architecture, consisting of two separate neural networks:
- One for processing text
- One for processing images
These two towers encode their respective modalities into vectors in the same shared latent space. In this space, the goal is for matching text-image pairs to be close together, and mismatched pairs to be far apart.
- Base: RoBERTa-based transformer.
- Output: 768-dimensional CLS embedding.
- Base: OpenAI’s Contrastive Language-Image Pre-training (CLIP) ViT-B/32.
- Output: 512-dimensional image embedding.
- Projections: Linear layers map both text and image vectors into a shared 512-dimensional space.
- Fusion: Normalized dot-product (cosine) similarity in the shared space is used to combine the outputs of the two towers.
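To make this concrete, here is a minimal PyTorch sketch of the two towers and projections described above. It is a sketch under the stated dimensions, not our exact training code; in particular, the checkpoint names (`hyp1231/blair-roberta-base` for BLaIR, `openai/clip-vit-base-patch32` for CLIP) are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, CLIPModel

class BlairClipDualEncoder(nn.Module):
    """Two-tower sketch: BLaIR (RoBERTa) text encoder + CLIP ViT-B/32 image
    encoder, each projected into a shared 512-d space."""

    def __init__(self, shared_dim: int = 512):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("hyp1231/blair-roberta-base")
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text_proj = nn.Linear(768, shared_dim)   # 768-d CLS -> shared space
        self.image_proj = nn.Linear(512, shared_dim)  # 512-d CLIP features -> shared space

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]             # CLS token embedding
        return F.normalize(self.text_proj(cls), dim=-1)

    def encode_image(self, pixel_values):
        feats = self.clip.get_image_features(pixel_values=pixel_values)  # (B, 512)
        return F.normalize(self.image_proj(feats), dim=-1)

    def similarity(self, text_emb, image_emb):
        # Normalized dot product == cosine similarity in the shared space
        return text_emb @ image_emb.T
```

Because both outputs are L2-normalized, the dot product directly implements the normalized-similarity fusion described above.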
Details on how to run the BLaIR + CLIP model can be found here.
The evaluation methodology is as follows. For each user in the test set, we take:
- Their single held-out positive item
- All other items in the catalog as negatives

The model is then asked to produce a ranking. The metrics computed are:
- Recall@10: Whether the correct item appears in the top 10.
- Recall@50: Whether the correct item appears in the top 50, a more lenient cutoff.
- AUC: How well the model separates the positive item from the negatives.
- MRR: The reciprocal rank of the held-out item, averaged across users.
- NDCG: The overall quality of the ranking, with lower positions discounted logarithmically.
This evaluation setup is rigorous because the model is competing against thousands of possible negative items.
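Because each user has exactly one positive item, all of these metrics can be derived from the rank of that item. A minimal sketch (an illustrative helper of ours, not the project's code):

```python
import numpy as np

def metrics_from_rank(rank: int, n_items: int, k: int = 10) -> dict:
    """Per-user metrics from the 1-based rank of the held-out item among
    n_items candidates (one positive, the rest negatives). Averaging these
    values over all test users yields Recall@K, MRR, NDCG, and AUC."""
    return {
        f"recall@{k}": float(rank <= k),          # hit if the positive is in the top K
        "mrr": 1.0 / rank,                        # reciprocal rank of the positive
        "ndcg": 1.0 / np.log2(rank + 1),          # binary relevance, so IDCG = 1
        "auc": (n_items - rank) / (n_items - 1),  # fraction of negatives ranked below
    }
```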
The following snippet comes from the ranking loop. It retrieves the predicted scores, masks out items the user has already interacted with, and computes the rank of the single positive item; this rank determines the Recall and AUC metrics. Importantly, this evaluation code is shared across all baselines, ensuring a fair comparison.
```python
# Source: baseline_utils.py
import numpy as np

for i, (user_id, gt_item) in enumerate(test_data):
    gt_index = self.asin_to_index[gt_item]
    scores = score_func(user_id)   # Should return (N_items,)
    gt_score = scores[gt_index]    # Save the ground-truth score before masking

    # Mask items seen during training so they cannot outrank the ground truth
    train_items = self.train_interactions[user_id]
    train_indices = [self.asin_to_index[a] for a in train_items if a in self.asin_to_index]
    scores[train_indices] = -np.inf
    scores[gt_index] = gt_score    # Restore GT score in case it was masked

    # Rank = 1 + number of items scored strictly higher than the ground truth
    higher_scores = (scores > gt_score).sum()
    rank = higher_scores + 1
```

Our experiments show that neural multimodal embeddings perform better than the classical text-based and collaborative methods on the Appliances dataset:
- Neural Dominance: The BLaIR-CLIP model outperformed TF-IDF by approximately 6x and MF by 13x in Recall@10.
- Image Impact: Visual features help disambiguate products where text descriptions are vague or generic.
- Sparsity Handling: While Matrix Factorization struggled with the high sparsity of the interaction matrix (AUC ~0.48), the content-based BLaIR-CLIP model remained robust (AUC ~0.71+).
- Matrix Factorization
- Bayesian Personalized Ranking (BPR)
- First-order sequence models (last-item transitions)
- TF-IDF retrieval
- item2vec (Skip-Gram)
- Transformer text encoders
- VBPR
- DeepStyle
- CLIP-based retrieval
- BLaIR (text-only embedding model)
- First multimodal extension of BLaIR using CLIP
- Fusion of text + image for item representations
- Sequential evaluation via next-item prediction
```
project/
│
├── README.md
├── model_showcase.html        # notebook detailing the modeling process
├── baseline_utils.py          # utility file for splitting data and evaluating baseline models
│
├── baselines/
│   ├── baseline_mf.py         # MF model class definition
│   ├── baseline_tfidf.py      # TF-IDF model class definition
│   ├── run_baselines.py       # trains and evaluates the baseline models with and without images
│
├── encoders/
│   ├── clip_encoder.py        # encoder used for the CLIP model
│
├── blair/
│   ├── multimodal/
│   │   ├── blair_clip.py                # BLaIR-MM class definition
│   │   ├── sample_multimodal_data.py    # preprocesses the dataset for the MM model
```
BLaIR-MM produces multimodal item embeddings by combining text (BLaIR) and image (CLIP) signals.
When integrated into MF, these embeddings significantly outperform classical baselines in next-item recommendation, especially under cold-start conditions. Our results demonstrate that visual data is a vital signal in recommender systems, providing a clear performance boost over traditional text-only or interaction-only baselines.

