aokhader/BLAIR-MultiModal
Multi-Modal Next-Item Recommendation using BLAIR-MM (Text + Image Embeddings)

Team Members: Will L., Derek P., Abdulaziz K., Mustafa H.
Instructor: Julian McAuley


Project Overview

This project explores a next-item recommendation task using a multi-modal embedding model that combines domain-specific text representations with visual features.

We extend the BLAIR (Bridging Language and Items for Retrieval) framework by incorporating image embeddings through the CLIP vision encoder, creating a multimodal item representation we call BLaIR-CLIP. Because it relies on content rather than collaborative signals, this model also naturally mitigates the cold-start problem, a major advantage over the standard baselines.

Our Motivating Question:

Does incorporating visual information (product images) improve product search and recommendation performance compared to text-only methods?

An in-depth showcase of the model definition and evaluation can be found in our Jupyter notebook.

1. Predictive Task Definition

Task

Predict the next item a user will interact with, given their chronological interaction sequence. We treat this as a ranking task: given a user history, the model must rank the ground-truth "next item" higher than all other candidate products in the catalog.

Evaluation Metrics

  • Recall@10 / Recall@50: Proportion of test cases where the correct item appears in the top-K results.
  • NDCG@10: Normalized Discounted Cumulative Gain to measure ranking quality.
  • MRR: Mean Reciprocal Rank of the first relevant item.
  • AUC: Area Under the ROC Curve to evaluate separation of positives and negatives.
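With a single relevant item per test case, all four metrics can be derived from the rank of that item among the candidates. A minimal sketch (illustrative, not the project's exact implementation):

```python
import math

def metrics_from_rank(rank: int, n_candidates: int, k: int = 10) -> dict:
    """Ranking metrics for one test case, given the 1-indexed rank
    of the ground-truth item among n_candidates."""
    return {
        f"recall@{k}": 1.0 if rank <= k else 0.0,
        # With one relevant item, NDCG@k reduces to 1 / log2(rank + 1)
        f"ndcg@{k}": 1.0 / math.log2(rank + 1) if rank <= k else 0.0,
        "mrr": 1.0 / rank,
        # AUC = fraction of negatives ranked below the positive item
        "auc": (n_candidates - rank) / (n_candidates - 1),
    }

# Example: positive item ranked 3rd out of 1000 candidates
metrics = metrics_from_rank(3, 1000)
```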

Baselines

We compare our multimodal approach against traditional methods, testing each under two conditions to isolate the impact of visual data:

  1. TF-IDF + Cosine Similarity (with and without images in the dataset)
  2. Matrix Factorization (MF) (with and without images in the dataset)
  3. BLaIR-CLIP Fusion: our proposed dual-encoder model combining BLaIR (RoBERTa) and CLIP (ViT)
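As an illustrative sketch of the TF-IDF baseline (the real version lives in baselines/baseline_tfidf.py; this toy example assumes scikit-learn and made-up catalog text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalog: each item is its unified text (title + description + features)
items = [
    "stainless steel refrigerator french door",
    "compact mini fridge dorm",
    "front load washing machine steam",
]

vectorizer = TfidfVectorizer()
item_vecs = vectorizer.fit_transform(items)

# Score every catalog item against a query built from the user's history
query_vec = vectorizer.transform(["small fridge for a dorm room"])
scores = cosine_similarity(query_vec, item_vecs).ravel()
best = scores.argmax()  # index of the top-ranked item
```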

2. Dataset, EDA, and Preprocessing

Dataset

We use the Amazon Reviews 2023 dataset, specifically focusing on the Appliances category:

  • Metadata: 94,327 products containing titles, descriptions, feature lists, and image links
  • Reviews: 2,128,605 user interactions.
  • Average Rating: 4.22/5.0.

Preprocessing

  • Text Unification: Combined product title, description, and bulleted features into a single string.
  • User Filtering: Removed users with fewer than two interactions to allow for training and testing.
  • Temporal Splitting: Employed a Leave-One-Out strategy—the final interaction for each user is held out for testing.
  • Image Preprocessing: Images resized to 224×224 and normalized for the CLIP processor.
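The user filtering and leave-one-out split above can be sketched as follows (function and variable names are illustrative, not the project's):

```python
def leave_one_out_split(interactions):
    """interactions: dict of user_id -> list of (timestamp, item_id).
    Returns a train dict (all but the final item, in chronological order)
    and a test dict (the held-out final item). Users with fewer than
    two interactions are dropped, matching the filtering step."""
    train, test = {}, {}
    for user, events in interactions.items():
        if len(events) < 2:
            continue  # cannot both train and test on this user
        ordered = sorted(events)  # chronological order by timestamp
        train[user] = [item for _, item in ordered[:-1]]
        test[user] = ordered[-1][1]
    return train, test

train, test = leave_one_out_split({
    "u1": [(3, "C"), (1, "A"), (2, "B")],
    "u2": [(5, "D")],  # filtered out: only one interaction
})
```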

3. Modeling

Model Architecture — BLAIR-MM

The model is based on a Dual Encoder architecture, consisting of two separate neural networks:

  1. One for processing text
  2. One for processing images

These two towers encode their respective modalities into vectors in the same shared latent space. In this space, the goal is for matching text-image pairs to be close together, and mismatched pairs to be far apart.
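The "close together / far apart" objective is typically trained with a CLIP-style contrastive (InfoNCE) loss. A minimal NumPy sketch under that assumption (the project's actual training code may differ):

```python
import numpy as np

def clip_contrastive_loss(text_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matching (text, image) pairs.
    text_emb, img_emb: (B, D) arrays; row i of each is a matching pair."""
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))      # diagonal entries are the matches

    def xent(l):
        # Cross-entropy of the softmax over each row against the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text->image and image->text directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```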

Text Encoder — BLAIR

  • Base: RoBERTa-based transformer.
  • Output: 768-dimensional CLS embedding.

Image Encoder — CLIP

  • Base: OpenAI’s Contrastive Language-Image Pre-training (CLIP) ViT-B/32.
  • Output: 512-dimensional image embedding.

Fusion Module

  • Projections: Linear layers map both text and image vectors into a shared 512-dimensional space.
  • Fusion: Similarity is computed as a normalized dot product between the projected vectors in the shared space.
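The projection-and-scoring step can be sketched as below; the projection matrices are random stand-ins for the learned linear layers, so only the shapes and the scoring logic reflect the description above:

```python
import numpy as np

rng = np.random.default_rng(42)
D_TEXT, D_IMG, D_SHARED = 768, 512, 512  # BLaIR, CLIP, shared space dims

# Learned in practice; random placeholders here
W_text = rng.normal(scale=0.02, size=(D_TEXT, D_SHARED))
W_img = rng.normal(scale=0.02, size=(D_IMG, D_SHARED))

def fusion_score(text_vec, img_vec):
    """Project both modalities into the shared 512-d space, L2-normalize,
    and score with a dot product (i.e. cosine similarity)."""
    t = text_vec @ W_text
    v = img_vec @ W_img
    t /= np.linalg.norm(t)
    v /= np.linalg.norm(v)
    return float(t @ v)

score = fusion_score(rng.normal(size=D_TEXT), rng.normal(size=D_IMG))
# score lies in [-1, 1] because both vectors are unit-normalized
```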

Details on how to run the BLAIR + CLIP model can be found here.


4. Evaluation

Evaluation Protocol

For each user in the test set, the evaluation takes:

  • Their single held-out positive item
  • All other items in the catalog as negatives

The model is then asked to produce a ranking. The metrics computed are:

  • Recall@10: Whether the correct item appears in the top 10.
  • Recall@50: The same check with a deeper cutoff of 50.
  • AUC: How well the model separates the positive item from the negatives.
  • MRR: How quickly a user finds the first relevant item.
  • NDCG: The "holistic quality" of the ranking.

This evaluation setup is rigorous because the model is competing against thousands of possible negative items.
The following snippet comes from the ranking loop: predicted scores are computed, items the user has already interacted with are masked out, and the rank of the single positive item is determined. This rank drives the Recall, MRR, NDCG, and AUC metrics. Importantly, this evaluation code is shared across all baselines, ensuring a fair comparison.

# Source: baseline_utils.py
import numpy as np

for i, (user_id, gt_item) in enumerate(test_data):
    gt_index = self.asin_to_index[gt_item]

    scores = score_func(user_id)  # shape: (N_items,)
    gt_score = scores[gt_index]   # capture before masking

    # Mask items seen during training so they cannot outrank the target
    train_items = self.train_interactions[user_id]
    train_indices = [self.asin_to_index[a] for a in train_items
                     if a in self.asin_to_index]
    scores[train_indices] = -np.inf
    scores[gt_index] = gt_score   # restore the ground-truth score

    # 1-indexed rank of the positive item
    rank = (scores > gt_score).sum() + 1

Results

Our experiments show that neural multimodal embeddings perform better than classical text-based models and collaborative methods on the Appliances dataset:

(Figure: model comparison)

(Figure: model results)

Key Findings

  • Neural Dominance: The BLaIR-CLIP model outperformed TF-IDF by approximately 6× and MF by 13× on Recall@10.
  • Image Impact: Visual features help disambiguate products where text descriptions are vague or generic.
  • Sparsity Handling: While Matrix Factorization struggled with the high sparsity of the interaction matrix (AUC ~0.48), the content-based BLaIR-CLIP model remained robust (AUC ~0.71+).

5. Related Work

Classical Recommender Models

  • Matrix Factorization
  • Bayesian Personalized Ranking (BPR)
  • First-order sequence models (last-item transitions)

Text-based Retrieval Methods

  • TF-IDF retrieval
  • item2vec (Skip-Gram)
  • Transformer text encoders

Multi-Modal Recommendation

  • VBPR
  • DeepStyle
  • CLIP-based retrieval
  • BLAIR (text-only embedding model)

Our Contribution

  • First multimodal extension of BLAIR using CLIP
  • Fusion of text + image for item representations
  • Sequential evaluation via next-item prediction

Project Structure Highlights

project/
│
├── README.md
├── model_showcase.html                # notebook detailing the modeling process
├── baseline_utils.py                  # utility file for splitting data and evaluating baseline models
│
├── baselines/
│   ├── baseline_mf.py                 # MF model class definition
│   ├── baseline_tfidf.py              # TF-IDF model class definition
│   └── run_baselines.py               # trains and evaluates the baseline models with and without images
│
├── encoders/
│   └── clip_encoder.py                # encoder used for CLIP model
│
└── blair/
    ├── multimodal/
    ├── blair_clip.py                  # BLAIR-MM class definition
    └── sample_multimodal_data.py      # preprocess dataset for MM model

Conclusion

BLAIR-MM produces multimodal item embeddings by combining text (BLAIR) and image (CLIP) signals.
These embeddings significantly outperform classical baselines in next-item recommendation, especially under cold-start conditions. Our results demonstrate that visual data is a valuable signal in recommender systems, providing a substantial performance boost over traditional text-only or interaction-only baselines.
