A multi-modal search engine using CLIP and a vector database to power both text-to-image and image-to-image search across a 44,000-item fashion catalog.
Course: ITAI 1378 - Computer Vision
Author: Alhassane Samassekou
Institution: Houston Community College
Project Type: Final Project
https://drive.google.com/file/d/1xLhumZZUQtEVR3VZwlSVyYf30DyU7Ku6/view?usp=sharing
VectorSearch is a multi-modal search engine that lets users query a fashion e-commerce catalog with both natural-language text and uploaded images. Built on CLIP (Contrastive Language-Image Pre-training) and a FAISS vector index, it retrieves products through semantic similarity rather than exact keyword matching.
- Natural Language Search: Query using descriptive phrases like "blue summer dress"
- Image-Based Search: Upload an image to find visually similar items
- Fast Retrieval: Sub-second query latency across 44,000+ products
- Semantic Understanding: Goes beyond exact keyword matching
Product discovery on e-commerce sites is fundamentally inefficient:
- Vocabulary Gap: Traditional search relies on exact text keywords but fails when users can't describe items precisely (e.g., "formal shirt" vs. "button-down oxford")
- Visual Complexity: Complex visual features are difficult to describe accurately with text alone
- Limited Discovery: Impossible to "search by inspiration" by uploading a photo of a style you saw
Who is affected:
- E-commerce businesses - Lost sales due to poor search experience
- Online shoppers - Frustration from inability to find desired products
- Digital marketers - Reduced conversion rates and product discovery
Poor search leads directly to customer frustration and lost sales. This limits product discovery opportunities and reduces overall conversion rates, impacting both customer satisfaction and business revenue.
VectorSearch bridges the vocabulary gap by mapping both images and text into the same embedding space, enabling true multi-modal search capabilities.
```
Input: Text Query ("red dress") OR Image Query (upload.jpg)
        ↓
Step 1: CLIP Encoder
        (Converts the text or image query into a 512-dimensional vector)
        ↓
Step 2: FAISS Vector Database
        (Searches the pre-computed 44k-image index for nearest neighbors)
        ↓
Output: Top 5 Similar Products
        (Displays the most relevant retrieved images)
```
- CLIP Model: Maps both images and text to the same embedding space, enabling cross-modal retrieval
- FAISS Vector Database: Enables efficient similarity search across embeddings with sub-millisecond latency
- PyTorch & Transformers: Framework for model inference and deployment
VectorSearch employs a retrieve-and-rank pipeline for efficient multi-modal search:
User Query (Text/Image) → CLIP Encoder (512-dim) → L2 Normalize → FAISS Index (44k vectors) → Top-K Results
- Query Encoding: CLIP converts text or image input into a 512-dimensional vector embedding
- Normalization: L2 normalization enables cosine similarity search
- Vector Search: FAISS efficiently retrieves the K nearest neighbors from the pre-computed index
- Result Display: Top-K most similar products are returned to the user
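The normalization step is what makes an inner-product index behave like cosine search: once vectors are scaled to unit length, their dot product equals the cosine of the angle between them. A minimal numeric check (standalone illustration, not project code):

```python
# Unit-length vectors: inner product == cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(512).astype("float32")
v = rng.random(512).astype("float32")
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cosine = float(u @ v)  # the same value an IndexFlatIP search would score
print(f"cosine similarity: {cosine:.4f}")
```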
Source: Fashion Product Images Dataset (Kaggle)
| Attribute | Details |
|---|---|
| Total Images | ~44,400 product images |
| Metadata | styles.csv with product descriptions |
| Labels | productDisplayName (e.g., "Blue T-Shirt") paired with images |
| Categories | Apparel, Footwear, Accessories, Personal Care, and more |
| Format | High-resolution product photos with white backgrounds |
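For reference, a minimal pandas sketch for loading this metadata. The column names match the Kaggle dataset; `on_bad_lines="skip"` guards against a few malformed rows in styles.csv, and the `images/<id>.jpg` path layout is an assumption about the extracted archive:

```python
# Sketch: load styles.csv and attach image paths (layout assumed, see above).
import pandas as pd

df = pd.read_csv("styles.csv", on_bad_lines="skip")
df["image_path"] = "images/" + df["id"].astype(str) + ".jpg"
print(df[["id", "masterCategory", "productDisplayName"]].head())
```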
- Image Processing: Load all ~44K images through pre-trained CLIP model
- Embedding Generation: Generate 512-dimensional vector embeddings for each image
- Index Building: Store embeddings in FAISS index for efficient retrieval
- Persistence: Save index to disk for fast loading during inference
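The persistence step can be as simple as the sketch below. The `image_embeddings.npy` name follows the notebooks; the `fashion.index` file name and the `embeddings`/`index` variables (from the code examples later in this README) are illustrative assumptions:

```python
# Sketch: persist and reload the embeddings and FAISS index.
import faiss
import numpy as np

np.save("image_embeddings.npy", embeddings)  # raw vectors, kept for re-indexing
faiss.write_index(index, "fashion.index")    # serialized FAISS index (name assumed)

# At inference time, load instead of recomputing:
embeddings = np.load("image_embeddings.npy")
index = faiss.read_index("fashion.index")
```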
| Technology | Purpose | Version |
|---|---|---|
| PyTorch | Deep learning framework | 2.0+ |
| Hugging Face Transformers | CLIP model implementation | 4.30+ |
| FAISS | Efficient vector similarity search | 1.7.4+ |
| Gradio | Interactive web demo UI | 4.0+ (optional) |
| Pandas | Data processing and management | 2.0+ |
| NumPy | Numerical computations | 1.24+ |
- Platform: Google Colab (Free Tier compatible)
- GPU: Free tier GPU sufficient for inference
- RAM: 8GB minimum, 16GB recommended
- Storage: ~5GB for dataset and embeddings
- Cost: $0 - Designed for free-tier resources
- CLIP: Industry standard for mapping images and text to the same embedding space
- FAISS: Highly efficient, lightweight library perfect for prototyping in Colab environments
- PyTorch: Most widely used deep learning framework with excellent community support
- Gradio: Enables rapid prototyping of interactive demos with minimal code
- Python 3.8 or higher
- CUDA-compatible GPU (optional, for faster inference)
- 8GB RAM minimum
- 5GB free disk space
```bash
git clone https://github.com/asamassekou10/ITAI-1378-FINAL_VectorSearch.git
cd ITAI-1378-FINAL_VectorSearch
```

```bash
# Using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n vectorsearch python=3.10
conda activate vectorsearch
```

```bash
pip install -r requirements.txt
```

You'll need a Kaggle API key to download the Fashion Product Images dataset.
- Get your Kaggle API key from kaggle.com/account
- Place `kaggle.json` in the project root directory
```bash
# Install Kaggle CLI
pip install kaggle

# Download dataset
kaggle datasets download -d paramaggarwal/fashion-product-images-small
unzip fashion-product-images-small.zip
```

The project consists of three main notebooks that should be run in sequence:
```bash
jupyter notebook 01_setup_and_exploration_Alhassane_Samassekou.ipynb
```

Purpose:
- Load and explore the Fashion Product dataset
- Analyze data distribution and quality
- Prepare metadata for embedding generation
```bash
jupyter notebook 02_embedding_pipeline_Alhassane_Samassekou.ipynb
```

Purpose:
- Generate CLIP embeddings for all product images
- Create and save the image embeddings (`image_embeddings.npy`) and build the FAISS index
- Validate embedding quality

⏱ Processing Time: ~30-60 minutes for 44k images on CPU
```bash
jupyter notebook 03_search_demo_Alhassane_Samassekou_Professional.ipynb
```

Purpose:
- Load pre-computed embeddings and FAISS index
- Initialize the CLIP model for query encoding
- Launch the interactive Gradio web interface
If you prefer command-line scripts over notebooks:
```bash
python src/generate_embeddings.py --data_path ./data/images --output_path ./embeddings/
python src/build_index.py --embeddings_path ./embeddings/image_embeddings.npy
```

Text-to-Image Search:
```bash
python src/search.py --query "blue summer dress" --top_k 5
```

Image-to-Image Search:
```bash
python src/search.py --image_path ./query_image.jpg --top_k 5
```

To launch the interactive web demo:

```bash
python src/demo.py
```

The Gradio interface will launch at http://localhost:7860
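For reference, a minimal sketch of how `src/demo.py` could wire up such an interface. It assumes the `search()` helper from the code examples below; the widget choices and category list are illustrative, not the project's exact UI:

```python
# Hypothetical Gradio wrapper around the search() helper (see code examples below).
import gradio as gr

def run_search(text_query, image_query, category):
    # Image queries take precedence over text when both are provided
    query = image_query if image_query is not None else text_query
    results = search(query, category_filter=category, k=5)
    return [(r["image"], f"{r['caption']} ({r['score']:.2f})") for r in results]

demo = gr.Interface(
    fn=run_search,
    inputs=[
        gr.Textbox(label="Text Query"),
        gr.Image(type="pil", label="Image Query"),
        gr.Dropdown(["All", "Apparel", "Footwear", "Accessories"],
                    value="All", label="Category"),
    ],
    outputs=gr.Gallery(label="Top Matches"),
)
demo.launch()  # serves on http://localhost:7860 by default
```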
Text Search:
- Enter a description in the "Text Query" field (e.g., "red formal dress")
- Select a category filter (optional)
- Click " Search" Image Search:
- Upload an image using the "Image Query" panel
- Select a category filter (optional)
- Click " Search" Note: Image queries take precedence over text when both are provided.
```
vectorsearch/
├── README.md
├── notebooks/
│   ├── 01_setup_and_exploration_Alhassane_Samassekou.ipynb
│   ├── 02_embedding_pipeline_Alhassane_Samassekou.ipynb
│   └── 03_search_demo_Alhassane_Samassekou_Professional.ipynb
└── docs/
    ├── Présentation.pdf
    └── AI_usage_log.md
```
Problem: 44K image embedding process may exceed Colab session limits
Solution Implemented:
- Batch processing with checkpoint saving every 5,000 images
- Reduced batch size to 32 for memory efficiency
- Added progress tracking with tqdm

Backup Plan: Process images in smaller chunks (5K at a time) or reduce the dataset to 10K images
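The checkpointing idea above, as a sketch. The batch size and file names are illustrative, and `model`, `processor`, and `image_paths` are as in the embedding code shown later in this README:

```python
# Hypothetical sketch: batched embedding with periodic checkpoints.
import numpy as np
import torch
from PIL import Image

CHECKPOINT_EVERY = 5_000  # save partial results every ~5k images
BATCH_SIZE = 32           # small batches keep memory use modest

chunks, done = [], 0
for start in range(0, len(image_paths), BATCH_SIZE):
    batch = [Image.open(p).convert("RGB") for p in image_paths[start:start + BATCH_SIZE]]
    inputs = processor(images=batch, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    chunks.append(feats.cpu().numpy())
    done += len(batch)
    # Save whenever a checkpoint boundary has been crossed
    if done // CHECKPOINT_EVERY != (done - len(batch)) // CHECKPOINT_EVERY:
        np.save(f"checkpoint_{done}.npy", np.vstack(chunks))
```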
Problem: Retrieved results might not match query intent
Solution Implemented:
- Used pre-trained CLIP ViT-B/32 (proven for fashion domain)
- Implemented proper L2 normalization for cosine similarity
- Added category filtering for targeted results

Backup Plan: Experiment with ViT-L/14 or domain-specific CLIP variants from Hugging Face
Problem: Gradio interface might be buggy or crash
Solution Implemented:
- Comprehensive error handling with try-except blocks
- Graceful degradation for missing images
- Input validation for empty queries

Backup Plan: Revert to simple Python function calls in a Colab notebook if Gradio fails
This project demonstrates mastery of:
- Multi-Modal Machine Learning
  - Understanding CLIP architecture and pre-training methodology
  - Cross-modal embedding alignment between text and images
- Vector Embeddings & Similarity Search
  - Generating high-dimensional vector representations
  - Efficient similarity search with FAISS indexing
- System Design & Optimization
  - Batch processing for large-scale data
  - GPU acceleration and memory management
  - Building scalable search pipelines
- Real-World Applications
  - Solving practical e-commerce challenges
  - Building production-ready prototypes
  - User interface design for AI systems
- Software Engineering
  - Version control with Git/GitHub
  - Documentation and code organization
  - Reproducible research practices
```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Generate image embeddings
embeddings = []
for img_path in image_paths:
    image = Image.open(img_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)  # shape: (1, 512)
    embeddings.append(embedding.numpy())
```

```python
import faiss

# Initialize FAISS index for Inner Product similarity
dimension = 512
index = faiss.IndexFlatIP(dimension)

# Stack (1, 512) rows into (N, 512), cast to float32 (required by FAISS),
# and L2-normalize so inner product equals cosine similarity
embeddings = np.vstack(embeddings).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
index.add(embeddings)
```

```python
def search(query, category_filter="All", k=5):
    # Encode query (text or image)
    if isinstance(query, str):
        inputs = processor(text=[query], return_tensors="pt")
        with torch.no_grad():
            vector = model.get_text_features(**inputs)
    else:
        inputs = processor(images=query, return_tensors="pt")
        with torch.no_grad():
            vector = model.get_image_features(**inputs)

    # Normalize so inner-product search returns cosine similarity
    vector = vector / vector.norm(p=2, dim=-1, keepdim=True)
    vector = vector.cpu().numpy().astype("float32")

    # Oversample when filtering so enough hits survive the category filter
    search_k = k * 10 if category_filter != "All" else k
    distances, indices = index.search(vector, search_k)

    # Filter by category and format results
    results = []
    for idx, score in zip(indices[0], distances[0]):
        item = df.iloc[idx]
        if category_filter == "All" or item['masterCategory'] == category_filter:
            results.append({
                'image': item['image_path'],
                'caption': item['productDisplayName'],
                'score': float(score)
            })
        if len(results) >= k:
            break
    return results
```

When category filters are applied, the system uses 10x oversampling to ensure sufficient results:
```
Without Filter: Retrieve top 5 → Return 5 results
With Filter:    Retrieve top 50 → Filter by category → Return top 5 matches
```
| Metric | Value | Details |
|---|---|---|
| Dataset Size | 44,419 products | Enterprise-scale catalog |
| Vector Dimension | 512 | Balanced accuracy/speed |
| Query Latency | < 0.1 seconds | Real-time performance |
| Index Type | Flat (Exact) | 100% accuracy on retrieval |
| Search Modalities | 2 (Text + Image) | Multi-modal queries |
| Categories | 7 types | Apparel, Footwear, Accessories, etc. |
| Embedding Model | CLIP ViT-B/32 | ~151M parameters, trained on 400M image-text pairs |
| GPU Memory | ~2GB | Inference requirements |
Text-to-Image Search:
- Semantic color understanding (e.g., "red" retrieves red items)
- Style matching (e.g., "formal shirt" finds dress shirts)
- Cross-category queries work effectively

Image-to-Image Search:
- Visual similarity matching (e.g., watches find similar watches)
- Pattern recognition across product types
- Robust to image quality variations
- Hybrid Search
  - Combine vector similarity with BM25 keyword search
  - Implement weighted fusion of semantic and lexical matching (see the sketch after this list)
- Re-Ranking
  - Add a cross-encoder stage for top-K refinement
  - Improve result ordering accuracy
- Fine-Tuning
  - Train CLIP on fashion-specific data
  - Better understanding of domain terminology
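One possible shape for that weighted fusion (a sketch only: the α weight and the min-max normalization are assumptions, not a committed design):

```python
# Hypothetical weighted fusion of semantic (CLIP) and lexical (BM25) scores.
import numpy as np

def fuse_scores(clip_scores: np.ndarray, bm25_scores: np.ndarray,
                alpha: float = 0.7) -> np.ndarray:
    """Blend two score arrays over the same candidate set; higher = better."""
    def minmax(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    # alpha weights the semantic signal; (1 - alpha) the keyword signal
    return alpha * minmax(clip_scores) + (1 - alpha) * minmax(bm25_scores)
```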