VLLM: Visual Search Engine using COCO Dataset and CLIP

This project implements a Visual Language Model (VLM)-powered multi-modal search engine using OpenAI’s CLIP and the COCO dataset. It lets users retrieve semantically relevant images from either a text prompt (e.g., “a panda”) or an example image. The system uses CLIP to embed both images and text into a shared latent space and leverages FAISS for efficient similarity-based indexing and retrieval. Results are presented through a clean Streamlit web interface, offering an interactive and scalable solution for content-based image search.
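The core idea above — embedding images and text into one space and ranking by similarity — can be sketched without loading the actual CLIP model. The snippet below uses random stand-in vectors in place of CLIP embeddings (CLIP ViT-B/32 produces 512-dimensional vectors) and plain NumPy in place of FAISS; on unit-normalized vectors, the inner-product ranking below is exactly what a FAISS inner-product index computes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP embeddings; in the real app these come from the
# CLIP image encoder (for the database) and image/text encoder (for the query).
image_embeddings = rng.normal(size=(1000, 512)).astype("float32")
# Query: a slightly perturbed copy of image 42, simulating a near-duplicate.
query_embedding = image_embeddings[42] + 0.01 * rng.normal(size=512)

# Normalize so that inner product equals cosine similarity,
# matching a FAISS inner-product index over unit vectors.
def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

db = normalize(image_embeddings)
q = normalize(query_embedding)

# Rank all images by similarity to the query and keep the top K = 6.
scores = db @ q
top_k = np.argsort(-scores)[:6]
print(top_k[0])  # image 42 ranks first, since the query is a near-copy of it
```

In the full system the random vectors are replaced by real CLIP embeddings and the brute-force ranking by a FAISS index, but the retrieval logic is the same.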

Features

  • Text-based and image-based image search
  • Local visual search over the COCO dataset using FAISS
  • Web image search via the Unsplash API
  • Semantic similarity scoring with CLIP
  • Built-in evaluation metrics (Precision@K, Recall@K, mAP)
  • Responsive, clean Streamlit interface

Technology Stack

User Interface (Streamlit)

(screenshot: Image_Interface)

Application with Accuracy

(screenshot: Image_Query)

Demo

Demo Video

Evaluation Metrics

  • Precision@K – fraction of the top-K results that are relevant

  • Recall@K – fraction of all relevant images that appear in the top-K

  • mAP – mean Average Precision over the returned set

where K is the number of retrieved images (default K = 6).
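A minimal sketch of how these metrics can be computed, using hypothetical image ids and a hand-picked ground-truth set (the actual app evaluates against COCO annotations):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for r in retrieved[:k] if r in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k."""
    return sum(1 for r in retrieved[:k] if r in relevant) / len(relevant)

def average_precision(retrieved, relevant, k):
    """Mean of precision values at each rank where a relevant id appears."""
    hits, score = 0, 0.0
    for i, r in enumerate(retrieved[:k], start=1):
        if r in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)

retrieved = [3, 7, 1, 9, 4, 8]   # ids returned by the search engine (K = 6)
relevant = {3, 1, 4, 5}          # ground-truth relevant ids
print(precision_at_k(retrieved, relevant, 6))  # 3 of 6 hits -> 0.5
print(recall_at_k(retrieved, relevant, 6))     # 3 of 4 found -> 0.75
```

Averaging the per-query `average_precision` values over all evaluation queries gives the mAP reported by the app.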

Installation

1. Clone the repository

```shell
git clone https://github.com/your-repo/clip-visual-search.git
cd clip-visual-search
```

2. Install dependencies

Make sure you have Python 3.10+ installed, then run:

```shell
pip install -r requirements.txt
```

3. Prepare the local dataset

Download the following files from the COCO dataset:
  • train2017 images (around 18 GB)
  • captions_train2017.json (annotations for the train set)

4. Generate the FAISS index and captions

Encode the COCO images with CLIP and save the two files the app loads:

  • faiss_clip_index.idx – FAISS index of the CLIP image embeddings
  • captions.npy – NumPy array of the corresponding image captions

5. Set your Unsplash API key

Go to Unsplash Developers, create an app, and get your Access Key.

Replace the following line in vllm.py with your key:

```python
UNSPLASH_ACCESS_KEY = "YOUR_UNSPLASH_ACCESS_KEY"
```
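For reference, the key is sent as the `client_id` query parameter of the Unsplash photo-search endpoint (per the public Unsplash API documentation). A small sketch that only builds the request URL, without making a network call — the actual request code in vllm.py may differ:

```python
from urllib.parse import urlencode

UNSPLASH_ACCESS_KEY = "YOUR_UNSPLASH_ACCESS_KEY"  # from your Unsplash app

def build_search_url(query, per_page=6):
    """Build the Unsplash photo-search request URL (no network call here)."""
    params = urlencode({
        "query": query,
        "per_page": per_page,              # K results, matching the app default
        "client_id": UNSPLASH_ACCESS_KEY,  # authenticates the request
    })
    return f"https://api.unsplash.com/search/photos?{params}"

url = build_search_url("a panda")
print(url)
```

In the app, fetching this URL (e.g. with `requests.get`) returns JSON whose `results` entries contain the image URLs to display.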

You're now ready to launch the app with:

```shell
streamlit run vllm.py
```

📚 References

| # | Title / Source | Link |
|---|----------------|------|
| [1] | Learning Transferable Visual Models From Natural Language Supervision | https://arxiv.org/pdf/2103.00020 |
| [2] | FAISS: Facebook AI Similarity Search | https://arxiv.org/pdf/2401.08281 |
| [3] | Streamlit Documentation | https://docs.streamlit.io |
| [4] | Unsplash API | https://unsplash.com/developers |
| [5] | scikit-learn: Precision, Recall, mAP Metrics | https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics |
| [6] | COCO: Common Objects in Context Dataset | https://cocodataset.org/#home |
