
@sahilx13
Collaborator

🔍 Add Text and Image Embeddings with Vector Database Search

Summary

This PR adds semantic search capabilities to RescueBox by integrating text and image embeddings with a PostgreSQL + pgvector database backend. Users can now embed text documents and images, then perform fast similarity searches using natural language queries.

🎯 Key Features

1. Text Embeddings Plugin (src/text-embeddings/)

  • 📄 Embedding Generation: Process text files (TXT, MD, LOG) using SentenceTransformer models
  • 🔤 Smart Chunking: Support for LangChain and LlamaIndex text splitters with configurable chunk size/overlap
  • 🔍 Semantic Search: Find similar documents using natural language queries
  • Multiple Models: Support for all-MiniLM-L6-v2, all-mpnet-base-v2, and multi-qa-MiniLM-L6-cos-v1
  • 📊 384-dim embeddings stored in PostgreSQL with automatic indexing

2. Image Embeddings Plugin (src/image-embeddings/)

  • 🖼️ Image Embedding Generation: Process images (JPG, PNG, BMP, GIF, TIFF, WebP) using OpenAI's CLIP models
  • 🔀 Cross-Modal Search: Search for images using text queries (text-to-image retrieval)
  • 🎨 Multiple Formats: Support for all common image formats
  • 🤖 CLIP Models: Choose between ViT-B/32 (512-dim, faster) and ViT-L/14 (768-dim, more accurate)
  • 🌉 Shared Embedding Space: Text and image embeddings live in the same vector space for cross-modal matching

3. Database Integration

  • 🐘 PostgreSQL + pgvector: Vector similarity search with native database support
  • 🚀 HNSW Indexing: High-performance approximate nearest neighbor search
  • 📈 Scalable: Efficiently handles millions of embeddings
  • 🔒 Production-Ready: Connection pooling, health checks, and proper error handling

📂 New Files

Plugins

src/text-embeddings/          # Text embedding & search plugin
  ├── text_embeddings/main.py  # /embed_text and /search_text endpoints
  ├── README.md                # Comprehensive documentation
  └── tests/                   # Unit tests for both endpoints

src/image-embeddings/          # Image embedding & search plugin
  ├── image_embeddings/main.py # /embed_images and /search_images endpoints
  ├── README.md                # Comprehensive documentation
  └── tests/                   # Unit tests for both endpoints

Database & Infrastructure

src/rb-api/rb/api/database.py  # SQLModel schemas + pgvector integration
check_db.py                     # Database inspection utility
.devcontainer/
  ├── docker-compose.yml        # PostgreSQL + pgvector service
  └── init-pgvector.sql         # pgvector extension initialization
experiments/texts/               # Sample text files for testing

🏗️ Database Schema

-- Text embeddings table
CREATE TABLE text_embeddings (
    id SERIAL PRIMARY KEY,
    path VARCHAR NOT NULL,
    embedding VECTOR(384) NOT NULL
);
CREATE INDEX text_vector_idx ON text_embeddings 
  USING hnsw (embedding vector_l2_ops);

-- Image embeddings table  
CREATE TABLE image_embeddings (
    id SERIAL PRIMARY KEY,
    path VARCHAR NOT NULL,
    embedding VECTOR(512) NOT NULL
);
CREATE INDEX image_vector_idx ON image_embeddings 
  USING hnsw (embedding vector_l2_ops);
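
On the Python side, these tables are declared via SQLModel with pgvector's vector column type (src/rb-api/rb/api/database.py). A minimal sketch of what that can look like, assuming illustrative class names and a hypothetical local connection URL (the real connection settings come from the devcontainer):

# Illustrative sketch only -- the actual models live in src/rb-api/rb/api/database.py
from pgvector.sqlalchemy import Vector
from sqlalchemy import Column
from sqlmodel import Field, SQLModel, create_engine

class TextEmbedding(SQLModel, table=True):
    __tablename__ = "text_embeddings"
    id: int | None = Field(default=None, primary_key=True)
    path: str
    embedding: list[float] = Field(sa_column=Column(Vector(384)))  # 384-dim pgvector column

class ImageEmbedding(SQLModel, table=True):
    __tablename__ = "image_embeddings"
    id: int | None = Field(default=None, primary_key=True)
    path: str
    embedding: list[float] = Field(sa_column=Column(Vector(512)))  # 512-dim for ViT-B/32

# pool_pre_ping validates pooled connections before reuse (the health check mentioned above)
engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5433/rescuebox",  # hypothetical URL
    pool_pre_ping=True,
)
SQLModel.metadata.create_all(engine)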

📦 New Dependencies

# Core ML/AI
torch = "2.7.1"
transformers = "^4.40.0"
sentence-transformers = "3.0.1"

# Text Processing
langchain-text-splitters = "0.2.2"

# Database
sqlmodel = "^0.0.27"
psycopg2-binary = "^2.9.11"
pgvector = "^0.4.1"

🚀 Usage Examples

Text Embeddings Workflow

# 1. Embed documents
rescuebox text_embeddings /embed_text ./documents "all-MiniLM-L6-v2,langchain,800,100"

# 2. Search for similar documents
rescuebox text_embeddings /search_text "machine learning algorithms" "all-MiniLM-L6-v2,5"

Output:

{
  "query": "machine learning algorithms",
  "results": [
    {"path": "/docs/ml_intro.txt", "similarity": 0.8932},
    {"path": "/docs/neural_nets.md", "similarity": 0.8654}
  ]
}

Image Search Workflow

# 1. Embed images
rescuebox image_embeddings /embed_images ./photos "openai/clip-vit-base-patch32"

# 2. Search images with text
rescuebox image_embeddings /search_images "sunset over mountains" "openai/clip-vit-base-patch32,10"

Output:

{
  "query": "sunset over mountains",
  "results": [
    {"path": "/photos/mountain_sunset.jpg", "similarity": 0.9123},
    {"path": "/photos/alpine_evening.png", "similarity": 0.8876}
  ]
}

🏎️ Performance Optimizations

pgvector Integration

  • Native Vector Operations: Uses pgvector's optimized <=> cosine distance operator
  • HNSW Indexing: Approximate nearest neighbor search for sub-millisecond queries
  • Batch Processing: Commits embeddings in transactions for efficiency
  • Connection Pooling: Reuses database connections with pool_pre_ping

Search Performance

-- Optimized vector similarity search
SELECT path, 1 - (embedding <=> query) as similarity
FROM text_embeddings
ORDER BY embedding <=> query
LIMIT k
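
From Python, the same query can be issued with psycopg2 and the pgvector adapter, which converts a NumPy query vector to the VECTOR type. A sketch under the same hypothetical connection settings as above (not the plugin's exact code):

# Sketch of the search path; connection settings are illustrative
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(host="localhost", port=5433, dbname="rescuebox",
                        user="postgres", password="postgres")
register_vector(conn)  # registers the VECTOR type adapter on this connection

def search_text(query_embedding: np.ndarray, k: int = 5):
    """Return the k most similar stored documents for a query embedding."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT path, 1 - (embedding <=> %s) AS similarity
            FROM text_embeddings
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_embedding, query_embedding, k),
        )
        return cur.fetchall()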

🛠️ Infrastructure Changes

DevContainer Updates

  • Added PostgreSQL 17 service with pgvector extension
  • Automatic pgvector initialization on container startup
  • Port mapping: 5433:5432 (to avoid conflicts with existing postgres instances)
  • Health checks for database readiness
  • Volume mounting for data persistence

Database Utilities

  • check_db.py: Inspect database tables, row counts, indexes, and extensions
  • Useful for debugging and verification during development
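
For a sense of what that inspection covers, here is a rough sketch of an equivalent pass over the system catalogs (this is not the actual check_db.py):

# Rough equivalent of a database inspection pass (not the actual check_db.py)
import psycopg2

with psycopg2.connect(host="localhost", port=5433, dbname="rescuebox",
                      user="postgres", password="postgres") as conn:
    with conn.cursor() as cur:
        # Installed extensions -- pgvector shows up as 'vector'
        cur.execute("SELECT extname, extversion FROM pg_extension")
        print("extensions:", cur.fetchall())

        # User tables and their row counts
        cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public'")
        for (table,) in cur.fetchall():
            cur.execute(f'SELECT count(*) FROM "{table}"')
            print(f"{table}: {cur.fetchone()[0]} rows")

        # Indexes, including the HNSW indexes on the embedding tables
        cur.execute("SELECT tablename, indexname FROM pg_indexes WHERE schemaname = 'public'")
        print("indexes:", cur.fetchall())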

🧪 Testing

Both plugins include comprehensive test suites:

  • Schema validation tests
  • Input/output type checking
  • Search functionality tests
  • Edge case handling

Run tests:

pytest src/text-embeddings/tests/
pytest src/image-embeddings/tests/
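
As one illustration of the kind of check involved, a test in this spirit could assert the embedding dimensionality and normalization that the schema relies on (a hypothetical example, not one of the actual test cases):

# Hypothetical example in the spirit of the plugin test suites (not the actual tests)
import numpy as np
from sentence_transformers import SentenceTransformer

def test_text_embedding_shape_and_norm():
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(["machine learning algorithms"], normalize_embeddings=True)
    assert vecs.shape == (1, 384)                    # matches VECTOR(384) in the schema
    assert np.isclose(np.linalg.norm(vecs[0]), 1.0)  # unit norm, so cosine similarity is valid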

🔧 Technical Details

Text Embeddings

  • Models: SentenceTransformer models from Hugging Face
  • Chunking: Configurable strategies (LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceSplitter)
  • Normalization: Embeddings are normalized for cosine similarity
  • Aggregation: Chunk embeddings averaged to create file-level embeddings
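
Put together, the per-file text pipeline is roughly the following (a sketch assuming the LangChain splitter and the mean aggregation described above; names and defaults are illustrative):

# Sketch of the per-file text pipeline (illustrative, not the plugin source)
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

def embed_file(text: str) -> np.ndarray:
    chunks = splitter.split_text(text)
    # One normalized 384-dim embedding per chunk
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    # Average chunk embeddings into a single file-level vector,
    # then re-normalize so cosine similarity stays meaningful
    file_vec = chunk_vecs.mean(axis=0)
    return file_vec / np.linalg.norm(file_vec)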

Image Embeddings

  • Models: OpenAI CLIP (Contrastive Language-Image Pre-Training)
  • Cross-Modal: Text and image embeddings share the same vector space
  • Zero-Shot: Works on arbitrary text queries without fine-tuning
  • Normalization: Embeddings normalized for cosine similarity
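
Concretely, cross-modal matching comes down to comparing normalized CLIP text and image features; a sketch using the Hugging Face transformers CLIP classes (illustrative, not the plugin's exact code):

# Sketch of CLIP text-to-image matching (illustrative, not the plugin source)
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path), return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # 512-dim unit vector for ViT-B/32

def text_embedding(query: str) -> torch.Tensor:
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # same 512-dim space as the images

# Cosine similarity between a text query and one stored image
score = (text_embedding("sunset over mountains") @ image_embedding("photo.jpg").T).item()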

📊 Use Cases

Digital Forensics

  • Search evidence documents by semantic meaning
  • Find images matching witness descriptions
  • Discover related files without manual tagging

Content Discovery

  • "Find all documents about network security"
  • "Show me images of outdoor crime scenes"
  • "Locate files similar to this police report"

Dataset Exploration

  • Quickly understand large document/image collections
  • Find duplicates and similar content
  • Organize unstructured data

🔜 Future Enhancements

  • Hybrid search (combining vector + keyword search)
  • Image-to-image search endpoint
  • Multi-modal search (combining text + image queries)
  • GPU acceleration for embedding generation
  • Batch import APIs for large datasets
  • Export/import functionality for embeddings

🧹 Clean Up

  • Resolved merge conflicts in poetry.lock from branch 83-backend-db-with-text-embeddings
  • Regenerated lock file with Poetry 2.2.1
  • Fixed devcontainer port conflicts

📝 Documentation

Both plugins include comprehensive README files with:

  • Feature descriptions
  • Usage examples
  • Model information
  • Performance tips
  • Database schema details
  • Troubleshooting guides

Branch: backend-db-with-text-embeddings-devcontainer
Related Issues: Closes #83

nb950 and others added 30 commits September 30, 2025 11:53
…unking params; include hidden defaults in payload
prasannals and others added 8 commits October 24, 2025 14:43
- Create init-pgvector.sql to automatically enable pgvector extension
- Mount initialization script in docker-compose.yml
- Ensures pgvector is available when PostgreSQL container starts

This fixes the missing vector extension dependency for text embeddings.
- Add check_db.py for easy PostgreSQL database inspection
- Shows tables, row counts, indexes, and extensions
- Useful for debugging and verifying database state
…ings-devcontainer

Backend db with text embeddings devcontainer
@nb950
Collaborator

nb950 commented Oct 27, 2025

Can we add search/find image from facematch? A prompt like "find this specific face (input image) in a set of images" should also work?

See get_embeddings() in https://github.com/UMass-Rescue/RescueBox/blob/main/src/face-detection-recognition/face_detection_recognition/utils/get_batch_embeddings.py#L14 and save the results to the db.

