
@sahilx13
Collaborator

🔍 Add Text and Image Embeddings with Vector Database Search

Summary

This PR adds semantic search capabilities to RescueBox by integrating text and image embeddings with a PostgreSQL + pgvector database backend. Users can now embed text documents and images, then perform fast similarity searches using natural language queries.

🎯 Key Features

1. Text Embeddings Plugin (src/text-embeddings/)

  • 📄 Embedding Generation: Process text files (TXT, MD, LOG) using SentenceTransformer models
  • 🔤 Smart Chunking: Support for LangChain and LlamaIndex text splitters with configurable chunk size/overlap
  • 🔍 Semantic Search: Find similar documents using natural language queries
  • Multiple Models: Support for all-MiniLM-L6-v2, all-mpnet-base-v2, and multi-qa-MiniLM-L6-cos-v1
  • 📊 384-dim embeddings stored in PostgreSQL with automatic indexing

2. Image Embeddings Plugin (src/image-embeddings/)

  • 🖼️ Image Embedding Generation: Process images (JPG, PNG, BMP, GIF, TIFF, WebP) using OpenAI's CLIP models
  • 🔀 Cross-Modal Search: Search for images using text queries (text-to-image retrieval)
  • 🎨 Multiple Formats: Support for all common image formats
  • 🤖 CLIP Models: Choose between ViT-B/32 (512-dim, faster) and ViT-L/14 (768-dim, more accurate)
  • 🌉 Shared Embedding Space: Text and image embeddings live in the same vector space for cross-modal matching

3. Database Integration

  • 🐘 PostgreSQL + pgvector: Vector similarity search with native database support
  • 🚀 HNSW Indexing: High-performance approximate nearest neighbor search
  • 📈 Scalable: Efficiently handles millions of embeddings
  • 🔒 Production-Ready: Connection pooling, health checks, and proper error handling

📂 New Files

Plugins

src/text-embeddings/          # Text embedding & search plugin
  ├── text_embeddings/main.py  # /embed_text and /search_text endpoints
  ├── README.md                # Comprehensive documentation
  └── tests/                   # Unit tests for both endpoints

src/image-embeddings/          # Image embedding & search plugin
  ├── image_embeddings/main.py # /embed_images and /search_images endpoints
  ├── README.md                # Comprehensive documentation
  └── tests/                   # Unit tests for both endpoints

Database & Infrastructure

src/rb-api/rb/api/database.py  # SQLModel schemas + pgvector integration
check_db.py                     # Database inspection utility
.devcontainer/
  ├── docker-compose.yml        # PostgreSQL + pgvector service
  └── init-pgvector.sql         # pgvector extension initialization
experiments/texts/               # Sample text files for testing

🏗️ Database Schema

-- Text embeddings table
CREATE TABLE text_embeddings (
    id SERIAL PRIMARY KEY,
    path VARCHAR NOT NULL,
    embedding VECTOR(384) NOT NULL
);
CREATE INDEX text_vector_idx ON text_embeddings 
  USING hnsw (embedding vector_l2_ops);

-- Image embeddings table  
CREATE TABLE image_embeddings (
    id SERIAL PRIMARY KEY,
    path VARCHAR NOT NULL,
    embedding VECTOR(512) NOT NULL
);
CREATE INDEX image_vector_idx ON image_embeddings 
  USING hnsw (embedding vector_l2_ops);
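
On the Python side, these tables are declared via SQLModel with pgvector's vector column type (src/rb-api/rb/api/database.py). A minimal sketch of what that can look like, assuming illustrative class names and a hypothetical local connection URL (the real connection settings come from the devcontainer):

# Illustrative sketch only -- the actual models live in src/rb-api/rb/api/database.py
from pgvector.sqlalchemy import Vector
from sqlalchemy import Column
from sqlmodel import Field, SQLModel, create_engine

class TextEmbedding(SQLModel, table=True):
    __tablename__ = "text_embeddings"
    id: int | None = Field(default=None, primary_key=True)
    path: str
    embedding: list[float] = Field(sa_column=Column(Vector(384)))  # 384-dim pgvector column

class ImageEmbedding(SQLModel, table=True):
    __tablename__ = "image_embeddings"
    id: int | None = Field(default=None, primary_key=True)
    path: str
    embedding: list[float] = Field(sa_column=Column(Vector(512)))  # 512-dim for ViT-B/32

# pool_pre_ping validates pooled connections before reuse (the health check mentioned above)
engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5433/rescuebox",  # hypothetical URL
    pool_pre_ping=True,
)
SQLModel.metadata.create_all(engine)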

📦 New Dependencies

# Core ML/AI
torch = "2.7.1"
transformers = "^4.40.0"
sentence-transformers = "3.0.1"

# Text Processing
langchain-text-splitters = "0.2.2"

# Database
sqlmodel = "^0.0.27"
psycopg2-binary = "^2.9.11"
pgvector = "^0.4.1"

🚀 Usage Examples

Text Embeddings Workflow

# 1. Embed documents
rescuebox text_embeddings /embed_text ./documents "all-MiniLM-L6-v2,langchain,800,100"

# 2. Search for similar documents
rescuebox text_embeddings /search_text "machine learning algorithms" "all-MiniLM-L6-v2,5"

Output:

{
  "query": "machine learning algorithms",
  "results": [
    {"path": "/docs/ml_intro.txt", "similarity": 0.8932},
    {"path": "/docs/neural_nets.md", "similarity": 0.8654}
  ]
}

Image Search Workflow

# 1. Embed images
rescuebox image_embeddings /embed_images ./photos "openai/clip-vit-base-patch32"

# 2. Search images with text
rescuebox image_embeddings /search_images "sunset over mountains" "openai/clip-vit-base-patch32,10"

Output:

{
  "query": "sunset over mountains",
  "results": [
    {"path": "/photos/mountain_sunset.jpg", "similarity": 0.9123},
    {"path": "/photos/alpine_evening.png", "similarity": 0.8876}
  ]
}

🏎️ Performance Optimizations

pgvector Integration

  • Native Vector Operations: Uses pgvector's optimized <=> cosine distance operator
  • HNSW Indexing: Approximate nearest neighbor search for sub-millisecond queries
  • Batch Processing: Commits embeddings in transactions for efficiency
  • Connection Pooling: Reuses database connections with pool_pre_ping

Search Performance

-- Optimized vector similarity search
SELECT path, 1 - (embedding <=> query) as similarity
FROM text_embeddings
ORDER BY embedding <=> query
LIMIT k
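
From Python, the same query can be issued with psycopg2 and the pgvector adapter, which converts a NumPy query vector to the VECTOR type. A sketch under the same hypothetical connection settings as above (not the plugin's exact code):

# Sketch of the search path; connection settings are illustrative
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(host="localhost", port=5433, dbname="rescuebox",
                        user="postgres", password="postgres")
register_vector(conn)  # registers the VECTOR type adapter on this connection

def search_text(query_embedding: np.ndarray, k: int = 5):
    """Return the k most similar stored documents for a query embedding."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT path, 1 - (embedding <=> %s) AS similarity
            FROM text_embeddings
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_embedding, query_embedding, k),
        )
        return cur.fetchall()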

🛠️ Infrastructure Changes

DevContainer Updates

  • Added PostgreSQL 17 service with pgvector extension
  • Automatic pgvector initialization on container startup
  • Port mapping: 5433:5432 (to avoid conflicts with existing postgres instances)
  • Health checks for database readiness
  • Volume mounting for data persistence

Database Utilities

  • check_db.py: Inspect database tables, row counts, indexes, and extensions
  • Useful for debugging and verification during development
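
For a sense of what that inspection covers, here is a rough sketch of an equivalent pass over the system catalogs (this is not the actual check_db.py):

# Rough equivalent of a database inspection pass (not the actual check_db.py)
import psycopg2

with psycopg2.connect(host="localhost", port=5433, dbname="rescuebox",
                      user="postgres", password="postgres") as conn:
    with conn.cursor() as cur:
        # Installed extensions -- pgvector shows up as 'vector'
        cur.execute("SELECT extname, extversion FROM pg_extension")
        print("extensions:", cur.fetchall())

        # User tables and their row counts
        cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public'")
        for (table,) in cur.fetchall():
            cur.execute(f'SELECT count(*) FROM "{table}"')
            print(f"{table}: {cur.fetchone()[0]} rows")

        # Indexes, including the HNSW indexes on the embedding tables
        cur.execute("SELECT tablename, indexname FROM pg_indexes WHERE schemaname = 'public'")
        print("indexes:", cur.fetchall())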

🧪 Testing

Both plugins include comprehensive test suites:

  • Schema validation tests
  • Input/output type checking
  • Search functionality tests
  • Edge case handling

Run tests:

pytest src/text-embeddings/tests/
pytest src/image-embeddings/tests/
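
As one illustration of the kind of check involved, a test in this spirit could assert the embedding dimensionality and normalization that the schema relies on (a hypothetical example, not one of the actual test cases):

# Hypothetical example in the spirit of the plugin test suites (not the actual tests)
import numpy as np
from sentence_transformers import SentenceTransformer

def test_text_embedding_shape_and_norm():
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(["machine learning algorithms"], normalize_embeddings=True)
    assert vecs.shape == (1, 384)                    # matches VECTOR(384) in the schema
    assert np.isclose(np.linalg.norm(vecs[0]), 1.0)  # unit norm, so cosine similarity is valid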

🔧 Technical Details

Text Embeddings

  • Models: SentenceTransformer models from Hugging Face
  • Chunking: Configurable strategies (LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceSplitter)
  • Normalization: Embeddings are normalized for cosine similarity
  • Aggregation: Chunk embeddings averaged to create file-level embeddings
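
Put together, the per-file text pipeline is roughly the following (a sketch assuming the LangChain splitter and the mean aggregation described above; names and defaults are illustrative):

# Sketch of the per-file text pipeline (illustrative, not the plugin source)
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

def embed_file(text: str) -> np.ndarray:
    chunks = splitter.split_text(text)
    # One normalized 384-dim embedding per chunk
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    # Average chunk embeddings into a single file-level vector,
    # then re-normalize so cosine similarity stays meaningful
    file_vec = chunk_vecs.mean(axis=0)
    return file_vec / np.linalg.norm(file_vec)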

Image Embeddings

  • Models: OpenAI CLIP (Contrastive Language-Image Pre-Training)
  • Cross-Modal: Text and image embeddings share the same vector space
  • Zero-Shot: Works on arbitrary text queries without fine-tuning
  • Normalization: Embeddings normalized for cosine similarity
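
Concretely, cross-modal matching comes down to comparing normalized CLIP text and image features; a sketch using the Hugging Face transformers CLIP classes (illustrative, not the plugin's exact code):

# Sketch of CLIP text-to-image matching (illustrative, not the plugin source)
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path), return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # 512-dim unit vector for ViT-B/32

def text_embedding(query: str) -> torch.Tensor:
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # same 512-dim space as the images

# Cosine similarity between a text query and one stored image
score = (text_embedding("sunset over mountains") @ image_embedding("photo.jpg").T).item()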

📊 Use Cases

Digital Forensics

  • Search evidence documents by semantic meaning
  • Find images matching witness descriptions
  • Discover related files without manual tagging

Content Discovery

  • "Find all documents about network security"
  • "Show me images of outdoor crime scenes"
  • "Locate files similar to this police report"

Dataset Exploration

  • Quickly understand large document/image collections
  • Find duplicates and similar content
  • Organize unstructured data

🔜 Future Enhancements

  • Hybrid search (combining vector + keyword search)
  • Image-to-image search endpoint
  • Multi-modal search (combining text + image queries)
  • GPU acceleration for embedding generation
  • Batch import APIs for large datasets
  • Export/import functionality for embeddings

🧹 Clean Up

  • Resolved merge conflicts in poetry.lock from branch 83-backend-db-with-text-embeddings
  • Regenerated lock file with Poetry 2.2.1
  • Fixed devcontainer port conflicts

📝 Documentation

Both plugins include comprehensive README files with:

  • Feature descriptions
  • Usage examples
  • Model information
  • Performance tips
  • Database schema details
  • Troubleshooting guides

Branch: backend-db-with-text-embeddings-devcontainer
Related Issues: Closes #83

nb950 and others added 30 commits September 30, 2025 11:53
…unking params; include hidden defaults in payload
prasannals and others added 8 commits October 24, 2025 14:43
- Create init-pgvector.sql to automatically enable pgvector extension
- Mount initialization script in docker-compose.yml
- Ensures pgvector is available when PostgreSQL container starts

This fixes the missing vector extension dependency for text embeddings.
- Add check_db.py for easy PostgreSQL database inspection
- Shows tables, row counts, indexes, and extensions
- Useful for debugging and verifying database state
…ings-devcontainer

Backend db with text embeddings devcontainer
@nb950
Collaborator

nb950 commented Oct 27, 2025

Can we add search/find image from facematch? A prompt like "find this specific face (input image) in a set of images" should also work?

See get_embeddings() in https://github.com/UMass-Rescue/RescueBox/blob/main/src/face-detection-recognition/face_detection_recognition/utils/get_batch_embeddings.py#L14 and save the results to the db.

