πŸ” Semantic Search Engine

A RAG (Retrieval Augmented Generation) application built with LangChain, ChromaDB, and Streamlit. Upload PDF documents and ask natural language questions to retrieve contextually relevant answers powered by OpenAI.


## How Search Works

When you ask a question, the system follows this RAG (Retrieval-Augmented Generation) pipeline:

| Step | Component | Description |
|------|-----------|-------------|
| 1. Chunk | Document Processor | PDF split into searchable text pieces |
| 2. Embed | OpenAI | Creates vector embeddings (text-embedding-3-large) |
| 3. Hybrid Search | BM25 + Semantic | Combines keyword and meaning-based search |
| 4. Rerank | Jina (local) | Cross-encoder refines result relevance |
| 5. Answer | GPT-4o-mini | Generates response based only on your documents |

Note: The LLM answers only from your uploaded documents, not general knowledge. Search on the home page covers all documents and collections.

## Features

### Core Features

- PDF Document Processing: Upload and parse PDF files with automatic text extraction
- Semantic Chunking: Intelligent text splitting with configurable chunk sizes and overlap
- Vector Embeddings: High-quality embeddings using OpenAI text-embedding-3-large
- Persistent Vector Store: ChromaDB for efficient similarity search with disk persistence
- Natural Language Queries: Ask questions in plain language about your documents
- Contextual Answers: GPT-4o-mini generates answers based solely on document content
- Real-time Streaming: Token-by-token answer generation for responsive UX
- Document Persistence: Uploaded documents persist across app restarts

### Navigation & UI

- Sidebar Navigation: Quick access to all pages (Search, How It Works, Collections)
- Documents Panel: View document statistics (chunks, size, overlap) and list all uploaded documents on the home page
- Consistent Branding: Unified styling and navigation across all pages
- Responsive Layout: Works well on different screen sizes

### Collection Management

- Document Collections: Organize documents into searchable collections
- Scoped Search: Search within specific collections or across all documents
- Collection Stats: Track document counts and chunk counts per collection
- Per-Collection Settings: Configure chunk size and overlap per collection
- Cascade Deletion: Deleting a collection removes all associated documents and chunks
- Duplicate Detection: Prevents uploading the same document twice to a collection
- Search History: Per-collection search results persist during session navigation

### Advanced Retrieval

- Hybrid Search: Combine semantic (vector) and BM25 (keyword) search with RRF fusion
- Retrieval Presets: Pre-configured profiles accessible from the sidebar dropdown:
  - High Precision: Fewer, highly relevant results (k=3, alpha=0.7, reranking on)
  - Balanced: Good mix of precision and recall (k=5, alpha=0.5, reranking on)
  - High Recall: More results, broader coverage (k=10, alpha=0.3, reranking off)
- Re-ranking: Cross-encoder re-ranking with a configurable provider:
  - Jina (local): No API key and no network latency; requires sentence-transformers
  - Cohere (cloud): Fast API, excellent quality; requires COHERE_API_KEY
  - Auto mode (default): Tries Jina (local) first, falls back to Cohere (cloud)
  - Override: Set provider: "cohere" in config.yaml to force the cloud API
- Score Visibility: View semantic, BM25, and rerank scores for each result

### User Experience

- Conversation History: Context-aware follow-up questions within sessions
- A/B Testing Framework: Compare all four retrieval methods (semantic, BM25, hybrid, hybrid+rerank) with a single click
- Duplicate Detection: Automatic check before uploading already-indexed PDFs
- Context Transparency: View exactly which document chunks were used for answers

### Developer Experience

- Robust Error Handling: Automatic retry logic with exponential backoff for API resilience
- Structured Logging: Comprehensive logging for debugging and monitoring
- YAML Configuration: Easily customizable settings without code changes
- Modular Architecture: Clean separation of concerns for maintainability
- Comprehensive Test Suite: 290+ unit and integration tests

## Screenshots

Images are in the screenshots/ directory.

### Search Page

Upload PDFs and view your document library with statistics.

### Semantic Search with Answers

Ask questions and get AI-generated answers with source context.

### Context Chunks with Scores

View retrieved chunks with semantic and rerank scores.

### Collection Management

Organize documents into searchable collections.

### Documents in a Collection

View and manage documents within a collection.

### Collection Search

Search within a specific collection for focused results.

### How It Works - Interactive Documentation

The app includes a comprehensive "How It Works" page with interactive guides:

| Guide | Description |
|-------|-------------|
| Retrieval Methods | Semantic, BM25, and hybrid search explained |
| Precision vs Recall | Trade-offs and preset configurations |
| Collections | How to organize documents effectively |
| A/B Testing | Compare retrieval methods empirically |
| Configuration | All settings explained with examples |
| Conversation History | Context-aware follow-up questions |

πŸ—οΈ Architecture

Project Structure

semantic-search/
β”œβ”€β”€ app.py                      # Streamlit UI application
β”œβ”€β”€ config.yaml                 # Centralized configuration
β”œβ”€β”€ config_loader.py            # Configuration management
β”œβ”€β”€ core/                       # Core business logic
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ document_processor.py   # PDF loading and chunking
β”‚   β”œβ”€β”€ vector_store.py         # ChromaDB management
β”‚   β”œβ”€β”€ qa_chain.py             # Question answering pipeline
β”‚   β”œβ”€β”€ hybrid_retriever.py     # Hybrid search (BM25 + Semantic)
β”‚   β”œβ”€β”€ bm25_retriever.py       # BM25 keyword search
β”‚   β”œβ”€β”€ reranker.py             # Cohere/Jina re-ranking
β”‚   β”œβ”€β”€ conversation.py         # Conversation history management
β”‚   β”œβ”€β”€ ab_testing.py           # A/B testing framework
β”‚   β”œβ”€β”€ collection_manager.py   # Collection CRUD operations
β”‚   β”œβ”€β”€ document_manager.py     # Document CRUD operations
β”‚   β”œβ”€β”€ search_manager.py       # Unified search interface
β”‚   β”œβ”€β”€ storage.py              # JSON file persistence
β”‚   └── models/                 # Data models
β”‚       β”œβ”€β”€ collection.py       # Collection model
β”‚       β”œβ”€β”€ document.py         # Document model
β”‚       β”œβ”€β”€ search.py           # Search request/response models
β”‚       β”œβ”€β”€ responses.py        # API response models
β”‚       └── errors.py           # Custom exceptions
β”œβ”€β”€ pages/                      # Streamlit multi-page app
β”‚   β”œβ”€β”€ 1_How_It_Works.py       # Interactive documentation
β”‚   └── 2_Collections.py        # Collection management UI
β”œβ”€β”€ ui/                         # Shared UI components
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ shared_components.py    # Branding, navigation, CSS
β”‚   └── sidebar_components.py   # Retrieval settings, config display
β”œβ”€β”€ utils/                      # Utility functions
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── retry_utils.py          # API retry decorators
β”œβ”€β”€ tests/                      # Test suite
β”‚   β”œβ”€β”€ conftest.py             # pytest fixtures
β”‚   └── test_*.py               # Unit and integration tests
β”œβ”€β”€ data/                       # Data storage
β”‚   β”œβ”€β”€ collections.json        # Collection metadata
β”‚   └── documents.json          # Document metadata
β”œβ”€β”€ screenshots/                # Application screenshots
β”œβ”€β”€ pytest.ini                  # pytest configuration
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ HOW_IT_WORKS.md             # Detailed technical documentation
β”œβ”€β”€ .env.example                # Environment variables template
β”œβ”€β”€ .gitignore                  # Git ignore rules
└── README.md                   # This file

### RAG Pipeline

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  PDF Upload  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Document        β”‚
β”‚ Processor       β”‚ (PDF β†’ Pages β†’ Chunks)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Vector Store    β”‚
β”‚ Manager         β”‚ (Chunks β†’ Embeddings β†’ ChromaDB)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User Question   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Hybrid          β”‚
β”‚ Retriever       β”‚ (Semantic + BM25 β†’ RRF Fusion)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Re-ranker       β”‚
β”‚ (Optional)      β”‚ (Cross-encoder scoring)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ QA Chain        β”‚ (Context β†’ GPT-4o-mini β†’ Answer)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Streamed Answer β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

## Prerequisites

- Python 3.8 or higher
- An OpenAI API key (Get one here)
- 2GB+ free disk space (for the ChromaDB vector store)

### Optional Dependencies

- sentence-transformers, for the Jina re-ranker (local, preferred): pip install sentence-transformers
- A Cohere API key (Get one here), for the Cohere re-ranker (cloud fallback)

The system prefers Jina (local) and falls back to Cohere (cloud) if Jina is unavailable.
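The local-first fallback can be sketched roughly as follows. This is a hypothetical function, not the project's actual code (which lives in core/reranker.py and may differ); it only illustrates the selection order described above.

```python
# Hypothetical sketch of reranker auto-selection: prefer the local Jina
# model when sentence-transformers is importable, otherwise fall back to
# Cohere when an API key is present.
import importlib.util
import os
from typing import Optional

def select_reranker(provider: str = "auto") -> Optional[str]:
    """Return 'jina', 'cohere', or None when no reranker is available."""
    has_jina = importlib.util.find_spec("sentence_transformers") is not None
    has_cohere = bool(os.environ.get("COHERE_API_KEY"))
    if provider == "jina":
        return "jina" if has_jina else None
    if provider == "cohere":
        return "cohere" if has_cohere else None
    # "auto": local first, then cloud
    if has_jina:
        return "jina"
    if has_cohere:
        return "cohere"
    return None
```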

## Installation

### 1. Clone the Repository

```bash
git clone https://github.com/shrimpy8/semantic-search.git
cd semantic-search
```

### 2. Create a Virtual Environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### 3. Install Dependencies

```bash
pip install -r requirements.txt
```

### 4. Configure Environment Variables

```bash
# Copy the example environment file
cp .env.example .env
```

Then edit .env and add your API keys:

```
OPENAI_API_KEY=your_openai_api_key_here
COHERE_API_KEY=your_cohere_api_key_here  # Optional - for the Cohere reranker
```

⚠️ SECURITY NOTE: Never commit your .env file to version control. It is already listed in .gitignore.

## Usage

### Running the Application

```bash
streamlit run app.py
```

The application opens in your default browser at http://localhost:8501.

### Application Pages

The app has three main pages accessible via sidebar navigation:

| Page | Purpose |
|------|---------|
| Search | Upload documents and search (home page) |
| How It Works | Interactive documentation and configuration guide |
| Collections | Organize documents into searchable collections |

### Using the Search Page (Home)

1. Upload a PDF: Click "Select a PDF file" and choose your document
2. Wait for processing: The app extracts text, creates chunks, and generates embeddings
3. View the Documents panel: Expand it to see the document list and statistics (chunks, size, overlap)
4. Select a retrieval preset: Choose High Precision, Balanced, or High Recall from the sidebar
5. Ask questions: Type your question in the chat input at the bottom
6. View scores: Each result shows relevance scores (semantic, BM25, rerank)
7. View context: Expand the source sections to see the chunks used for the answer

### Managing Collections

1. Navigate to Collections: Click "Collections" in the sidebar navigation
2. Create a collection: Enter a name and optional description, then click "Create Collection"
3. Upload documents: Select a collection, go to the "Upload" tab, and add PDF files
4. View documents: See all documents in a collection with their status and chunk counts
5. Search in a collection: Use the "Search" tab to search within a specific collection
6. Delete: Remove individual documents or entire collections (cascade deletes all associated data)

### Retrieval Presets

| Preset | Results (k) | Alpha | Reranking | Best For |
|--------|-------------|-------|-----------|----------|
| High Precision | 3 | 0.7 | On | Specific questions, exact answers |
| Balanced | 5 | 0.5 | On | General queries (default) |
| High Recall | 10 | 0.3 | Off | Comprehensive research |

### A/B Testing

1. Upload a document on the Search page
2. Expand the A/B Testing panel (below the Documents panel)
3. Enter a test query representative of real usage
4. Click "Run Comparison" to automatically test all four methods:
   - Semantic (pure vector search)
   - BM25 (pure keyword search)
   - Hybrid (combined, no reranking)
   - Hybrid + Rerank (combined with reranking)
5. View the results: average score, latency, and the recommended variant
6. Export to CSV for deeper analysis

βš™οΈ Configuration

All settings are managed in config.yaml. Customize without changing code:

Model Configuration

models:
  embedding:
    name: "text-embedding-3-large"  # OpenAI embedding model
  chat:
    name: "gpt-4o-mini"              # Chat model
    temperature: 0.0                 # 0.0 = deterministic

### Retrieval Presets

```yaml
retrieval_presets:
  high_precision:
    search_k: 3
    alpha: 0.7
    reranking: true
    description: "Fewer, highly relevant results"
  balanced:
    search_k: 5
    alpha: 0.5
    reranking: true
    description: "Good balance of precision and recall"
  high_recall:
    search_k: 10
    alpha: 0.3
    reranking: false
    description: "More results, broader coverage"
```
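In code, a preset is just a bundle of retriever arguments. A minimal sketch (names hypothetical, mirroring the YAML above; the real app loads these via config_loader.py):

```python
# Presets mirroring config.yaml, applied as search keyword arguments.
RETRIEVAL_PRESETS = {
    "high_precision": {"search_k": 3, "alpha": 0.7, "reranking": True},
    "balanced": {"search_k": 5, "alpha": 0.5, "reranking": True},
    "high_recall": {"search_k": 10, "alpha": 0.3, "reranking": False},
}

def apply_preset(name: str) -> dict:
    """Look up a preset, falling back to 'balanced' for unknown names."""
    return RETRIEVAL_PRESETS.get(name, RETRIEVAL_PRESETS["balanced"])
```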

### Hybrid Retrieval

```yaml
hybrid_retrieval:
  enabled: true
  default_method: "hybrid"
  alpha: 0.5              # 0 = BM25 only, 1 = semantic only
  rrf_k: 60               # RRF fusion constant

  bm25:
    k1: 1.5               # Term frequency saturation
    b: 0.75               # Length normalization

  reranking:
    enabled: true
    # Provider options:
    #   - "auto": Jina (local) first, then Cohere (cloud)
    #   - "jina": force the local model
    #   - "cohere": force the cloud API
    provider: "auto"
    fetch_k_multiplier: 3 # Fetch 3x candidates for reranking
```
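To see what the k1 and b knobs control, here is the standard BM25 per-term score written out. This is illustrative only; the app almost certainly uses a library implementation rather than this hand-rolled formula.

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int, doc_len: int,
                    avg_len: float, k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document's score.

    k1 controls term-frequency saturation (higher tf helps less and less);
    b controls how strongly long documents are penalised (0 = not at all).
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm_tf
```

With k1 = 1.5, going from 1 to 2 occurrences of a term raises the score far more than going from 9 to 10: that is the saturation effect.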

### Document Processing

```yaml
document_processing:
  chunk_size: 1000        # Characters per chunk
  chunk_overlap: 200      # Overlap between chunks
  add_start_index: true   # Add chunk-position metadata
```
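The effect of chunk_size and chunk_overlap can be shown with a simplified character-based splitter. The real pipeline uses LangChain's RecursiveCharacterTextSplitter, which additionally respects separators such as paragraph breaks; this sketch splits purely on character count.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into fixed-size character chunks with overlap.

    Each chunk starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append({"text": text[start:start + chunk_size],
                       "start_index": start})  # like add_start_index: true
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults above, a 2,500-character document yields chunks starting at positions 0, 800, and 1600.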

πŸ› οΈ Development

Module Overview

Module Purpose
core/document_processor.py PDF loading, text extraction, chunking
core/vector_store.py ChromaDB management, similarity search
core/hybrid_retriever.py BM25 + Semantic fusion with RRF
core/reranker.py Cohere/Jina cross-encoder re-ranking
core/qa_chain.py RAG pipeline, answer generation
core/collection_manager.py Collection CRUD operations
core/document_manager.py Document CRUD operations
core/search_manager.py Unified search interface
ui/shared_components.py Branding, navigation, shared CSS
ui/sidebar_components.py Retrieval settings, configuration display

### Testing

The project includes a comprehensive test suite with 290+ tests:

```bash
# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/ -v -m unit         # Unit tests only
pytest tests/ -v -m integration  # Integration tests

# Test coverage
pytest tests/ --cov=. --cov-report=html
```

Test Files:

- tests/test_models.py - Data models (Collection, Document, Search)
- tests/test_collection_manager.py - Collection CRUD operations
- tests/test_document_manager.py - Document CRUD operations
- tests/test_search_manager.py - Search functionality
- tests/test_vector_store.py - ChromaDB operations
- tests/test_hybrid_retriever.py - Hybrid search logic
- tests/test_ab_testing.py - A/B testing framework
- tests/test_conversation.py - Conversation history
- tests/test_config_loader.py - Configuration loading
- tests/test_integration.py - Real ChromaDB integration tests (document lifecycle, orphan cleanup)

### Code Style

- Follow PEP 8 guidelines
- Comprehensive docstrings for all classes and functions
- Type hints for function parameters and returns
- Structured logging with logging.getLogger(__name__)

πŸ“ Logging

Application logs are written to:

  • Console: Real-time output during execution
  • File: semantic_search.log (persistent logs)

Key log messages:

πŸ” Auto-selecting reranker...
βœ… RERANKER SELECTED: Jina (local)
# or if Jina unavailable:
βœ… RERANKER SELECTED: Cohere (cloud)

Adjust logging level in config.yaml:

logging:
  level: "INFO"  # DEBUG, INFO, WARNING, ERROR, CRITICAL
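A hypothetical helper showing how the console-plus-file setup above might be wired with the standard logging module; the project's actual setup may differ.

```python
import logging

def configure_logging(level: str = "INFO",
                      logfile: str = "semantic_search.log") -> logging.Logger:
    """Send log output to both the console and a file, as described above."""
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        handlers=[logging.StreamHandler(), logging.FileHandler(logfile)],
        force=True,  # replace any previously configured handlers
    )
    return logging.getLogger(__name__)
```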

## Security

### API Key Management

1. Never commit .env files to version control
2. Rotate keys immediately if accidentally exposed
3. Use environment variables for all sensitive data
4. Review SECURITY_NOTICE.md for detailed security guidelines

### Pre-commit Hook

Add to .git/hooks/pre-commit:

```bash
#!/bin/bash
if git diff --cached --name-only | grep -E '\.env$'; then
    echo "❌ ERROR: Attempting to commit .env file!"
    exit 1
fi
exit 0
```

Make it executable:

```bash
chmod +x .git/hooks/pre-commit
```

πŸ› Troubleshooting

Common Issues

Error: "Configuration file not found"

  • Ensure config.yaml exists in the project directory
  • Check file permissions

Error: "OpenAI API key not found"

  • Verify .env file exists with valid OPENAI_API_KEY
  • Ensure python-dotenv is installed

Error: "Rate limit exceeded"

  • Wait for rate limit reset
  • Retry logic should handle this automatically
  • Check API quota/billing

No reranker available

  • Install sentence-transformers for Jina (preferred): pip install sentence-transformers
  • Or add COHERE_API_KEY to .env for Cohere (cloud fallback)

Documents lost after restart

  • This is now fixed - documents persist automatically
  • If issue persists, check ChromaDB connection

ChromaDB Permission Error

  • Ensure write permissions for ./chroma/db directory
  • Try deleting the directory and restarting

## How It Works

### 1. Document Upload & Processing

When you upload a PDF:

1. The file is temporarily saved to disk
2. PyPDFLoader extracts text from all pages
3. RecursiveCharacterTextSplitter creates overlapping chunks
4. The temporary file is cleaned up

### 2. Embedding & Indexing

For each chunk:

1. OpenAI text-embedding-3-large generates a 3072-dimension vector
2. The vector and its metadata are stored in ChromaDB
3. The collection is persisted to disk for future sessions
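The store-and-search idea can be sketched with a toy in-memory store; ChromaDB plays this role in the real app, with persistence and approximate-nearest-neighbour indexing on top. All names here are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class MiniVectorStore:
    """Toy stand-in for ChromaDB: store (vector, metadata) pairs,
    search by cosine similarity to a query vector."""

    def __init__(self):
        self.records = []

    def add(self, vector, metadata):
        self.records.append((vector, metadata))

    def search(self, query, k=3):
        scored = [(cosine(query, v), m) for v, m in self.records]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:k]
```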

### 3. Retrieval (Hybrid Search)

When you search:

1. The query runs through both the semantic and BM25 retrievers
2. Results are combined using Reciprocal Rank Fusion (RRF)
3. The alpha parameter controls the balance (configurable via presets)
4. Optional: cross-encoder re-ranking improves result quality
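The fusion step can be sketched as standard RRF with an alpha weight on the two rank lists, matching the description above; the project's exact weighting scheme may differ.

```python
def rrf_fuse(semantic_ranked, bm25_ranked, alpha=0.5, rrf_k=60):
    """Weighted Reciprocal Rank Fusion over two ranked lists of doc ids.

    Each list contributes 1 / (rrf_k + rank) per document, weighted by
    alpha (semantic) and 1 - alpha (BM25). alpha=1.0 keeps semantic
    ranks only; alpha=0.0 keeps BM25 only.
    """
    scores = {}
    for rank, doc_id in enumerate(semantic_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (rrf_k + rank + 1)
    for rank, doc_id in enumerate(bm25_ranked):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers ("b" below, second semantically and first by BM25) can beat a document that only one retriever loved.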

### 4. Answer Generation

After retrieval:

1. The top-k chunks are formatted as context
2. GPT-4o-mini generates an answer based solely on that context
3. The answer is streamed token by token to the UI
4. Source chunks are displayed with relevance scores
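The context-assembly step can be sketched as follows. The prompt wording is illustrative, not the project's actual prompt, and the model call itself is omitted; the returned messages would be passed to the chat completion API.

```python
def build_messages(question, chunks):
    """Format retrieved chunks into a grounded prompt that instructs the
    model to answer only from the supplied context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return [
        {"role": "system",
         "content": "Answer ONLY from the context below. If the answer is "
                    "not in the context, say you don't know.\n\n"
                    "Context:\n" + context},
        {"role": "user", "content": question},
    ]
```

Numbering the chunks ([1], [2], ...) is what makes it possible to show which source chunk supported which part of the answer.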

## Tips for Better Results

- Choose the right preset: High Precision for specific queries, High Recall for research
- Upload well-structured PDFs with clear text (avoid scanned images)
- Use the A/B testing framework to find optimal settings for your documents
- Check relevance scores to understand result quality
- Use Collections to organize documents by topic or project
- Enable reranking for important queries (at the cost of some latency)

## Limitations

- PDF only: Currently supports PDF files only (no DOCX, TXT, etc.)
- Text extraction: Quality depends on PDF structure (scanned PDFs won't work)
- Context window: Limited to the top-k chunks (may miss relevant information)
- English focused: Best performance with English documents
- Local storage: The vector store is persisted locally (not suitable for multi-user production)

πŸ—ΊοΈ Roadmap

Completed βœ…:

  • Hybrid search (BM25 + semantic) with RRF fusion
  • Re-ranking support (Cohere cloud + Jina local) with auto-selection
  • Conversation history and follow-up questions
  • Retrieval presets (High Precision / Balanced / High Recall)
  • Score visibility for search results
  • A/B testing framework for retrieval methods
  • Interactive "How It Works" documentation page
  • Multi-document collections with isolated searches
  • Sidebar navigation across all pages
  • Documents panel on home page
  • Per-collection search history persistence
  • Shared UI components (DRY refactoring)
  • Comprehensive test suite (290+ tests)
  • Document persistence across app restarts

Future enhancements planned:

  • Support for multiple file formats (DOCX, TXT, HTML, Markdown)
  • Document similarity and comparison features
  • Export search results to CSV/JSON
  • Advanced chunking strategies (sentence-based, semantic)
  • Citation tracking (show exact page/paragraph references)
  • Docker containerization
  • Next.js production UI (Stage 2)
  • FastAPI backend with Supabase (Stage 2)
  • User authentication and multi-user support

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

## License

This project is licensed under the MIT License; see the LICENSE file for details.

πŸ™ Acknowledgments

## Author

Harsh

## Support

If you encounter any issues or have questions:

1. Check the Troubleshooting section
2. Review existing GitHub Issues
3. Create a new issue with a detailed description

Made with ❤️ using LangChain, ChromaDB, and OpenAI
