A RAG (Retrieval Augmented Generation) application built with LangChain, ChromaDB, and Streamlit. Upload PDF documents and ask natural language questions to retrieve contextually relevant answers powered by OpenAI.
When you ask a question, the system follows this pipeline:
| Step | Component | Description |
|---|---|---|
| 1. Chunk | Document Processor | PDF split into searchable text pieces |
| 2. Embed | OpenAI | Creates vector embeddings (text-embedding-3-large) |
| 3. Hybrid Search | BM25 + Semantic | Combines keyword and meaning-based search |
| 4. Rerank | Jina (local) | Cross-encoder refines result relevance |
| 5. Answer | GPT-4o-mini | Generates response based only on your documents |
Note: The LLM only answers based on your uploaded documents, not general knowledge. Search on the home page includes all documents and collections.
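At a high level, the retrieve-then-generate flow above is just function composition. A minimal sketch with toy stand-ins (illustrative only, not the app's actual classes):

```python
def rag_answer(question, retrieve, rerank, generate, k=5):
    """Steps 3-5 of the table above, composed (illustrative sketch)."""
    candidates = retrieve(question)         # hybrid search over chunks
    top = rerank(question, candidates)[:k]  # cross-encoder refinement
    return generate(question, top)          # answer grounded only in top chunks
```

In the real app, `retrieve` corresponds to the hybrid retriever, `rerank` to the Jina/Cohere cross-encoder, and `generate` to the GPT-4o-mini call.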
- PDF Document Processing: Upload and parse PDF files with automatic text extraction
- Semantic Chunking: Intelligent text splitting with configurable chunk sizes and overlap
- Vector Embeddings: High-quality embeddings using OpenAI text-embedding-3-large
- Persistent Vector Store: ChromaDB for efficient similarity search with disk persistence
- Natural Language Queries: Ask questions in plain language about your documents
- Contextual Answers: GPT-4o-mini generates answers based solely on document content
- Real-time Streaming: Token-by-token answer generation for responsive UX
- Document Persistence: Uploaded documents persist across app restarts
- Sidebar Navigation: Quick access to all pages (Search, How It Works, Collections)
- Documents Panel: View document statistics (chunks, size, overlap) and list all uploaded documents on the home page
- Consistent Branding: Unified styling and navigation across all pages
- Responsive Layout: Works well on different screen sizes
- Document Collections: Organize documents into searchable collections
- Scoped Search: Search within specific collections or across all documents
- Collection Stats: Track document counts and chunk counts per collection
- Per-Collection Settings: Configure chunk size and overlap per collection
- Cascade Deletion: Deleting a collection removes all associated documents and chunks
- Duplicate Detection: Prevents uploading the same document twice to a collection
- Search History: Per-collection search results persist during session navigation
- Hybrid Search: Combine semantic (vector) and BM25 (keyword) search with RRF fusion
- Retrieval Presets: Pre-configured profiles accessible from the sidebar dropdown:
  - High Precision: Fewer, highly relevant results (k=3, alpha=0.7, reranking on)
  - Balanced: Good mix of precision and recall (k=5, alpha=0.5, reranking on)
  - High Recall: More results, broader coverage (k=10, alpha=0.3, reranking off)
- Re-ranking: Cross-encoder re-ranking with a configurable provider:
  - Jina (local): No API key and no network latency - requires `sentence-transformers`
  - Cohere (cloud): Fast API, excellent quality - requires `COHERE_API_KEY`
  - Auto mode (default): Tries Jina (local) first, falls back to Cohere (cloud)
  - Override: Set `provider: "cohere"` in `config.yaml` to force the cloud API
- Score Visibility: View semantic, BM25, and rerank scores for each result
- Conversation History: Context-aware follow-up questions within sessions
- A/B Testing Framework: Compare all 4 retrieval methods (semantic, BM25, hybrid, hybrid+rerank) with a single click
- Duplicate Detection: Automatic check before uploading already-indexed PDFs
- Context Transparency: View exactly which document chunks were used for answers
- Robust Error Handling: Automatic retry logic with exponential backoff for API resilience
- Structured Logging: Comprehensive logging for debugging and monitoring
- YAML Configuration: Easily customizable settings without code changes
- Modular Architecture: Clean separation of concerns for maintainability
- Comprehensive Test Suite: 290+ unit and integration tests
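The retry-with-exponential-backoff behavior listed above can be sketched as a decorator. This is illustrative; the real implementation lives in `utils/retry_utils.py` and may differ in details:

```python
import functools
import time

def with_retries(max_attempts=3, base_delay=1.0):
    """Retry a flaky API call with exponential backoff (illustrative sketch)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        return wrapper
    return decorator
```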
Upload PDFs and view your document library with statistics:
Ask questions and get AI-generated answers with source context:
View retrieved chunks with semantic and rerank scores:
Organize documents into searchable collections:
View and manage documents within a collection:
Search within a specific collection for focused results:
The app includes a comprehensive "How It Works" page with interactive guides:
| Guide | Description | Screenshot |
|---|---|---|
| Retrieval Methods | Semantic, BM25, and hybrid search explained | View |
| Precision vs Recall | Trade-offs and preset configurations | View |
| Collections | How to organize documents effectively | View |
| A/B Testing | Compare retrieval methods empirically | View |
| Configuration | All settings explained with examples | View |
| Conversation History | Context-aware follow-up questions | View |
```
semantic-search/
├── app.py                      # Streamlit UI application
├── config.yaml                 # Centralized configuration
├── config_loader.py            # Configuration management
├── core/                       # Core business logic
│   ├── __init__.py
│   ├── document_processor.py   # PDF loading and chunking
│   ├── vector_store.py         # ChromaDB management
│   ├── qa_chain.py             # Question answering pipeline
│   ├── hybrid_retriever.py     # Hybrid search (BM25 + Semantic)
│   ├── bm25_retriever.py       # BM25 keyword search
│   ├── reranker.py             # Cohere/Jina re-ranking
│   ├── conversation.py         # Conversation history management
│   ├── ab_testing.py           # A/B testing framework
│   ├── collection_manager.py   # Collection CRUD operations
│   ├── document_manager.py     # Document CRUD operations
│   ├── search_manager.py       # Unified search interface
│   ├── storage.py              # JSON file persistence
│   └── models/                 # Data models
│       ├── collection.py       # Collection model
│       ├── document.py         # Document model
│       ├── search.py           # Search request/response models
│       ├── responses.py        # API response models
│       └── errors.py           # Custom exceptions
├── pages/                      # Streamlit multi-page app
│   ├── 1_How_It_Works.py       # Interactive documentation
│   └── 2_Collections.py        # Collection management UI
├── ui/                         # Shared UI components
│   ├── __init__.py
│   ├── shared_components.py    # Branding, navigation, CSS
│   └── sidebar_components.py   # Retrieval settings, config display
├── utils/                      # Utility functions
│   ├── __init__.py
│   └── retry_utils.py          # API retry decorators
├── tests/                      # Test suite
│   ├── conftest.py             # pytest fixtures
│   └── test_*.py               # Unit and integration tests
├── data/                       # Data storage
│   ├── collections.json        # Collection metadata
│   └── documents.json          # Document metadata
├── screenshots/                # Application screenshots
├── pytest.ini                  # pytest configuration
├── requirements.txt            # Python dependencies
├── HOW_IT_WORKS.md             # Detailed technical documentation
├── .env.example                # Environment variables template
├── .gitignore                  # Git ignore rules
└── README.md                   # This file
```
```
┌──────────────┐
│  PDF Upload  │
└──────┬───────┘
       │
       ▼
┌─────────────────┐
│    Document     │
│    Processor    │  (PDF → Pages → Chunks)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Store   │
│    Manager      │  (Chunks → Embeddings → ChromaDB)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  User Question  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Hybrid      │
│    Retriever    │  (Semantic + BM25 → RRF Fusion)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Re-ranker    │
│   (Optional)    │  (Cross-encoder scoring)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    QA Chain     │  (Context → GPT-4o-mini → Answer)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Streamed Answer │
└─────────────────┘
```
- Python 3.8 or higher
- OpenAI API key (Get one here)
- 2GB+ free disk space (for ChromaDB vector store)
- `sentence-transformers` - for the Jina re-ranker (local, preferred): `pip install sentence-transformers`
- Cohere API key (Get one here) - for the Cohere re-ranker (cloud fallback)
- The system tries Jina (local) first and falls back to Cohere (cloud) if Jina is unavailable
```bash
git clone https://github.com/shrimpy8/semantic-serach.git
cd semantic-serach
```

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

```bash
pip install -r requirements.txt
```

```bash
# Copy the example environment file
cp .env.example .env
```

```
# Edit .env and add your API keys
OPENAI_API_KEY=your_openai_api_key_here
COHERE_API_KEY=your_cohere_api_key_here  # Optional - for Cohere reranker
```

Never commit your `.env` file to version control. It's already in `.gitignore`.

```bash
streamlit run app.py
```

The application will open in your default browser at http://localhost:8501.
The app has three main pages accessible via sidebar navigation:
| Page | Purpose |
|---|---|
| Search | Upload documents and search (home page) |
| How It Works | Interactive documentation and configuration guide |
| Collections | Organize documents into searchable collections |
- Upload PDF: Click "Select a PDF file" and choose your document
- Wait for Processing: The app extracts text, creates chunks, and generates embeddings
- View Documents Panel: Expand to see document list and statistics (chunks, size, overlap)
- Select Retrieval Preset: Choose High Precision, Balanced, or High Recall from sidebar
- Ask Questions: Type your question in the chat input at the bottom
- View Scores: Each result shows relevance scores (semantic, BM25, rerank)
- View Context: Expand source sections to see the chunks used for the answer
- Navigate to Collections: Click "Collections" in the sidebar navigation
- Create a Collection: Enter a name and optional description, then click "Create Collection"
- Upload Documents: Select a collection, go to the "Upload" tab, and add PDF files
- View Documents: See all documents in a collection with their status and chunk counts
- Search in Collection: Use the "Search" tab to search within a specific collection
- Delete: Remove individual documents or entire collections (cascade deletes all associated data)
| Preset | Results (k) | Alpha | Reranking | Best For |
|---|---|---|---|---|
| High Precision | 3 | 0.7 | On | Specific questions, exact answers |
| Balanced | 5 | 0.5 | On | General queries (default) |
| High Recall | 10 | 0.3 | Off | Comprehensive research |
- Upload a document on the Search page
- Expand the A/B Testing panel (below Documents panel)
- Enter a test query representative of real usage
- Click "Run Comparison" - automatically tests all 4 methods:
- Semantic (pure vector search)
- BM25 (pure keyword search)
- Hybrid (combined, no reranking)
- Hybrid + Rerank (combined with reranking)
- View results: Average score, latency, and recommended variant
- Export to CSV for deeper analysis
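The "recommended variant" step boils down to comparing average scores across the four methods. A minimal sketch (a hypothetical helper, not the framework's actual API):

```python
def recommend_variant(results):
    """Pick the retrieval method with the highest mean relevance score.

    `results` maps a variant name (e.g. "semantic", "bm25", "hybrid",
    "hybrid_rerank") to the list of scores its top-k results received.
    """
    return max(results, key=lambda v: sum(results[v]) / len(results[v]))
```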
All settings are managed in `config.yaml`. Customize without changing code:

```yaml
models:
  embedding:
    name: "text-embedding-3-large"  # OpenAI embedding model
  chat:
    name: "gpt-4o-mini"             # Chat model
    temperature: 0.0                # 0.0 = deterministic
```

```yaml
retrieval_presets:
  high_precision:
    search_k: 3
    alpha: 0.7
    reranking: true
    description: "Fewer, highly relevant results"
  balanced:
    search_k: 5
    alpha: 0.5
    reranking: true
    description: "Good balance of precision and recall"
  high_recall:
    search_k: 10
    alpha: 0.3
    reranking: false
    description: "More results, broader coverage"
```

```yaml
hybrid_retrieval:
  enabled: true
  default_method: "hybrid"
  alpha: 0.5   # 0 = BM25 only, 1 = Semantic only
  rrf_k: 60    # RRF fusion constant
  bm25:
    k1: 1.5    # Term frequency saturation
    b: 0.75    # Length normalization

reranking:
  enabled: true
  # Provider options:
  #   - "auto": jina (local) first, then cohere (cloud)
  #   - "jina": force local model
  #   - "cohere": force cloud API
  provider: "auto"
  fetch_k_multiplier: 3  # Fetch 3x candidates for reranking
```

```yaml
document_processing:
  chunk_size: 1000       # Characters per chunk
  chunk_overlap: 200     # Overlap between chunks
  add_start_index: true  # Add metadata for chunk position
```

| Module | Purpose |
|---|---|
| `core/document_processor.py` | PDF loading, text extraction, chunking |
| `core/vector_store.py` | ChromaDB management, similarity search |
| `core/hybrid_retriever.py` | BM25 + Semantic fusion with RRF |
| `core/reranker.py` | Cohere/Jina cross-encoder re-ranking |
| `core/qa_chain.py` | RAG pipeline, answer generation |
| `core/collection_manager.py` | Collection CRUD operations |
| `core/document_manager.py` | Document CRUD operations |
| `core/search_manager.py` | Unified search interface |
| `ui/shared_components.py` | Branding, navigation, shared CSS |
| `ui/sidebar_components.py` | Retrieval settings, configuration display |
The project includes a comprehensive test suite with 290+ tests:

```bash
# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/ -v -m unit         # Unit tests only
pytest tests/ -v -m integration  # Integration tests

# Test coverage
pytest tests/ --cov=. --cov-report=html
```

Test Files:
- `tests/test_models.py` - Data models (Collection, Document, Search)
- `tests/test_collection_manager.py` - Collection CRUD operations
- `tests/test_document_manager.py` - Document CRUD operations
- `tests/test_search_manager.py` - Search functionality
- `tests/test_vector_store.py` - ChromaDB operations
- `tests/test_hybrid_retriever.py` - Hybrid search logic
- `tests/test_ab_testing.py` - A/B testing framework
- `tests/test_conversation.py` - Conversation history
- `tests/test_config_loader.py` - Configuration loading
- `tests/test_integration.py` - Real ChromaDB integration tests (document lifecycle, orphan cleanup)
- Follow PEP 8 guidelines
- Comprehensive docstrings for all classes and functions
- Type hints for function parameters and returns
- Structured logging with `logging.getLogger(__name__)`

Application logs are written to:
- Console: real-time output during execution
- File: `semantic_search.log` (persistent logs)
Key log messages:

```
Auto-selecting reranker...
RERANKER SELECTED: Jina (local)
# or, if Jina is unavailable:
RERANKER SELECTED: Cohere (cloud)
```
Adjust the logging level in `config.yaml`:

```yaml
logging:
  level: "INFO"  # DEBUG, INFO, WARNING, ERROR, CRITICAL
```

- Never commit `.env` files to version control
- Rotate keys immediately if accidentally exposed
- Use environment variables for all sensitive data
- Review `SECURITY_NOTICE.md` for detailed security guidelines
Add to `.git/hooks/pre-commit`:

```bash
#!/bin/bash
if git diff --cached --name-only | grep -E '\.env$'; then
    echo "ERROR: Attempting to commit .env file!"
    exit 1
fi
exit 0
```

Make executable:
```bash
chmod +x .git/hooks/pre-commit
```

**Error: "Configuration file not found"**
- Ensure `config.yaml` exists in the project directory
- Check file permissions

**Error: "OpenAI API key not found"**
- Verify the `.env` file exists with a valid `OPENAI_API_KEY`
- Ensure `python-dotenv` is installed

**Error: "Rate limit exceeded"**
- Wait for the rate limit to reset
- Retry logic should handle this automatically
- Check your API quota/billing

**No reranker available**
- Install `sentence-transformers` for Jina (preferred): `pip install sentence-transformers`
- Or add `COHERE_API_KEY` to `.env` for Cohere (cloud fallback)

**Documents lost after restart**
- This is now fixed - documents persist automatically
- If the issue persists, check the ChromaDB connection

**ChromaDB Permission Error**
- Ensure write permissions for the `./chroma/db` directory
- Try deleting the directory and restarting
When you upload a PDF:
- File is temporarily saved to disk
- PyPDFLoader extracts text from all pages
- RecursiveCharacterTextSplitter creates overlapping chunks
- Temporary file is cleaned up
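The overlapping-window idea behind chunking can be shown in a few lines. This is a simplified character splitter for illustration; the app actually uses LangChain's `RecursiveCharacterTextSplitter`, which also tries to respect paragraph and sentence boundaries:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping chunks (simplified illustration)."""
    step = chunk_size - chunk_overlap  # advance by chunk_size minus overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            # start_index mirrors the add_start_index metadata option
            chunks.append({"text": piece, "start_index": start})
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, consecutive chunks share 200 characters, so a sentence cut at a boundary still appears whole in one of them.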
For each chunk:
- OpenAI text-embedding-3-large generates 3072-dimension vector
- Vector + metadata stored in ChromaDB
- Collection persisted to disk for future sessions
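Similarity search over the stored vectors reduces to comparing a query embedding against each chunk embedding, typically by cosine similarity. Shown here from scratch for clarity; ChromaDB computes this internally:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```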
When you search:
- Query processed through both semantic and BM25 retrievers
- Results combined using Reciprocal Rank Fusion (RRF)
- Alpha parameter controls the balance (configurable via presets)
- Optional: Cross-encoder re-ranking for improved quality
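Reciprocal Rank Fusion is simple enough to show in full: each document earns `1 / (rrf_k + rank)` from every list that ranks it, weighted by alpha. A sketch consistent with the `alpha` and `rrf_k` settings in `config.yaml` (not the app's exact code):

```python
def rrf_fuse(semantic_ids, bm25_ids, rrf_k=60, alpha=0.5):
    """Fuse two ranked lists of chunk IDs with weighted RRF."""
    scores = {}
    for rank, doc_id in enumerate(semantic_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (rrf_k + rank)
    for rank, doc_id in enumerate(bm25_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (rrf_k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked by both retrievers accumulates score from both lists, which is why agreement between semantic and keyword search pushes a chunk to the top.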
After retrieval:
- Top-K chunks formatted as context
- GPT-4o-mini generates answer based solely on context
- Answer streamed token-by-token to UI
- Source chunks displayed with relevance scores
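Grounding the model "solely on context" happens in the prompt itself. A hedged sketch of what such a template looks like (the actual template in `core/qa_chain.py` may differ):

```python
def build_prompt(question, chunks):
    """Format retrieved chunks into a grounded QA prompt (illustrative)."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```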
- Choose the right preset: High Precision for specific queries, High Recall for research
- Upload well-structured PDFs with clear text (avoid scanned images)
- Use the A/B testing framework to find optimal settings for your documents
- Check relevance scores to understand result quality
- Use Collections to organize documents by topic/project
- Enable reranking for important queries (at the cost of some latency)
- PDF Only: Currently supports PDF files only (no DOCX, TXT, etc.)
- Text Extraction: Quality depends on PDF structure (scanned PDFs won't work)
- Context Window: Limited to top-K chunks (may miss relevant information)
- English Focused: Best performance with English documents
- Local Storage: Vector store persisted locally (not suitable for multi-user production)
Completed:
- Hybrid search (BM25 + semantic) with RRF fusion
- Re-ranking support (Cohere cloud + Jina local) with auto-selection
- Conversation history and follow-up questions
- Retrieval presets (High Precision / Balanced / High Recall)
- Score visibility for search results
- A/B testing framework for retrieval methods
- Interactive "How It Works" documentation page
- Multi-document collections with isolated searches
- Sidebar navigation across all pages
- Documents panel on home page
- Per-collection search history persistence
- Shared UI components (DRY refactoring)
- Comprehensive test suite (290+ tests)
- Document persistence across app restarts
Future enhancements planned:
- Support for multiple file formats (DOCX, TXT, HTML, Markdown)
- Document similarity and comparison features
- Export search results to CSV/JSON
- Advanced chunking strategies (sentence-based, semantic)
- Citation tracking (show exact page/paragraph references)
- Docker containerization
- Next.js production UI (Stage 2)
- FastAPI backend with Supabase (Stage 2)
- User authentication and multi-user support
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain - LLM application framework
- ChromaDB - Vector database
- Streamlit - Web application framework
- OpenAI - Embeddings and chat models
- Cohere - Re-ranking API
- Jina AI - Local re-ranking models
Harsh
- GitHub: @shrimpy8
- Repository: semantic-serach
If you encounter any issues or have questions:
- Check the Troubleshooting section
- Review existing GitHub Issues
- Create a new issue with detailed description
Made with ❤️ using LangChain, ChromaDB, and OpenAI