A Python-based hierarchical retrieval-augmented generation (RAG) system using LangChain and Chroma vector database with support for multiple AI providers.
This project implements a sophisticated hierarchical RAG system that can process and query PDF documents using advanced retrieval techniques. The system uses Chroma as the vector database for document storage and retrieval, with support for both OpenAI and Azure OpenAI providers.
```text
hrm/
├── chroma/                      # Chroma vector database files
│   ├── *.bin                    # Database index files
│   ├── chroma.sqlite3           # SQLite database
│   └── parents.jsonl            # Document hierarchy metadata
├── data/                        # Documents for processing
│   ├── *.pdf                    # Research papers and documents
│   ├── *.md                     # Markdown files
│   └── *.txt                    # Text files
├── hier_rag_langchain.py        # Main hierarchical RAG implementation
├── hier_rag_langchain-query.py  # Query interface implementation
├── env.example                  # Environment configuration template
├── requirements.txt             # Python dependencies
├── readme.txt                   # Quick start guide
└── README.md                    # This file
```
- Hierarchical Document Processing: Organizes documents in a hierarchical structure for better retrieval
- Multi-Format Document Support: Processes PDF, Markdown, and text files
- Two-Stage Retrieval: Global summary index narrows search to candidate sections, then fine-grained index retrieves precise spans
- Multiple AI Provider Support: Compatible with OpenAI and Azure OpenAI
- Chroma Vector Database: Uses Chroma for efficient vector storage and similarity search
- LangChain Integration: Leverages LangChain for RAG pipeline implementation
- Document Compression: LLM compression + embedding de-duplication via DocumentCompressorPipeline
- Streaming Responses: Real-time streaming of AI responses
- Lightweight Citations: Clean citation system for source attribution
- Clone the repository:

```bash
git clone <repository-url>
cd hrm
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables:

```bash
cp env.example .env
# Edit .env with your API keys and configuration
```

Create a `.env` file based on `env.example`:
For OpenAI:

```bash
PROVIDER=openai
OPENAI_API_KEY=sk-...
```

For Azure OpenAI:

```bash
PROVIDER=azure
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-02-15-preview
AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-4o-mini
AZURE_OPENAI_EMBED_DEPLOYMENT=text-embedding-3-large
```
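The scripts presumably read these variables at startup to decide which provider to initialize. A minimal sketch of that pattern, assuming a `select_provider` helper and default model names — both are illustrative, not the project's actual code:

```python
def select_provider(env: dict) -> dict:
    """Pick chat/embedding settings based on PROVIDER (hypothetical helper)."""
    provider = env.get("PROVIDER", "openai").lower()
    if provider == "azure":
        required = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]
        missing = [k for k in required if not env.get(k)]
        if missing:
            raise ValueError(f"Missing Azure settings: {missing}")
        return {
            "provider": "azure",
            # Azure addresses models by deployment name, not model name
            "chat": env.get("AZURE_OPENAI_CHAT_DEPLOYMENT", "gpt-4o-mini"),
            "embed": env.get("AZURE_OPENAI_EMBED_DEPLOYMENT", "text-embedding-3-large"),
        }
    if not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is required when PROVIDER=openai")
    return {"provider": "openai", "chat": "gpt-4o-mini", "embed": "text-embedding-3-large"}

cfg = select_provider({"PROVIDER": "openai", "OPENAI_API_KEY": "sk-test"})
```

In practice the same dictionary would come from `os.environ` after `python-dotenv` loads the `.env` file.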
- Place your documents in the `data/` directory (supports PDF, Markdown, and text files)
- Build the index:

```bash
python hier_rag_langchain.py --build ./data
```

- Ask questions:

```bash
python hier_rag_langchain.py --ask "What does the spec say about token limits and rate limiting?"
```

The system provides two main scripts:

- `hier_rag_langchain.py`: Main implementation with CLI interface
- `hier_rag_langchain-query.py`: Query interface implementation
The system automatically:
- Processes documents from the `data/` directory
- Creates vector embeddings using the configured AI provider
- Builds a hierarchical structure for document organization
- Provides advanced query capabilities with two-stage retrieval
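The `parents.jsonl` file under `chroma/` suggests that parent sections are persisted alongside the vector index. A toy sketch of that hierarchy, assuming each parent section is split into overlapping child chunks that carry a `parent_id` back-reference (the function name, record shape, and character-based splitting are all illustrative):

```python
import json

def build_hierarchy(sections, chunk_size=40, overlap=10):
    """Split each parent section into child chunks tagged with parent_id."""
    parents, children = [], []
    step = chunk_size - overlap
    for pid, text in enumerate(sections):
        parents.append({"parent_id": pid, "text": text})
        for start in range(0, max(len(text) - overlap, 1), step):
            children.append({
                "parent_id": pid,                     # back-reference to parent
                "text": text[start:start + chunk_size],
            })
    return parents, children

sections = ["Token limits are described in section 4.",
            "Rate limiting uses a sliding window."]
parents, children = build_hierarchy(sections)

# Parents could be persisted as JSON Lines, one record per line:
jsonl = "\n".join(json.dumps(p) for p in parents)
```

In the real pipeline the children would be embedded and stored in Chroma, while the parent records provide the coarse level for stage-one retrieval.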
Key dependencies include:
- `langchain>=0.2.11`: RAG pipeline framework
- `langchain-community>=0.2.11`: Community integrations
- `langchain-openai>=0.2.5`: OpenAI provider integration
- `langchain-chroma>=0.1.4`: Chroma vector database integration
- `chromadb>=0.5.4`: Vector database for document storage
- `tiktoken`: Token counting and text processing
- `pypdf`: PDF document processing
- `python-dotenv`: Environment variable management
- Input: PDF, Markdown, and text documents in the `data/` directory
- Storage: Chroma vector database in the `chroma/` directory
- Output: Hierarchical document structure with advanced query capabilities
The system implements a two-stage retrieval approach:
- Global Summary Index: Narrows search to candidate sections
- Fine-Grained Index: Retrieves precise spans from those sections
- Document Compression: Uses LLM compression and embedding de-duplication
- LCEL Answer Chain: Clean chain with lightweight citations
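The two-stage idea can be illustrated with a toy example, using hand-rolled cosine similarity over stub vectors in place of real embeddings and Chroma collections — everything below is a conceptual sketch, not the project's implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def two_stage(query_vec, parents, children, top_parents=1, top_children=2):
    """Stage 1: rank parent summaries; stage 2: rank children within the winners."""
    ranked = sorted(parents, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    keep = {p["id"] for p in ranked[:top_parents]}          # candidate sections
    pool = [c for c in children if c["parent_id"] in keep]  # restrict fine search
    pool.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return pool[:top_children]

parents = [{"id": 0, "vec": [1.0, 0.0]}, {"id": 1, "vec": [0.0, 1.0]}]
children = [
    {"parent_id": 0, "vec": [0.9, 0.1], "text": "token limits"},
    {"parent_id": 0, "vec": [0.8, 0.3], "text": "rate limiting"},
    {"parent_id": 1, "vec": [0.1, 0.9], "text": "unrelated"},
]
hits = two_stage([1.0, 0.2], parents, children)
# hits contains only children of the best-matching parent section
```

Restricting stage two to the winning parents is what keeps fine-grained search cheap and on-topic even as the corpus grows.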
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, support, or contributions, please contact: info@openagi.news