A complete, production-ready Retrieval-Augmented Generation (RAG) system built with modular Python architecture. This project demonstrates how to build a scalable document processing and intelligent question-answering system using modern AI tools.
This RAG pipeline provides:
- Multi-format document processing (PDF, TXT, CSV, Excel, Word, JSON)
- Intelligent document chunking with overlap for better context preservation
- FAISS vector database for efficient similarity search and persistence
- Sentence Transformers for high-quality embeddings
- Groq LLM integration for fast, accurate text generation
- Modular architecture with separate components for easy maintenance and scaling
- Document Ingestion: Automatic loading of multiple file formats
- Text Processing: Smart chunking with configurable size and overlap
- Vector Storage: FAISS-based vector database with persistence
- Semantic Search: Similarity search with configurable top-k results
- LLM Generation: Context-aware response generation with Groq
- Modular Design: Separate components for data loading, embedding, vector storage, and search
- Persistent Storage: FAISS index and metadata saved to disk
- Error Handling: Comprehensive error handling across all components
- Configurable Parameters: Easy customization of chunk sizes, models, and search parameters
- Environment Management: UV-based dependency management
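The chunking-with-overlap idea listed above can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not the project's actual splitter (which uses LangChain's text splitters); `chunk_text` is a hypothetical helper:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap`
    characters with the previous chunk to preserve context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by chunk_size - overlap so consecutive chunks overlap
        start += chunk_size - overlap
    return chunks

chunks = chunk_text("a" * 2500, chunk_size=1000, overlap=200)
print(len(chunks))     # → 4
print(len(chunks[0]))  # → 1000
```

Because each chunk repeats the tail of the previous one, a sentence that straddles a chunk boundary still appears intact in at least one chunk.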
- PDF Documents: Using PyPDF and PyMuPDF loaders
- Text Files: Plain text document processing
- Structured Data: CSV, Excel, and JSON file support
- Word Documents: DOCX file processing
- Metadata Preservation: Source tracking and document metadata
| Document Type | Extensions | Loader Used | Description | Use Cases |
|---|---|---|---|---|
| PDF | `.pdf` | `PyPDFLoader` | Extracts text from PDF documents with metadata | Research papers, reports, manuals, books |
| Plain Text | `.txt` | `TextLoader` | Direct text file processing | Code files, logs, documentation, notes |
| CSV | `.csv` | `CSVLoader` | Structured data with column headers | Data exports, spreadsheets, databases |
| Excel | `.xlsx`, `.xls` | `UnstructuredExcelLoader` | Excel workbooks and spreadsheets | Financial data, reports, structured datasets |
| Word | `.docx` | `Docx2txtLoader` | Microsoft Word documents | Business documents, reports, articles |
| JSON | `.json` | `JSONLoader` | Structured JSON data | API responses, configuration files, datasets |
- ✅ Automatic Format Detection: Based on file extensions
- ✅ Metadata Extraction: Source file, page numbers, timestamps
- ✅ Error Handling: Graceful handling of corrupted or unsupported files
- ✅ Batch Processing: Process entire directories at once
- ✅ Content Validation: Ensures non-empty content before processing
- ✅ Encoding Support: UTF-8 and various character encodings
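Automatic format detection boils down to an extension-to-loader lookup. A minimal sketch of that dispatch (the loader names mirror the table above; the `pick_loader` helper itself is illustrative, not the project's actual code):

```python
from pathlib import Path

# Extension -> loader-class name, mirroring the supported-formats table
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".xls": "UnstructuredExcelLoader",
    ".docx": "Docx2txtLoader",
    ".json": "JSONLoader",
}

def pick_loader(path: str) -> str:
    """Return the loader name for a file, or raise for unsupported types."""
    ext = Path(path).suffix.lower()  # case-insensitive extension match
    try:
        return LOADER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}") from None

print(pick_loader("data/pdf_files/report.PDF"))  # → PyPDFLoader
```

Raising on unknown extensions (rather than silently skipping) is what makes the "graceful handling of unsupported files" above possible: the caller can catch, log, and continue the batch.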
```mermaid
graph TD
    A[📁 Document Sources] --> B[🔍 Document Loader]
    B --> C{📋 Format Detection}
    C --> D[📄 PDF Loader]
    C --> E[📝 Text Loader]
    C --> F[📊 CSV Loader]
    C --> G[📑 Word Loader]
    C --> H[🗂️ JSON Loader]
    C --> I[📈 Excel Loader]
    D --> J[📖 Raw Documents]
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J
    J --> K[✂️ Text Chunking]
    K --> L[🧠 Embedding Generation]
    L --> M[🗄️ FAISS Vector Store]
    N[❓ User Query] --> O[🧠 Query Embedding]
    O --> P[🔍 Similarity Search]
    M --> P
    P --> Q[📋 Retrieved Contexts]
    Q --> R[🤖 LLM Generation]
    R --> S[✨ Final Answer]

    style A fill:#e1f5fe
    style S fill:#e8f5e8
    style M fill:#fff3e0
    style R fill:#f3e5f5
```
```python
# Automatically detects and loads multiple formats
docs = load_all_documents("data/")
# Supports: PDF, TXT, CSV, Excel, Word, JSON

# Smart chunking with overlap for context preservation
chunks = splitter.split_documents(documents)
# Default: 1000 chars with 200 char overlap

# High-quality embeddings using Sentence Transformers
embeddings = model.embed_documents([chunk.page_content for chunk in chunks])
# Model: all-MiniLM-L6-v2 (384 dimensions)

# Efficient storage and retrieval with FAISS
vectorstore.add_documents(chunks)
# Includes metadata and persistent storage

# Query processing and similarity search
similar_docs = vectorstore.similarity_search(query, k=top_k)
# Returns most relevant document chunks

# Context-aware response generation
response = llm.invoke(prompt_with_context)
# Uses Groq LLM with retrieved context
```

```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Raw Documents  │───▶│  Text Chunking  │───▶│   Embeddings    │
│                 │     │                 │     │                 │
│ • PDFs          │     │ • Smart Split   │     │ • 384D Vectors  │
│ • Text Files    │     │ • Overlap       │     │ • Normalized    │
│ • Structured    │     │ • Metadata      │     │ • Semantic      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
        ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Error Handling  │     │ Chunk Metadata  │     │   FAISS Index   │
│                 │     │                 │     │                 │
│ • Format Check  │     │ • Source Info   │     │ • Persistence   │
│ • Validation    │     │ • Page Numbers  │     │ • Fast Search   │
│ • Logging       │     │ • Timestamps    │     │ • Scalability   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
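The retrieval half of the pipeline — a query embedding scored against stored, normalized vectors — reduces to a dot-product top-k. A dependency-free sketch of the idea (FAISS does the same thing at scale; the toy 3-D vectors are made up for illustration, real embeddings from all-MiniLM-L6-v2 are 384-D):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, vectors, k=2):
    """Return indices of the k vectors most similar to the query.
    For unit-length vectors, the dot product equals cosine similarity."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]

# Toy document "embeddings"
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.05, 0.0], docs, k=2))  # → [0, 1]
```

Normalizing once at index time is why the diagram labels the stored vectors "Normalized": it turns every later similarity query into a plain inner product.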
| Stage | Time Complexity | Memory Usage | Optimization Tips |
|---|---|---|---|
| Document Loading | O(n) files | Low | Batch processing for large collections |
| Text Chunking | O(m) content | Medium | Adjust chunk size based on content type |
| Embedding | O(k) chunks | High | Use lighter models for speed |
| Vector Storage | O(k log k) | Medium | FAISS handles large-scale efficiently |
| Search | O(log k) | Low | Increase top_k for better recall |
| Generation | O(context) | Medium | Optimize prompt length |
```text
langchain-rag/
├── src/                    # Core application modules
│   ├── __init__.py         # Package initialization
│   ├── data_loader.py      # Multi-format document loading
│   ├── embedding.py        # Text chunking and embedding generation
│   ├── vectorstore.py      # FAISS vector database management
│   └── search.py           # RAG search and summarization
├── notebook/               # Jupyter notebooks for development
│   ├── document.ipynb      # Basic document processing examples
│   └── pdf_loader.ipynb    # Advanced RAG pipeline with ChromaDB
├── data/                   # Document storage
│   ├── text_files/         # Sample text documents
│   ├── pdf_files/          # PDF documents
│   └── [other formats]/    # CSV, Excel, Word, JSON files
├── faiss_store/            # Persistent FAISS vector database
│   ├── faiss.index         # FAISS vector index
│   └── metadata.pkl        # Document metadata
├── app.py                  # Main application entry point
├── main.py                 # Alternative entry point
├── pyproject.toml          # Project configuration and dependencies
├── requirements.txt        # Package requirements
├── .env                    # Environment variables
└── README.md               # This file
```
- LangChain: Document processing and text splitting
- FAISS: Vector similarity search and indexing
- Sentence Transformers: State-of-the-art embeddings
- Groq: Fast LLM inference
- NumPy: Numerical operations for embeddings
- PyPDF/PyMuPDF: PDF document processing
- python-docx: Word document handling
- pandas: CSV and Excel file processing
- json: JSON document parsing
- UV: Modern Python package management
- Jupyter: Interactive development environment
- python-dotenv: Environment variable management
- Pydantic: Configuration and data validation
- Python 3.13+: Latest Python version
- UV Package Manager: For dependency management
- Groq API Key: For LLM functionality
- Git: Version control
```bash
git clone https://github.com/aayushjainx/LangChain-RAG.git
cd langchain-rag
```

```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# or via Homebrew
brew install uv
```

```bash
# Create virtual environment with Python 3.13
uv venv --python 3.13

# Install dependencies
uv sync
```

Create a `.env` file in the project root:

```bash
# Required: Groq API Key
GROQ_API_KEY=your_groq_api_key_here

# Optional: Other API keys (if you plan to extend the project)
OPENAI_API_KEY=your_openai_api_key_here
```

Get a Groq API key:
- Visit console.groq.com
- Sign up/Sign in
- Navigate to API Keys
- Create new API key
Add your documents to the data/ directory:
```text
data/
├── text_files/      # .txt files
├── pdf_files/       # .pdf files
├── csv_files/       # .csv files
├── excel_files/     # .xlsx files
├── word_files/      # .docx files
└── json_files/      # .json files
```

```bash
# Run the main application
uv run python app.py

# Or run with main.py
uv run python main.py
```

```bash
# Activate environment
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

# Run application
python app.py
```

```bash
# Start Jupyter
uv run jupyter notebook

# Open notebooks for interactive development
# - notebook/document.ipynb (basic examples)
# - notebook/pdf_loader.ipynb (advanced pipeline)
```

```python
from src.data_loader import load_all_documents

# Load all supported documents from the data directory
docs = load_all_documents("data")
print(f"Loaded {len(docs)} documents")
```

Supported formats:

- PDF (`.pdf`) - via `PyPDFLoader`
- Text (`.txt`) - via `TextLoader`
- CSV (`.csv`) - via `CSVLoader`
- Excel (`.xlsx`) - via `UnstructuredExcelLoader`
- Word (`.docx`) - via `Docx2txtLoader`
- JSON (`.json`) - via `JSONLoader`
```python
from src.embedding import EmbeddingPipeline

# Initialize embedding pipeline
emb_pipeline = EmbeddingPipeline(
    model_name="all-MiniLM-L6-v2",
    chunk_size=1000,
    chunk_overlap=200
)

# Process documents
chunks = emb_pipeline.chunk_documents(documents)
embeddings = emb_pipeline.embed_chunks(chunks)
```

```python
from src.vectorstore import FaissVectorStore

# Initialize FAISS vector store
vectorstore = FaissVectorStore(
    persist_dir="faiss_store",
    embedding_model="all-MiniLM-L6-v2"
)

# Build from documents (first time)
vectorstore.build_from_documents(documents)

# Load existing store
vectorstore.load()

# Query the store
results = vectorstore.query("What is machine learning?", top_k=5)
```

```python
from src.search import RAGSearch

# Initialize RAG system
rag = RAGSearch(
    persist_dir="faiss_store",
    embedding_model="all-MiniLM-L6-v2",
    llm_model="llama-3.1-8b-instant"
)

# Search and generate summary
query = "What is machine learning?"
summary = rag.search_and_summarize(query, top_k=3)
print("Summary:", summary)
```

```python
from src.data_loader import load_all_documents
from src.vectorstore import FaissVectorStore
from src.search import RAGSearch

# 1. Load documents
docs = load_all_documents("data")

# 2. Initialize and build vector store (first time only)
vectorstore = FaissVectorStore("faiss_store")
vectorstore.build_from_documents(docs)

# 3. Initialize RAG search
rag_search = RAGSearch()

# 4. Query the system
query = "What are the key features of Python?"
summary = rag_search.search_and_summarize(query, top_k=3)
print("Answer:", summary)
```

```python
# Custom configuration
rag = RAGSearch(
    persist_dir="custom_store",
    embedding_model="all-mpnet-base-v2",  # Different embedding model
    llm_model="llama-3.1-70b-versatile"   # Different LLM model
)

# Multi-query processing
queries = [
    "What is machine learning?",
    "How does Python handle data science?",
    "What are the applications of AI?"
]
for query in queries:
    result = rag.search_and_summarize(query, top_k=5)
    print(f"Q: {query}")
    print(f"A: {result}\n")
```

Available embedding models:

- `all-MiniLM-L6-v2` (default) - Fast, good quality
- `all-mpnet-base-v2` - Higher quality, slower
- `all-distilroberta-v1` - Balanced performance

Available Groq LLM models:

- `llama-3.1-8b-instant` (default) - Fast inference
- `llama-3.1-70b-versatile` - Higher quality
- `mixtral-8x7b-32768` - Alternative model
- `gemma2-9b-it` - Compact model

Chunking parameters:

- `chunk_size`: Size of text chunks (default: 1000)
- `chunk_overlap`: Overlap between chunks (default: 200)
- `separators`: Text splitting separators
```bash
# Add pytest for testing
uv add --dev pytest

# Run tests
uv run pytest
```

```bash
# Add development tools
uv add --dev black isort flake8

# Format code
uv run black src/
uv run isort src/

# Lint code
uv run flake8 src/
```

```bash
# Add new package
uv add package-name

# Add development dependency
uv add --dev package-name

# Update requirements
uv export --no-hashes > requirements.txt
```

The application provides detailed logging:

- `[INFO]` - General information and progress
- `[DEBUG]` - Detailed debugging information
- `[ERROR]` - Error messages and stack traces
- ModuleNotFoundError: Use `uv run python app.py` instead of `python app.py`
- API Key Issues: Ensure the `.env` file contains a valid `GROQ_API_KEY`
- Memory Issues: Reduce `chunk_size` or process fewer documents
- Empty Results: Check that documents are in the correct `data/` subdirectories
For large document collections:

- Use smaller chunk sizes (500-800)
- Consider `all-MiniLM-L6-v2` for faster embeddings
- Implement batch processing for very large collections

For higher answer quality:

- Use `all-mpnet-base-v2` embeddings
- Increase chunk overlap (300-400)
- Use `llama-3.1-70b-versatile` for generation
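These knobs translate directly into embedding workload: each new chunk advances by `chunk_size - chunk_overlap` characters, so halving the chunk size roughly doubles the number of vectors to compute and store. A back-of-the-envelope helper (`estimated_chunks` is a hypothetical illustration, not part of the project):

```python
import math

def estimated_chunks(doc_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count: each chunk past the first advances
    chunk_size - overlap characters through the document."""
    step = chunk_size - overlap
    return max(1, math.ceil((doc_chars - overlap) / step))

# A 100k-character document under the default and a "speed" configuration
print(estimated_chunks(100_000, 1000, 200))  # → 125 (default 1000/200)
print(estimated_chunks(100_000, 500, 100))   # → 250 (smaller chunks, more embeddings)
```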
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain - Document processing framework
- FAISS - Efficient similarity search
- Sentence Transformers - High-quality embeddings
- Groq - Fast LLM inference
- UV - Modern Python package management
- Documentation: Check component docstrings and code comments
- Issues: Open GitHub issues for bugs or feature requests
- API Documentation:
🎉 Happy coding! Build amazing RAG applications with this production-ready pipeline!