A complete, production-ready Retrieval-Augmented Generation (RAG) system built with modular Python architecture. This project demonstrates how to build a scalable document processing and intelligent question-answering system using modern AI tools.
This RAG pipeline provides:
- Multi-format document processing (PDF, TXT, CSV, Excel, Word, JSON)
- Intelligent document chunking with overlap for better context preservation
- FAISS vector database for efficient similarity search and persistence
- Sentence Transformers for high-quality embeddings
- Groq LLM integration for fast, accurate text generation
- Modular architecture with separate components for easy maintenance and scaling
- Document Ingestion: Automatic loading of multiple file formats
- Text Processing: Smart chunking with configurable size and overlap
- Vector Storage: FAISS-based vector database with persistence
- Semantic Search: Similarity search with configurable top-k results
- LLM Generation: Context-aware response generation with Groq
- Modular Design: Separate components for data loading, embedding, vector storage, and search
- Persistent Storage: FAISS index and metadata saved to disk
- Error Handling: Comprehensive error handling across all components
- Configurable Parameters: Easy customization of chunk sizes, models, and search parameters
- Environment Management: UV-based dependency management
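The chunking-with-overlap idea listed above can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not the project's actual splitter (which uses LangChain's text splitters); `chunk_text` is a hypothetical helper:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap`
    characters with the previous chunk to preserve context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by chunk_size - overlap so consecutive chunks overlap
        start += chunk_size - overlap
    return chunks

chunks = chunk_text("a" * 2500, chunk_size=1000, overlap=200)
print(len(chunks))     # → 4
print(len(chunks[0]))  # → 1000
```

Because each chunk repeats the tail of the previous one, a sentence that straddles a chunk boundary still appears intact in at least one chunk.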
- PDF Documents: Using PyPDF and PyMuPDF loaders
- Text Files: Plain text document processing
- Structured Data: CSV, Excel, and JSON file support
- Word Documents: DOCX file processing
- Metadata Preservation: Source tracking and document metadata
| Document Type | Extensions | Loader Used | Description | Use Cases |
|---|---|---|---|---|
| PDF | `.pdf` | `PyPDFLoader` | Extracts text from PDF documents with metadata | Research papers, reports, manuals, books |
| Plain Text | `.txt` | `TextLoader` | Direct text file processing | Code files, logs, documentation, notes |
| CSV | `.csv` | `CSVLoader` | Structured data with column headers | Data exports, spreadsheets, databases |
| Excel | `.xlsx`, `.xls` | `UnstructuredExcelLoader` | Excel workbooks and spreadsheets | Financial data, reports, structured datasets |
| Word | `.docx` | `Docx2txtLoader` | Microsoft Word documents | Business documents, reports, articles |
| JSON | `.json` | `JSONLoader` | Structured JSON data | API responses, configuration files, datasets |
- ✅ Automatic Format Detection: Based on file extensions
- ✅ Metadata Extraction: Source file, page numbers, timestamps
- ✅ Error Handling: Graceful handling of corrupted or unsupported files
- ✅ Batch Processing: Process entire directories at once
- ✅ Content Validation: Ensures non-empty content before processing
- ✅ Encoding Support: UTF-8 and various character encodings
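Automatic format detection boils down to an extension-to-loader lookup. A minimal sketch of that dispatch (the loader names mirror the table above; the `pick_loader` helper itself is illustrative, not the project's actual code):

```python
from pathlib import Path

# Extension -> loader-class name, mirroring the supported-formats table
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".xls": "UnstructuredExcelLoader",
    ".docx": "Docx2txtLoader",
    ".json": "JSONLoader",
}

def pick_loader(path: str) -> str:
    """Return the loader name for a file, or raise for unsupported types."""
    ext = Path(path).suffix.lower()  # case-insensitive extension match
    try:
        return LOADER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}") from None

print(pick_loader("data/pdf_files/report.PDF"))  # → PyPDFLoader
```

Raising on unknown extensions (rather than silently skipping) is what makes the "graceful handling of unsupported files" above possible: the caller can catch, log, and continue the batch.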
```mermaid
graph TD
    A[📁 Document Sources] --> B[🔍 Document Loader]
    B --> C{📋 Format Detection}
    C --> D[📄 PDF Loader]
    C --> E[📝 Text Loader]
    C --> F[📊 CSV Loader]
    C --> G[📑 Word Loader]
    C --> H[🗂️ JSON Loader]
    C --> I[📈 Excel Loader]
    D --> J[📖 Raw Documents]
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J
    J --> K[✂️ Text Chunking]
    K --> L[🧠 Embedding Generation]
    L --> M[🗄️ FAISS Vector Store]
    N[❓ User Query] --> O[🧠 Query Embedding]
    O --> P[🔍 Similarity Search]
    M --> P
    P --> Q[📋 Retrieved Contexts]
    Q --> R[🤖 LLM Generation]
    R --> S[✨ Final Answer]

    style A fill:#e1f5fe
    style S fill:#e8f5e8
    style M fill:#fff3e0
    style R fill:#f3e5f5
```
```python
# Automatically detects and loads multiple formats
docs = load_all_documents("data/")
# Supports: PDF, TXT, CSV, Excel, Word, JSON

# Smart chunking with overlap for context preservation
chunks = splitter.split_documents(documents)
# Default: 1000 chars with 200 char overlap

# High-quality embeddings using Sentence Transformers
embeddings = model.embed_documents([chunk.page_content for chunk in chunks])
# Model: all-MiniLM-L6-v2 (384 dimensions)

# Efficient storage and retrieval with FAISS
vectorstore.add_documents(chunks)
# Includes metadata and persistent storage

# Query processing and similarity search
similar_docs = vectorstore.similarity_search(query, k=top_k)
# Returns most relevant document chunks

# Context-aware response generation
response = llm.invoke(prompt_with_context)
# Uses Groq LLM with retrieved context
```

```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Raw Documents  │───▶│  Text Chunking  │───▶│   Embeddings    │
│                 │     │                 │     │                 │
│ • PDFs          │     │ • Smart Split   │     │ • 384D Vectors  │
│ • Text Files    │     │ • Overlap       │     │ • Normalized    │
│ • Structured    │     │ • Metadata      │     │ • Semantic      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
        ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Error Handling  │     │ Chunk Metadata  │     │   FAISS Index   │
│                 │     │                 │     │                 │
│ • Format Check  │     │ • Source Info   │     │ • Persistence   │
│ • Validation    │     │ • Page Numbers  │     │ • Fast Search   │
│ • Logging       │     │ • Timestamps    │     │ • Scalability   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
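The retrieval half of the pipeline — a query embedding scored against stored, normalized vectors — reduces to a dot-product top-k. A dependency-free sketch of the idea (FAISS does the same thing at scale; the toy 3-D vectors are made up for illustration, real embeddings from all-MiniLM-L6-v2 are 384-D):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, vectors, k=2):
    """Return indices of the k vectors most similar to the query.
    For unit-length vectors, the dot product equals cosine similarity."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]

# Toy document "embeddings"
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.05, 0.0], docs, k=2))  # → [0, 1]
```

Normalizing once at index time is why the diagram labels the stored vectors "Normalized": it turns every later similarity query into a plain inner product.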
| Stage | Time Complexity | Memory Usage | Optimization Tips |
|---|---|---|---|
| Document Loading | O(n) files | Low | Batch processing for large collections |
| Text Chunking | O(m) content | Medium | Adjust chunk size based on content type |
| Embedding | O(k) chunks | High | Use lighter models for speed |
| Vector Storage | O(k log k) | Medium | FAISS handles large-scale efficiently |
| Search | O(log k) | Low | Increase top_k for better recall |
| Generation | O(context) | Medium | Optimize prompt length |
```text
langchain-rag/
├── src/                    # Core application modules
│   ├── __init__.py         # Package initialization
│   ├── data_loader.py      # Multi-format document loading
│   ├── embedding.py        # Text chunking and embedding generation
│   ├── vectorstore.py      # FAISS vector database management
│   └── search.py           # RAG search and summarization
├── notebook/               # Jupyter notebooks for development
│   ├── document.ipynb      # Basic document processing examples
│   └── pdf_loader.ipynb    # Advanced RAG pipeline with ChromaDB
├── data/                   # Document storage
│   ├── text_files/         # Sample text documents
│   ├── pdf_files/          # PDF documents
│   └── [other formats]/    # CSV, Excel, Word, JSON files
├── faiss_store/            # Persistent FAISS vector database
│   ├── faiss.index         # FAISS vector index
│   └── metadata.pkl        # Document metadata
├── app.py                  # Main application entry point
├── main.py                 # Alternative entry point
├── pyproject.toml          # Project configuration and dependencies
├── requirements.txt        # Package requirements
├── .env                    # Environment variables
└── README.md               # This file
```
- LangChain: Document processing and text splitting
- FAISS: Vector similarity search and indexing
- Sentence Transformers: State-of-the-art embeddings
- Groq: Fast LLM inference
- NumPy: Numerical operations for embeddings
- PyPDF/PyMuPDF: PDF document processing
- python-docx: Word document handling
- pandas: CSV and Excel file processing
- json: JSON document parsing
- UV: Modern Python package management
- Jupyter: Interactive development environment
- python-dotenv: Environment variable management
- Pydantic: Configuration and data validation
- Python 3.13+: Latest Python version
- UV Package Manager: For dependency management
- Groq API Key: For LLM functionality
- Git: Version control
```bash
git clone https://github.com/aayushjainx/LangChain-RAG.git
cd langchain-rag
```

```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# or via Homebrew
brew install uv
```

```bash
# Create virtual environment with Python 3.13
uv venv --python 3.13

# Install dependencies
uv sync
```

Create a `.env` file in the project root:

```bash
# Required: Groq API Key
GROQ_API_KEY=your_groq_api_key_here

# Optional: Other API keys (if you plan to extend the project)
OPENAI_API_KEY=your_openai_api_key_here
```

Get a Groq API key:
- Visit console.groq.com
- Sign up/Sign in
- Navigate to API Keys
- Create new API key
Add your documents to the data/ directory:
```text
data/
├── text_files/      # .txt files
├── pdf_files/       # .pdf files
├── csv_files/       # .csv files
├── excel_files/     # .xlsx files
├── word_files/      # .docx files
└── json_files/      # .json files
```

```bash
# Run the main application
uv run python app.py

# Or run with main.py
uv run python main.py
```

```bash
# Activate environment
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

# Run application
python app.py
```

```bash
# Start Jupyter
uv run jupyter notebook

# Open notebooks for interactive development
# - notebook/document.ipynb (basic examples)
# - notebook/pdf_loader.ipynb (advanced pipeline)
```

```python
from src.data_loader import load_all_documents

# Load all supported documents from the data directory
docs = load_all_documents("data")
print(f"Loaded {len(docs)} documents")
```

Supported formats:

- PDF (`.pdf`) - via `PyPDFLoader`
- Text (`.txt`) - via `TextLoader`
- CSV (`.csv`) - via `CSVLoader`
- Excel (`.xlsx`) - via `UnstructuredExcelLoader`
- Word (`.docx`) - via `Docx2txtLoader`
- JSON (`.json`) - via `JSONLoader`
```python
from src.embedding import EmbeddingPipeline

# Initialize embedding pipeline
emb_pipeline = EmbeddingPipeline(
    model_name="all-MiniLM-L6-v2",
    chunk_size=1000,
    chunk_overlap=200
)

# Process documents
chunks = emb_pipeline.chunk_documents(documents)
embeddings = emb_pipeline.embed_chunks(chunks)
```

```python
from src.vectorstore import FaissVectorStore

# Initialize FAISS vector store
vectorstore = FaissVectorStore(
    persist_dir="faiss_store",
    embedding_model="all-MiniLM-L6-v2"
)

# Build from documents (first time)
vectorstore.build_from_documents(documents)

# Load existing store
vectorstore.load()

# Query the store
results = vectorstore.query("What is machine learning?", top_k=5)
```

```python
from src.search import RAGSearch

# Initialize RAG system
rag = RAGSearch(
    persist_dir="faiss_store",
    embedding_model="all-MiniLM-L6-v2",
    llm_model="llama-3.1-8b-instant"
)

# Search and generate summary
query = "What is machine learning?"
summary = rag.search_and_summarize(query, top_k=3)
print("Summary:", summary)
```

```python
from src.data_loader import load_all_documents
from src.vectorstore import FaissVectorStore
from src.search import RAGSearch

# 1. Load documents
docs = load_all_documents("data")

# 2. Initialize and build vector store (first time only)
vectorstore = FaissVectorStore("faiss_store")
vectorstore.build_from_documents(docs)

# 3. Initialize RAG search
rag_search = RAGSearch()

# 4. Query the system
query = "What are the key features of Python?"
summary = rag_search.search_and_summarize(query, top_k=3)
print("Answer:", summary)
```

```python
# Custom configuration
rag = RAGSearch(
    persist_dir="custom_store",
    embedding_model="all-mpnet-base-v2",  # Different embedding model
    llm_model="llama-3.1-70b-versatile"   # Different LLM model
)

# Multi-query processing
queries = [
    "What is machine learning?",
    "How does Python handle data science?",
    "What are the applications of AI?"
]
for query in queries:
    result = rag.search_and_summarize(query, top_k=5)
    print(f"Q: {query}")
    print(f"A: {result}\n")
```

Available embedding models:

- `all-MiniLM-L6-v2` (default) - Fast, good quality
- `all-mpnet-base-v2` - Higher quality, slower
- `all-distilroberta-v1` - Balanced performance

Available Groq LLM models:

- `llama-3.1-8b-instant` (default) - Fast inference
- `llama-3.1-70b-versatile` - Higher quality
- `mixtral-8x7b-32768` - Alternative model
- `gemma2-9b-it` - Compact model

Chunking parameters:

- `chunk_size`: Size of text chunks (default: 1000)
- `chunk_overlap`: Overlap between chunks (default: 200)
- `separators`: Text splitting separators
```bash
# Add pytest for testing
uv add --dev pytest

# Run tests
uv run pytest
```

```bash
# Add development tools
uv add --dev black isort flake8

# Format code
uv run black src/
uv run isort src/

# Lint code
uv run flake8 src/
```

```bash
# Add new package
uv add package-name

# Add development dependency
uv add --dev package-name

# Update requirements
uv export --no-hashes > requirements.txt
```

The application provides detailed logging:

- `[INFO]` - General information and progress
- `[DEBUG]` - Detailed debugging information
- `[ERROR]` - Error messages and stack traces
- ModuleNotFoundError: Use `uv run python app.py` instead of `python app.py`
- API Key Issues: Ensure the `.env` file contains a valid `GROQ_API_KEY`
- Memory Issues: Reduce `chunk_size` or process fewer documents
- Empty Results: Check that documents are in the correct `data/` subdirectories
For large document collections:

- Use smaller chunk sizes (500-800)
- Consider `all-MiniLM-L6-v2` for faster embeddings
- Implement batch processing for very large collections

For higher answer quality:

- Use `all-mpnet-base-v2` embeddings
- Increase chunk overlap (300-400)
- Use `llama-3.1-70b-versatile` for generation
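These knobs translate directly into embedding workload: each new chunk advances by `chunk_size - chunk_overlap` characters, so halving the chunk size roughly doubles the number of vectors to compute and store. A back-of-the-envelope helper (`estimated_chunks` is a hypothetical illustration, not part of the project):

```python
import math

def estimated_chunks(doc_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count: each chunk past the first advances
    chunk_size - overlap characters through the document."""
    step = chunk_size - overlap
    return max(1, math.ceil((doc_chars - overlap) / step))

# A 100k-character document under the default and a "speed" configuration
print(estimated_chunks(100_000, 1000, 200))  # → 125 (default 1000/200)
print(estimated_chunks(100_000, 500, 100))   # → 250 (smaller chunks, more embeddings)
```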
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain - Document processing framework
- FAISS - Efficient similarity search
- Sentence Transformers - High-quality embeddings
- Groq - Fast LLM inference
- UV - Modern Python package management
- Documentation: Check component docstrings and code comments
- Issues: Open GitHub issues for bugs or feature requests
- API Documentation:
🎉 Happy coding! Build amazing RAG applications with this production-ready pipeline!