
TechCorp PolicyCopilot - Complete RAG Pipeline

An end-to-end Retrieval-Augmented Generation (RAG) system for document-based question answering. It answers questions about company policies by retrieving relevant passages from a knowledge base and generating contextual, grounded responses.

🎯 Overview

This project showcases a complete RAG pipeline that:

  • Loads and chunks documents for efficient retrieval
  • Generates embeddings using sentence transformers
  • Stores documents in a persistent vector database (ChromaDB)
  • Performs semantic search to find relevant context
  • Augments prompts with retrieved context
  • Generates accurate, context-aware responses

🏗️ Architecture

The RAG pipeline consists of six main components:

```
┌─────────────────┐
│  User Query     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────────┐
│  Query Encoding │────▶│  Vector Search   │
└─────────────────┘     └────────┬─────────┘
                                 │
                                 ▼
┌─────────────────┐     ┌──────────────────┐
│  Context        │◀────│  Document Chunks │
│  Augmentation   │     └──────────────────┘
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Response       │
│  Generation     │
└─────────────────┘
```

Components

  1. Document Loader (src/document/loader.py)

    • Loads documents from various sources
    • Splits documents into chunks using recursive character text splitting
    • Configurable chunk size and overlap
  2. Embedding Model (src/embeddings/model.py)

    • Uses SentenceTransformer models for generating embeddings
    • Default model: all-MiniLM-L6-v2 (384 dimensions)
    • Encodes both documents and queries into vector space
  3. Vector Store (src/vector/store.py)

    • Persistent ChromaDB storage
    • Cosine similarity for retrieval
    • Automatic deduplication and indexing
  4. Query Processor (src/query/processor.py)

    • Encodes user queries into embeddings
    • Performs similarity search in vector database
    • Returns top-k most relevant document chunks
  5. Response Generator (src/response/generator.py)

    • Augments prompts with retrieved context
    • Generates context-aware responses
    • Currently uses a simulated LLM (ready for OpenAI/Anthropic integration)
  6. Main Pipeline (src/main.py)

    • Orchestrates all components
    • Handles initialization and query processing
    • Manages persistent state
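
The recursive character splitting that the Document Loader relies on can be sketched in a few lines of plain Python. This is a simplified illustration only (it omits the project's CHUNK_OVERLAP handling); the actual implementation lives in src/document/loader.py:

```python
def split_recursive(text, chunk_size=200, separators=("\n\n", "\n", " ", "")):
    """Sketch of recursive character splitting: try the coarsest separator
    first, then recurse with finer separators on oversized pieces.
    Assumes the last separator is "" so a hard cut is always available."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(split_recursive(piece, chunk_size, tuple(rest)))
    return [c for c in chunks if c]
```

The key idea is that paragraph boundaries are preferred over line breaks, line breaks over spaces, and a hard character cut is used only when nothing else fits.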

📋 Prerequisites

  • Python 3.8 or higher
  • pip package manager

🚀 Installation

  1. Clone the repository (if applicable):

    git clone <repository-url>
    cd complete-rag
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download spaCy language model (required for text processing):

    python -m spacy download en_core_web_sm

📖 Usage

Running the Demo

The easiest way to get started is to run the demo script:

```bash
python run_demo.py
```

This will:

  1. Initialize the RAG system (loads documents, creates embeddings, sets up vector DB)
  2. Process several example queries
  3. Display the complete pipeline workflow for each query

Programmatic Usage

You can also use the RAG system programmatically:

```python
from src.main import initialize_rag_system, process_query

# Initialize the system (one-time setup)
collection, model = initialize_rag_system()

# Process queries
response = process_query(
    "What's the reimbursement policy for home office equipment?",
    collection,
    model
)
print(response)
```

Custom Queries

To process your own queries, modify the test_queries list in run_demo.py or call process_query() directly with your question.

📁 Project Structure

```
complete-rag/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── run_demo.py              # Demo entry point
├── chroma_db/               # Persistent vector database (auto-generated)
│   └── ...
├── src/
│   ├── __init__.py
│   ├── main.py              # Main pipeline orchestration
│   ├── config.py            # Configuration constants
│   ├── document/
│   │   ├── __init__.py
│   │   └── loader.py        # Document loading and chunking
│   ├── embeddings/
│   │   ├── __init__.py
│   │   └── model.py         # Embedding model management
│   ├── vector/
│   │   ├── __init__.py
│   │   └── store.py         # Vector database operations
│   ├── query/
│   │   ├── __init__.py
│   │   └── processor.py     # Query processing and search
│   └── response/
│       ├── __init__.py
│       └── generator.py     # Response generation
└── techcorp-docs/           # Sample documents (optional)
    ├── customer-faqs/
    ├── employee-handbook/
    ├── meeting-notes/
    └── product-specs/
```

⚙️ Configuration

All configuration is centralized in src/config.py:

```python
# Vector Database
DEFAULT_PERSIST_DIRECTORY = "./chroma_db"
COLLECTION_NAME = "techcorp_policies"
SIMILARITY_METRIC = "cosine"

# Embedding Model
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

# Chunking
CHUNK_SIZE = 200
CHUNK_OVERLAP = 50
CHUNK_SEPARATORS = ["\n\n", "\n", " ", ""]

# Search
DEFAULT_TOP_K = 3
```

Customization

  • Chunk Size: Adjust CHUNK_SIZE to control document chunk granularity
  • Top-K Results: Change DEFAULT_TOP_K to retrieve more/fewer context chunks
  • Embedding Model: Switch to a different SentenceTransformer model for better accuracy
  • Similarity Metric: Use "l2" or "ip" instead of "cosine" if needed

🔄 How It Works

Initialization Phase (One-Time)

  1. Document Loading: Loads sample policy documents
  2. Chunking: Splits documents into smaller, manageable chunks (200 chars with 50 char overlap)
  3. Embedding Generation: Converts each chunk into a vector embedding
  4. Vector Storage: Stores embeddings in ChromaDB with metadata
  5. Model Loading: Loads the SentenceTransformer model for query encoding
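
The initialization steps above can be sketched with a toy embedder standing in for the SentenceTransformer. The hash-bucket `toy_embed` below is purely illustrative, not the model the project uses; it only shows the shape of the embed-then-store flow:

```python
import math
import re

def toy_embed(text, dim=8):
    # Hypothetical stand-in for a real embedding model: hash each token
    # into a fixed-size bucket vector, then L2-normalize the result.
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(chunks):
    # Mirrors the init phase: embed every chunk once and keep (text, vector)
    # pairs; in the real pipeline these go into ChromaDB with metadata.
    return [(chunk, toy_embed(chunk)) for chunk in chunks]
```

In the actual project this step is performed by the SentenceTransformer model and ChromaDB rather than an in-memory list.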

Query Processing Phase (Per Query)

  1. Query Encoding: Converts the user query into an embedding vector
  2. Vector Search: Finds top-k most similar document chunks using cosine similarity
  3. Context Augmentation: Assembles retrieved chunks into context for the LLM
  4. Response Generation: Generates a response based on the augmented prompt
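
The retrieval and augmentation steps can be sketched in plain Python. The `top_k` and `augment_prompt` helpers below are illustrative stand-ins for the project's query processor and prompt template, which may differ in detail:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, index, k=3):
    # Step 2: rank stored (chunk, vector) pairs by similarity to the query.
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def augment_prompt(query, chunks):
    # Step 3: stitch the retrieved chunks into a grounded prompt for the LLM.
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

In the real pipeline ChromaDB performs the similarity search internally, so the manual `cosine`/`top_k` ranking here is only to make the mechanics visible.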

Persistence

The vector database is persisted to disk (./chroma_db/), so:

  • Initial setup only runs once
  • Subsequent runs reuse existing embeddings
  • No need to re-index documents unless they change

🧪 Example Queries

The demo includes these example queries:

  1. "What's the reimbursement policy for home office equipment?"
  2. "Can I get money back for buying a desk?"
  3. "How much can I claim for my home office?"
  4. "What's the travel expense policy?"
  5. "How many vacation days do I get?"

🔧 Dependencies

Key dependencies include:

  • chromadb: Vector database for storing and searching embeddings
  • sentence-transformers: Embedding model library
  • langchain: Text splitting utilities
  • scikit-learn: Machine learning utilities
  • spacy: Natural language processing

See requirements.txt for the complete list with versions.

🚧 Future Enhancements

Potential improvements for production use:

  • Real LLM Integration: Replace simulated response with OpenAI/Anthropic API calls
  • Document Loaders: Support for PDF, Word, and other document formats
  • Advanced Chunking: Implement semantic chunking strategies
  • Hybrid Search: Combine vector search with keyword/BM25 search
  • Query Expansion: Improve query understanding with query rewriting
  • Response Citations: Add source citations to generated responses
  • Web Interface: Build a web UI for easier interaction
  • Batch Processing: Support for processing multiple queries efficiently
  • Evaluation Metrics: Add RAG evaluation metrics (retrieval accuracy, response quality)

📝 Notes

  • The current implementation uses a simulated LLM for response generation. To use a real LLM, modify src/response/generator.py to call your preferred API (OpenAI, Anthropic, etc.).
  • The vector database persists between runs, so you only need to re-index when documents change.
  • Sample documents are currently hardcoded in src/document/loader.py. You can extend this to load from files or external sources.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

[Specify your license here]

🙏 Acknowledgments


TechCorp PolicyCopilot - Making policy information accessible through AI-powered search and retrieval.
