PolicyRAG is a Retrieval-Augmented Generation (RAG) system designed for institutional policy document management and question answering. It combines vector search with local language-model generation to provide accurate, contextual answers to policy-related queries without relying on external APIs.
PolicyRAG follows a microservices architecture with five main components working together to provide intelligent policy retrieval and question answering:
- Document Collection Service: Automated web scraping from PowerDMS
- Processing Pipeline: Text extraction, preprocessing, and embedding generation
- Vector Database: Elasticsearch cluster for semantic search
- LLM Service: Local Ollama instance serving Gemma2:2b model
- Web Interface: Real-time Flask application with chat interface
Step 1: Web Scraping
- The system begins by accessing the PowerDMS portal using Selenium WebDriver
- ChromeDriver navigates through the document tree structure automatically
- PDF documents are identified and downloaded to the local storage directory
- A 5-minute timeout ensures the scraping process doesn't run indefinitely
Step 2: Text Extraction
- PyPDF2 library processes each downloaded PDF file
- Text content is extracted page by page and combined into a single document
- The system handles various PDF formats and encoding issues
- Extracted text undergoes initial validation for content quality
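The extraction step above can be sketched as follows. `extract_pdf_text` is a hypothetical helper name (not from the project); it is written duck-typed so it works with any PyPDF2-style reader exposing `.pages` whose items implement `extract_text()`:

```python
def extract_pdf_text(reader):
    """Concatenate text from every page of a PyPDF2-style reader.

    Pages yielding no text (e.g. scanned images) are skipped so one
    bad page does not break the combined document.
    """
    pages = []
    for page in reader.pages:
        text = page.extract_text()
        if text and text.strip():
            pages.append(text.strip())
    return "\n".join(pages)


# Tiny stand-in for PyPDF2's PdfReader, used only for demonstration;
# with the real library this would be extract_pdf_text(PdfReader("policy.pdf")).
class _FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class _FakeReader:
    pages = [_FakePage("Page one."), _FakePage(""), _FakePage("Page two.")]


combined = extract_pdf_text(_FakeReader())
```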
Step 3: Text Preprocessing
- NLTK performs stopword removal using English language corpus
- WordNet lemmatization reduces words to their root forms
- Regular expressions clean special characters while preserving important punctuation
- Text is chunked into manageable segments for embedding generation
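The cleaning and chunking steps can be illustrated with a minimal stdlib-only sketch (the real pipeline also applies NLTK stopword removal and lemmatization, omitted here; function names and the 200-word window are illustrative, not the project's actual parameters):

```python
import re


def clean_text(text):
    """Strip special characters while keeping sentence punctuation."""
    text = re.sub(r"[^\w\s.,;:?!()'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text, max_words=200, overlap=20):
    """Split text into overlapping fixed-size word windows for embedding."""
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


cleaned = clean_text("Policy   §12  applies  to  all  staff!")
chunks = chunk_text(" ".join(str(i) for i in range(450)))
```

The overlap between adjacent chunks keeps sentences that straddle a boundary retrievable from at least one chunk.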
Step 1: Tokenization
- BAAI/bge-large-en-v1.5 tokenizer converts text into tokens
- Maximum sequence length is limited to 512 tokens per chunk
- Padding and truncation ensure consistent input dimensions
- Token attention masks are generated for proper model processing
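The padding, truncation, and attention-mask mechanics above can be shown with toy integer token IDs (in the real system the Hugging Face tokenizer does this internally via `truncation=True, padding="max_length"`):

```python
def pad_or_truncate(token_ids, max_len=512, pad_id=0):
    """Return (ids, attention_mask), each exactly max_len long.

    Real tokens get mask 1; padding positions get mask 0 so the model's
    attention ignores them.
    """
    ids = token_ids[:max_len]          # truncate overlong sequences
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad


short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(600)), max_len=512)
```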
Step 2: Embedding Creation
- The BGE model processes tokenized text through transformer layers
- CLS-token pooling extracts a sentence-level representation from the final hidden states (BGE models use the first [CLS] token rather than the last token)
- Each text chunk is converted into a 1024-dimensional dense vector
- Embeddings capture semantic meaning and context relationships
Step 3: Quality Validation
- System validates that each embedding has the expected 1024 dimensions
- Numerical checks ensure all vector components are valid floating-point numbers
- Duplicate detection prevents redundant document indexing
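The three validation checks above can be sketched as small pure-Python helpers (function names and the hash-based duplicate check are illustrative, not necessarily the project's implementation):

```python
import hashlib
import math

EXPECTED_DIM = 1024


def validate_embedding(vector, dim=EXPECTED_DIM):
    """True iff the vector has the right dimensionality and finite floats."""
    return len(vector) == dim and all(
        isinstance(x, float) and math.isfinite(x) for x in vector
    )


def content_fingerprint(text):
    """Stable hash of normalized text, used to skip duplicate documents."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()


ok = validate_embedding([0.0] * 1024)
bad = validate_embedding([float("nan")] * 1024)   # NaN fails the finiteness check
dup = content_fingerprint("Leave Policy ") == content_fingerprint("leave policy")
```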
Step 1: Index Management
- Elasticsearch creates a "policy" index with proper vector field mappings
- HNSW (Hierarchical Navigable Small World) algorithm enables fast approximate nearest neighbor search
- Cosine similarity is configured as the primary distance metric
- Index settings optimize for both search speed and accuracy
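A mapping along these lines would configure the "policy" index for cosine-similarity HNSW search in Elasticsearch 8.x; the field names (`doc_id`, `pdf_path`, `text`, `embedding`) are assumptions for illustration:

```python
# Hypothetical mapping for the "policy" index; setting "index": True on a
# dense_vector field enables HNSW-backed approximate nearest-neighbor search.
policy_mapping = {
    "mappings": {
        "properties": {
            "doc_id":   {"type": "keyword"},
            "pdf_path": {"type": "keyword"},
            "text":     {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

# With the elasticsearch-py 8.x client this would be applied roughly as:
#   es.indices.create(index="policy", mappings=policy_mapping["mappings"])
```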
Step 2: Document Indexing
- Bulk indexing operations efficiently store multiple documents simultaneously
- Each document contains metadata (ID, PDF path), original text, and embedding vector
- Elasticsearch automatically creates inverted indices for text search capabilities
- Document versioning tracks updates and modifications
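The bulk-indexing step can be sketched as a generator of actions in the shape consumed by `elasticsearch.helpers.bulk` (document field names are illustrative):

```python
def bulk_actions(docs, index="policy"):
    """Yield bulk-helper actions for a batch of processed documents.

    Each doc is assumed to be a dict with "id", "pdf_path", "text",
    and "embedding" keys.
    """
    for doc in docs:
        yield {
            "_index": index,
            "_id": doc["id"],
            "_source": {
                "pdf_path": doc["pdf_path"],
                "text": doc["text"],
                "embedding": doc["embedding"],
            },
        }


actions = list(bulk_actions([
    {"id": "pol-1", "pdf_path": "/data/pol-1.pdf",
     "text": "Sample policy text", "embedding": [0.0] * 1024},
]))
```

With a live client, `helpers.bulk(es, bulk_actions(docs))` would stream these actions to the cluster in one round trip per batch.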
Step 1: Query Reception
- User submits a natural language question through the web chat interface
- SocketIO establishes real-time bidirectional communication
- Query preprocessing applies the same text cleaning pipeline used for documents
Step 2: Query Embedding
- User query undergoes identical tokenization and embedding generation process
- The same BGE model creates a 1024-dimensional query vector
- Query vector represents the semantic intent and meaning of the user's question
Step 3: Similarity Search
- Elasticsearch performs vector similarity search using cosine distance
- The system retrieves top-k most relevant documents (typically 5-10)
- Hybrid search combines vector similarity with traditional keyword matching
- Results are ranked by relevance scores for optimal context selection
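A hybrid request combining kNN vector search with a keyword clause can be expressed as a single Elasticsearch 8.x search body; the field names here are assumptions:

```python
def hybrid_query(query_text, query_vector, k=5):
    """Build a search body pairing kNN retrieval with a text match.

    Elasticsearch 8.x combines the scores of the top-level "knn" section
    and the "query" section when both are present.
    """
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,   # wider candidate pool improves recall
        },
        "query": {"match": {"text": query_text}},
        "size": k,
    }


body = hybrid_query("remote work policy", [0.0] * 1024, k=5)
```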
Step 4: Context Assembly
- Retrieved documents are processed through cumulative relevance scoring
- System concatenates document texts until relevance threshold is reached (typically cumulative score > 4.5)
- Context window management ensures LLM input stays within token limits
- Document sources are preserved for transparency and citation
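The cumulative-score assembly described above can be sketched as follows; the 4.5 threshold mirrors the value stated in the text, while `max_chars` is a crude hypothetical stand-in for token-budget management:

```python
def assemble_context(hits, score_threshold=4.5, max_chars=6000):
    """Concatenate hit texts until cumulative relevance passes the threshold.

    `hits` are (score, text, source) tuples sorted by descending score;
    sources are kept alongside the context for citation.
    """
    context, sources, total = [], [], 0.0
    for score, text, source in hits:
        if total >= score_threshold or sum(map(len, context)) >= max_chars:
            break
        context.append(text)
        sources.append(source)
        total += score
    return "\n---\n".join(context), sources


ctx, srcs = assemble_context([
    (2.5, "Doc A text", "a.pdf"),
    (1.8, "Doc B text", "b.pdf"),
    (1.1, "Doc C text", "c.pdf"),
    (0.4, "Doc D text", "d.pdf"),
])
```

In this example the first three hits push the cumulative score past 4.5, so the fourth document is never included.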
Step 1: Prompt Construction
- System creates a structured prompt combining user query with retrieved context
- Prompt engineering includes role definition and instruction formatting
- Context documents are clearly delineated with separators
- Instructions guide the LLM to focus on policy-specific information
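A prompt along these lines would combine the role definition, delineated context, and grounding instruction described above (the exact template wording is illustrative, not the project's actual prompt):

```python
def build_prompt(question, context_docs):
    """Assemble a structured RAG prompt from a question and context docs."""
    separator = "\n----\n"
    context = separator.join(context_docs)
    return (
        "You are an assistant for institutional policy questions. "
        "Answer using ONLY the policy excerpts below; if the answer is "
        "not in them, say so.\n\n"
        f"Policy excerpts:{separator}{context}{separator}\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt("How much sick leave do employees get?",
                      ["Excerpt one.", "Excerpt two."])
```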
Step 2: LLM Processing
- Ollama serves the local Gemma2:2b model for response generation
- Model processes the combined prompt and context through transformer layers
- Local deployment ensures data privacy and eliminates external API dependencies
- Generation parameters control response length and creativity
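A call to Ollama's `/api/generate` endpoint could look like the sketch below; the temperature and `num_predict` values are assumptions standing in for the generation parameters mentioned above (`generate` is defined but not executed here, since it needs a running Ollama instance):

```python
import json
from urllib import request


def build_generate_payload(prompt, model="gemma2:2b",
                           temperature=0.2, max_tokens=512):
    """Payload for Ollama's /api/generate endpoint.

    num_predict caps response length; temperature controls creativity.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }


def generate(prompt, host="http://localhost:11434"):
    """POST the prompt to a local Ollama server and return its response text."""
    req = request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


payload = build_generate_payload("What is the leave policy?")
```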
Step 3: Response Delivery
- Generated response is validated for coherence and relevance
- SocketIO streams the response back to the user interface in real-time
- Response time metrics are captured for performance monitoring
- Complete query-response cycle is logged for audit and improvement
Kubernetes Pod Communication
- All services communicate through internal Kubernetes DNS resolution
- Service discovery enables dynamic endpoint resolution across pods
- Network policies ensure secure inter-service communication
- Load balancing distributes requests across multiple replicas when scaled
Secret Management Flow
- HashiCorp Vault stores all sensitive credentials and configuration
- Vault Agent Injector automatically mounts secrets as files in pods
- Application startup scripts source these secret files as environment variables
- Secret rotation occurs transparently without application restarts
Document Storage
- Persistent Volumes store downloaded PDF documents across pod restarts
- Document scraper jobs populate shared storage accessible by processing pods
- Volume claims ensure data durability and availability
Vector Database Persistence
- Elasticsearch uses StatefulSets with persistent storage for data durability
- Index data persists across cluster restarts and node failures
- Snapshot and restore capabilities enable backup and disaster recovery
Model Persistence
- Ollama model files are stored in persistent volumes
- Model initialization jobs download required models once per deployment
- Shared model storage enables multiple LLM service replicas
- Individual microservices can scale independently based on load
- Elasticsearch cluster can expand with additional data nodes
- Multiple application replicas handle increased user traffic
- Load balancers distribute requests evenly across service instances
- Query result caching reduces redundant embedding computations
- LRU (Least Recently Used) caching for frequently accessed documents
- Redis integration enables distributed caching across multiple application instances
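For a single process, query-embedding caching can be as simple as `functools.lru_cache` (the Redis layer mentioned above would replace this for multi-instance deployments; the embedding body here is a toy placeholder for the real BGE call):

```python
from functools import lru_cache

CALLS = {"count": 0}


@lru_cache(maxsize=1024)
def cached_query_embedding(query):
    """Return the embedding for a query, computing it at most once.

    lru_cache keys on the query string and evicts least-recently-used
    entries once maxsize is reached.
    """
    CALLS["count"] += 1                         # track real computations
    return tuple(float(ord(c)) for c in query)  # toy stand-in vector


cached_query_embedding("leave policy")
cached_query_embedding("leave policy")  # second call is a cache hit
hits = cached_query_embedding.cache_info().hits
```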
- Document processing occurs in configurable batch sizes
- Bulk Elasticsearch operations improve indexing throughput
- Parallel processing utilizes multi-core systems efficiently
This comprehensive data flow ensures PolicyRAG delivers accurate, contextual responses while maintaining high performance, security, and scalability for institutional policy management needs.
- Python 3.12: Core application development
- Flask + SocketIO: Real-time web application framework
- Elasticsearch 8.12: Vector database and search engine
- PyTorch: Machine learning framework for embeddings
- Transformers: Hugging Face library for NLP models
- Ollama: Local LLM serving with Gemma2:2b model
- Docker: Containerization with multi-stage builds
- Kubernetes: Container orchestration with StatefulSets and Jobs
- HashiCorp Vault: Secret management and credential rotation
- Minikube: Local Kubernetes development environment
- Selenium WebDriver: Automated document collection
- PyPDF2: PDF text extraction and processing
- NLTK: Natural language preprocessing (stopwords, lemmatization)
- NumPy: Numerical operations for vector computations
- HTML5/CSS3: Modern web interface
- Bootstrap 5: Responsive UI framework
- JavaScript/Socket.IO: Real-time bidirectional communication
- 24/7 Availability: No human intervention required for basic policy queries
- Consistent Responses: Eliminates variations in policy interpretation
- Cost Effective: Reduces helpdesk burden and manual policy research
- Audit Trail: Complete logging of all queries and responses
- Privacy Preserving: Local deployment keeps sensitive data on-premises
# Required software
- Python 3.12+
- Docker Desktop
- Git
- Chrome Browser (for web scraping)
# Optional but recommended
- Minikube (for local Kubernetes testing)
- kubectl
- Helm 3.x
Clone the Repository
git clone https://github.com/Mik-27/PolicyRAG.git
cd PolicyRAG
Create Virtual Environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/Mac
source .venv/bin/activate
Install Dependencies
pip install -r requirements.txt
Environment Configuration
Create a .env file (for local development only):
# Elasticsearch Configuration
ELASTIC_PASSWORD=your_elastic_password
ELASTIC_CLOUD_ID=your_cloud_id  # For Elastic Cloud
ELASTIC_API_KEY=your_api_key    # For Elastic Cloud
# Hugging Face (for model downloads)
HF_ACCESS_TOKEN=your_hf_token
# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
Download Required NLTK Data
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
Start Elasticsearch
# Using Docker
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.2
Start Ollama
# Install Ollama locally or use Docker
docker run -d \
  --name ollama \
  -p 11434:11434 \
  ollama/ollama:latest
# Pull the Gemma model
docker exec ollama ollama pull gemma2:2b
Run the Application
python app.py
Access the Interface
Navigate to http://localhost:5000/chat
# Install Minikube for local development
minikube start --cpus=4 --memory=8192MB --driver=docker
# Install Helm for package management
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
# Create Vault namespace
kubectl create namespace vault
# Deploy Vault in development mode
helm install vault hashicorp/vault \
--namespace vault \
--set "server.dev.enabled=true" \
--set "injector.enabled=true"
# Configure Vault for Kubernetes authentication
kubectl exec -it vault-0 -n vault -- /bin/sh -c "
vault login root
vault auth enable kubernetes
vault write auth/kubernetes/config \
kubernetes_host=\"https://kubernetes.default.svc\" \
token_reviewer_jwt=\"\$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" \
kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
issuer=\"https://kubernetes.default.svc.cluster.local\"
"
# Create secrets and policies
kubectl exec -it vault-0 -n vault -- /bin/sh -c "
vault secrets enable -path=policyrag kv-v2
vault kv put policyrag/config \
ELASTIC_PASSWORD=\"your_password\" \
HF_ACCESS_TOKEN=\"your_token\" \
ELASTIC_API_KEY=\"your_api_key\" \
ELASTIC_CLOUD_ID=\"your_cloud_id\"
vault policy write policyrag - <<EOF
path \"policyrag/data/*\" {
capabilities = [\"read\"]
}
EOF
vault write auth/kubernetes/role/policyrag \
bound_service_account_names=policyrag \
bound_service_account_namespaces=default \
policies=policyrag \
ttl=24h
"
# Build and load the application image
docker build -t policyrag:latest .
minikube image load policyrag:latest
# Deploy all components
kubectl apply -f k8s/sa.yaml # Service accounts
kubectl apply -f k8s/configmap.yaml # Configuration
kubectl apply -f k8s/elasticsearch.yaml # Vector database
kubectl apply -f k8s/ollama.yaml # LLM serving
kubectl apply -f k8s/policyrag-pvc-service.yaml # Storage and networking
kubectl apply -f k8s/model-init-job.yaml # Model initialization
kubectl apply -f k8s/scrape-vectorize-job.yaml # Document processing
kubectl apply -f k8s/policyrag-deployment.yaml # Main application
kubectl apply -f k8s/policyrag-ingress.yaml # External access
# Monitor deployment
kubectl get pods -w
kubectl logs -f deployment/policyrag
# Port forwarding for development
kubectl port-forward svc/policyrag 5000:5000
# Or use Minikube service
minikube service policyrag --url
# For production, configure ingress with proper DNS
The application uses HashiCorp Vault for secure secret management:
# Vault agent injector annotations
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "policyrag"
vault.hashicorp.com/agent-inject-secret-config: "policyrag/data/config"
vault.hashicorp.com/agent-inject-template-config: |
{{- with secret "policyrag/data/config" -}}
export ELASTIC_PASSWORD="{{ .Data.data.ELASTIC_PASSWORD }}"
export HF_ACCESS_TOKEN="{{ .Data.data.HF_ACCESS_TOKEN }}"
export ELASTIC_API_KEY="{{ .Data.data.ELASTIC_API_KEY }}"
export ELASTIC_CLOUD_ID="{{ .Data.data.ELASTIC_CLOUD_ID }}"
{{- end -}}
# Development
ELASTIC_HOST=http://localhost:9200
OLLAMA_HOST=http://localhost:11434
# Docker Compose
ELASTIC_HOST=http://elasticsearch:9200
OLLAMA_HOST=http://ollama:11434
# Kubernetes
ELASTIC_HOST=http://elasticsearch:9200
OLLAMA_HOST=http://ollama:11434
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add unit tests for new functionality
- Update documentation for API changes
- Ensure Docker builds pass
- Test Kubernetes deployments
- Elasticsearch for providing robust search and analytics capabilities
- Hugging Face for the transformers library and pre-trained models
- Ollama for simplified local LLM deployment
- HashiCorp for Vault secret management solutions
- BAAI for the BGE embedding models
Built with ❤️ for institutional policy management and AI-powered knowledge retrieval