Enterprise-grade Kubernetes-native Retrieval Augmented Generation (RAG) platform for deploying scalable AI solutions on K8s clusters.
KubeRAG provides a comprehensive, production-ready solution for deploying RAG applications on Kubernetes with support for multiple LLM providers and vector databases. It features automatic scaling, monitoring, and seamless integration with existing Kubernetes infrastructure.
- Features
- Architecture
- Prerequisites
- Quick Start
- Project Structure
- Core Components
- API Documentation
- Configuration
- Deployment Options
- Advanced Usage
- Contributing
- License
- Multi-LLM Support: Seamlessly integrate with Azure OpenAI, OpenAI, Anthropic, Google Gemini, and Ollama
- Vector Store Flexibility: Choose from 8 different vector databases including Qdrant, MongoDB, ChromaDB, FAISS, PostgreSQL, Elasticsearch, Neo4j, and LanceDB
- Production Ready: Built for enterprise deployments with high availability, auto-scaling, and comprehensive monitoring
- Document Processing: Support for PDF, DOCX, Markdown, CSV, and plain text with intelligent chunking
- Kubernetes Native: Designed specifically for K8s with proper resource management and service discovery
- RESTful APIs: Well-documented REST endpoints for easy integration
- Embedding Models: Flexible embedding model support with automatic model downloading
- Horizontal pod autoscaling
- Persistent volume support
- ConfigMap and Secret management
- Ingress controller support
- Health checks and readiness probes
- Service mesh compatibility
- Multi-replica deployments
- Resource quotas and limits
KubeRAG follows a microservices architecture with three main components:
┌─────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Ingress │────▶│ Agent │───▶│ Pipeline │ │
│ │ Controller │ │ Service │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ LLM │ │ Embedding │ │
│ │ Providers │ │ Models │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────┐ │
│ │ Vector Store │ │
│ │ (Qdrant/MongoDB/etc) │ │
│ └────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
- Agent Service: Handles chat interactions, query processing, and response generation
- Pipeline Service: Manages document ingestion, text extraction, chunking, and embedding
- Vector Store: Stores and retrieves document embeddings for similarity search
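In practice the split looks like this from a client's point of view: documents go to the Pipeline service for extraction, chunking, and indexing, and questions go to the Agent service, which embeds the query, retrieves matching chunks from the vector store, and asks the configured LLM for a grounded answer. The sketch below illustrates that flow; the service URLs and endpoint paths are hypothetical placeholders, so substitute the routes from the API Documentation section and your own Service or Ingress names.

```python
# End-to-end flow sketch. URLs and paths are placeholders, not documented routes.
import requests

PIPELINE_URL = "http://kuberag-pipeline:8000"  # hypothetical in-cluster Service name
AGENT_URL = "http://kuberag-agent:8000"        # hypothetical in-cluster Service name

# 1) Index a document via the Pipeline service.
ingest = requests.post(
    f"{PIPELINE_URL}/process/text",            # hypothetical path
    json={"text": "KubeRAG is a Kubernetes-native RAG platform...",
          "metadata": {"source": "manual"},
          "chunk": True},
    timeout=60,
)
ingest.raise_for_status()

# 2) Ask a question via the Agent service; the answer cites the retrieved chunks.
answer = requests.post(
    f"{AGENT_URL}/chat",                       # hypothetical path
    json={"message": "What is KubeRAG?"},
    timeout=60,
).json()
print(answer["response"])
for source in answer["sources"]:
    print(f"- {source['id']} (score: {source['score']})")
```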
- Kubernetes cluster (v1.19+)
- Helm 3.x
- kubectl configured
- Docker (for building custom images)
- Minimum 4GB RAM per node
- Storage class for persistent volumes (if using FAISS/LanceDB)
git clone https://github.com/yourusername/kuberag.git
cd kuberag
# Build Pipeline Service
cd orchestrator/data_pipeline
docker buildx build . --platform=linux/amd64,linux/arm64 -t your-registry/kuberag-pipeline:v1.0.0
docker push your-registry/kuberag-pipeline:v1.0.0
# Build Agent Service
cd ../Agent
docker buildx build . --platform=linux/amd64,linux/arm64 -t your-registry/kuberag-agent:v1.0.0
docker push your-registry/kuberag-agent:v1.0.0
Create a custom values.yaml file:
images:
agent:
repository: your-registry/kuberag-agent
tag: v1.0.0
pipeline:
repository: your-registry/kuberag-pipeline
tag: v1.0.0
llm:
provider: "openai"
openai:
apiKey: "your-api-key"
vectorStore:
type: "qdrant"
qdrant:
deploy: true
Install the chart:
helm install kuberag ./KubeRag -f values.yaml
Verify the deployment:
kubectl get pods -l app.kubernetes.io/instance=kuberag
kubectl get svc -l app.kubernetes.io/instance=kuberag
KubeRag/
├── orchestrator/
│ ├── Agent/
│ │ ├── app.py # FastAPI chat service
│ │ ├── agent.py # Core RAG logic
│ │ ├── llm_providers.py # LLM integrations
│ │ └── Dockerfile
│ ├── data_pipeline/
│ │ ├── universal_pipeline.py # Document processing
│ │ ├── vector_store.py # Vector store interface
│ │ ├── download_model.py # Embedding model manager
│ │ └── Dockerfile
│ ├── vector_stores/
│ │ ├── base.py # Abstract base class
│ │ ├── qdrant_store.py # Qdrant implementation
│ │ ├── mongodb_store.py # MongoDB implementation
│ │ ├── chroma_store.py # ChromaDB implementation
│ │ ├── faiss_store.py # FAISS implementation
│ │ └── ... # Other implementations
│ └── config/
│ ├── config.py # Configuration management
│ └── vector_store_config.py
├── KubeRag/ # Helm chart
│ ├── Chart.yaml
│ ├── values.yaml
│ └── templates/
│ ├── agent-deployment.yaml
│ ├── agent-service.yaml
│ ├── data-pipeline-deployment.yaml
│ ├── configmap.yaml
│ ├── secret.yaml
│ └── ...
└── tests/
├── test_e2e.py
└── conftest.py
The Agent service provides the chat interface and RAG functionality:
- FastAPI-based REST API for chat interactions
- Multi-LLM support with automatic failover
- Context-aware response generation using retrieved documents
- Session management for conversation history
- Health checks and monitoring endpoints
Key files:
- app.py: FastAPI application with chat endpoints
- agent.py: Core RAG logic and document retrieval
- llm_providers.py: LLM provider factory and integrations
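llm_providers.py hides the per-provider SDK differences behind a small factory, so agent.py only ever calls a generic generation interface. The sketch below illustrates that pattern in general terms; the class and function names are placeholders and are not taken from the actual implementation.

```python
# Illustrative provider-factory sketch (placeholder names, not the real llm_providers.py).
import os


class BaseLLMProvider:
    def generate(self, prompt: str, context: str) -> str:
        raise NotImplementedError


class OpenAIProvider(BaseLLMProvider):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def generate(self, prompt: str, context: str) -> str:
        # Call the OpenAI chat completions API with the retrieved context here.
        ...


class OllamaProvider(BaseLLMProvider):
    def generate(self, prompt: str, context: str) -> str:
        # Call a local Ollama server here; no external API key required.
        ...


def get_provider() -> BaseLLMProvider:
    # LLM_PROVIDER mirrors llm.provider in values.yaml (openai, azure_openai, ollama, ...).
    name = os.getenv("LLM_PROVIDER", "openai")
    if name == "openai":
        return OpenAIProvider(api_key=os.environ["OPENAI_API_KEY"])
    if name == "ollama":
        return OllamaProvider()
    raise ValueError(f"Unsupported LLM provider: {name}")
```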
The Pipeline service handles document processing and embedding:
- Multi-format document support (PDF, DOCX, MD, CSV, TXT)
- Intelligent text chunking with configurable strategies
- Batch processing for large document sets
- Embedding generation using sentence transformers
- Vector store management and indexing
Key files:
- universal_pipeline.py: Main pipeline service
- vector_store.py: Vector store abstraction layer
- download_model.py: Embedding model downloader
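The chunkSize and chunkOverlap settings (see the textProcessing block in values.yaml) act as a sliding window over the document: each chunk reuses the tail of the previous one so sentences are not cut off at hard boundaries. A simplified word-based version is sketched below for intuition only; the actual pipeline also supports sentence- and character-based strategies.

```python
# Simplified word-based chunking with overlap (illustrative, not the exact
# universal_pipeline.py implementation).
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks

# With the defaults (500-word chunks, 50-word overlap), consecutive chunks
# share 50 words of context.
```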
Abstracted vector store implementations supporting:
- Qdrant: High-performance vector search
- MongoDB: Document store with vector capabilities
- ChromaDB: Open-source embedding database
- FAISS: Facebook's similarity search library
- PostgreSQL: With pgvector extension
- Elasticsearch: Full-text and vector search
- Neo4j: Graph database with vectors
- LanceDB: Modern columnar database
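Because every backend implements the same small interface (initialize/add/search, shown in the custom vector store example further below), retrieval code stays backend-agnostic: the query is embedded with the same sentence-transformers model used at indexing time and handed to search(). A hedged sketch of that retrieval step, assuming `store` is an instance of any of the implementations above:

```python
# Backend-agnostic retrieval sketch against the common vector-store interface.
from sentence_transformers import SentenceTransformer

# Must match the model used by the Pipeline service (384-dimensional by default).
model = SentenceTransformer("all-MiniLM-L12-v2")

def retrieve(store, query: str, limit: int = 5):
    query_vector = model.encode(query).tolist()
    # Every implementation exposes search(vector, limit=...) from the base class.
    return store.search(query_vector, limit=limit)
```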
Send a message and receive an AI-generated response with sources.
Request:
{
"message": "What is KubeRAG?",
"session_id": "optional-session-id"
}
Response:
{
"response": "KubeRAG is a Kubernetes-native RAG platform...",
"sources": [
{
"id": "doc-123",
"text": "Relevant document excerpt...",
"score": 0.95,
"metadata": {}
}
],
"session_id": "session-123",
"total_results": 5
}
Health check endpoint.
Response:
{
"status": "healthy",
"service": "kuberag-chat-agent",
"vector_store_initialized": true,
"embedding_model_initialized": true,
"llm_providers_available": ["azure_openai", "openai"]
}
Upload and process a document file.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body:
- file: Document file (PDF, DOCX, etc.)
- metadata: JSON string with metadata
- chunk: Boolean for chunking (default: true)
- chunk_config: JSON configuration for chunking
Response:
{
"status": "success",
"message": "File document.pdf processed and indexed",
"documents_processed": 1,
"chunks_created": 15,
"index_size": 1250,
"vector_store_used": "qdrant"
}
Process raw text input.
Request:
{
"text": "Your document text here...",
"metadata": {
"source": "manual",
"category": "documentation"
},
"chunk": true,
"id": "optional-document-id"
}
Process multiple documents in batch.
Request:
{
"documents": [
{
"text": "First document...",
"metadata": {},
"chunk": true
},
{
"text": "Second document...",
"metadata": {},
"chunk": true
}
]
}
Get pipeline statistics.
Response:
{
"service": "KubeRAG Universal Data Pipeline",
"vector_store_type": "qdrant",
"embedding_model": "all-MiniLM-L12-v2",
"embedding_dimension": 384,
"chunk_size": 500,
"chunk_overlap": 50,
"index_size": 1250
}
The values.yaml file provides comprehensive configuration options:
# LLM Configuration
llm:
provider: "azure_openai" # Options: openai, azure_openai, anthropic, gemini, ollama
azureOpenai:
deployment: "gpt-4o-mini"
endpoint: "https://your-endpoint.openai.azure.com"
apiKey: "" # Set via secret
# Vector Store Configuration
vectorStore:
type: "qdrant" # Options: qdrant, mongodb, chroma, faiss, postgresql, elasticsearch, neo4j, lancedb
dimension: 384
collectionName: "documents"
qdrant:
host: "qdrant-service"
port: 6333
deploy: true # Deploy Qdrant with the chart
# Embedding Configuration
embedding:
model: "all-MiniLM-L12-v2" # Any HuggingFace sentence transformer
# Text Processing
textProcessing:
chunkSize: 500
chunkOverlap: 50
method: "words" # Options: words, sentences, characters
# Deployment Settings
deployment:
agent:
replicas: 2
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"
# Ingress Configuration
ingress:
enabled: true
className: "nginx"
host: "kuberag.example.com"
tls:
enabled: true
secretName: "kuberag-tls"
# Auto-scaling
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
Key environment variables for services:
# Vector Store
VECTOR_STORE_TYPE=qdrant
VECTOR_STORE_DIMENSION=384
VECTOR_STORE_COLLECTION_NAME=documents
# LLM Provider
LLM_PROVIDER=azure_openai
AZURE_OPENAI_API_KEY=your-key
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com
# Embedding Model
EMBEDDING_MODEL=all-MiniLM-L12-v2
# Text Processing
CHUNK_SIZE=500
CHUNK_OVERLAP=50
For production environments:
- Use External Secrets: Store API keys in Kubernetes secrets or external secret managers
- Enable TLS: Configure ingress with TLS certificates
- Set Resource Limits: Define appropriate resource requests and limits
- Enable Auto-scaling: Configure HPA for dynamic scaling
- Use Persistent Storage: For FAISS and LanceDB deployments
Example production values:
ingress:
enabled: true
className: "nginx"
host: "rag.yourdomain.com"
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
tls:
enabled: true
secretName: "kuberag-tls"
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
deployment:
agent:
replicas: 3
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
persistence:
enabled: true
storageClass: "fast-ssd"
size: "50Gi"For development/testing:
service:
type: "NodePort"
deployment:
agent:
replicas: 1
resources:
requests:
memory: "512Mi"
cpu: "250m"
vectorStore:
type: "faiss" # Local vector store
faiss:
deploy: true
To use custom embedding models:
- Update the embedding model in values.yaml:
embedding:
model: "sentence-transformers/all-mpnet-base-v2"- The model will be automatically downloaded on first use
To add a new vector store:
- Create a new store class in orchestrator/vector_stores/
- Inherit from VectorStoreBase
- Implement the required methods
- Register it in the factory
Example:
class CustomVectorStore(VectorStoreBase):
    def initialize(self):
        # Establish the connection to the backing database / index
        pass

    def add(self, id, vector, payload):
        # Store a single embedding together with its payload/metadata
        pass

    def search(self, vector, limit=5):
        # Return the `limit` most similar vectors for the query embedding
        pass
KubeRAG supports integration with:
- Prometheus for metrics
- Grafana for visualization
- ELK stack for logging
- Jaeger for distributed tracing
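For Prometheus specifically, the usual pattern with FastAPI services is to expose a /metrics endpoint via prometheus_client and let Prometheus scrape it. The snippet below shows that generic wiring; it is not taken from the KubeRAG codebase, and the route and metric names are placeholders.

```python
# Generic FastAPI + prometheus_client wiring (illustrative, not KubeRAG code).
from fastapi import FastAPI
from prometheus_client import Counter, make_asgi_app

app = FastAPI()
chat_requests = Counter("chat_requests_total", "Chat requests served")

@app.post("/chat")  # placeholder route
async def chat(payload: dict):
    chat_requests.inc()
    return {"response": "..."}

# Expose Prometheus metrics for scraping.
app.mount("/metrics", make_asgi_app())
```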
- Horizontal Scaling: Use HPA for automatic pod scaling
- Vertical Scaling: Adjust resource limits based on load
- Vector Store Scaling: Use distributed vector stores like Qdrant cluster mode
- Caching: Implement Redis for response caching
KubeRAG supports multiple LLM providers and vector stores. Here are all possible deployment combinations:
| LLM Provider | Qdrant | MongoDB | ChromaDB | FAISS | PostgreSQL | Elasticsearch | Neo4j | LanceDB | Total |
|---|---|---|---|---|---|---|---|---|---|
| Azure OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 8 |
| OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 8 |
| Anthropic | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 8 |
| Google Gemini | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 8 |
| Ollama (Local) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 8 |
| Total | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 40 |
- Qdrant: High-performance dedicated vector database (Recommended for production)
- MongoDB: Document database with vector search capabilities
- ChromaDB: Open-source embedding database
- FAISS: Facebook's library for efficient similarity search
- PostgreSQL: Traditional database with pgvector extension
- Elasticsearch: Search engine with vector capabilities
- Neo4j: Graph database with vector search
- LanceDB: Modern columnar vector database
- Azure OpenAI: Enterprise-grade OpenAI models with Azure security
- OpenAI: Direct OpenAI API access
- Anthropic: Claude models for advanced reasoning
- Google Gemini: Google's multimodal AI models
- Ollama: Run local models without external API dependencies
All 40 combinations are fully supported and can be deployed using the KubeRAG Helm chart with appropriate configuration.
