End-to-end Retrieval-Augmented Generation (RAG) system for working with large collections of unstructured documents, reports, and long-form content.
This project focuses on production concerns rather than demos: retrieval quality, re-indexing workflows, and system-level design.
- Ingests and normalizes unstructured documents
- Applies configurable chunking strategies based on content type (sketched below)
- Generates embeddings and indexes them in a vector database
- Performs semantic (and hybrid) retrieval with metadata filtering
- Assembles context within model limits while preserving traceability
- Streams LLM responses with document-level citations
- Supports re-indexing and tuning without re-ingesting source data
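As a rough illustration of the content-type-aware chunking above, the sketch below picks a splitting strategy per document type. The `Chunk` shape, strategy names, and size/overlap values are illustrative assumptions, not this project's actual interface.

```python
# Illustrative sketch only: the Chunk shape, strategies, and sizes below are
# assumptions, not this project's actual interface.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_id: str  # originating document, kept for citations and re-indexing
    position: int   # chunk order within the document


def chunk_document(text: str, source_id: str, content_type: str) -> list[Chunk]:
    """Choose a chunking strategy based on the document's content type."""
    if content_type == "markdown":
        # Split on second-level headings so chunks stay topically coherent.
        parts = [p for p in text.split("\n## ") if p.strip()]
    else:
        # Fallback: fixed-size character windows with overlap.
        size, overlap = 1200, 200
        parts = [text[i:i + size] for i in range(0, len(text), size - overlap)]
    return [Chunk(text=p, source_id=source_id, position=i) for i, p in enumerate(parts)]
```

Keeping `source_id` and `position` on every chunk is what later makes document-level citations and selective re-indexing possible.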
Most RAG examples stop at simple prompt + vector search demos.
This project treats RAG as a system:
- data pipelines
- retrieval strategies
- observability
- iteration based on real usage
The goal is to build something that can evolve as models, data, and product requirements change.
- Ingestion
  - File upload & normalization
  - Document parsing and cleaning
  - Chunking strategies tuned per document type
- Embedding & Indexing
  - Embedding generation
  - Vector database storage
  - Rich metadata for filtering and re-indexing
- Retrieval
  - Semantic search (optionally hybrid)
  - Context window management
  - Source-aware chunk selection
- Generation
  - LLM response streaming
  - Citation attachment
  - Token-aware context assembly
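To make the context-window and token-budget points above concrete, here is a minimal sketch of token-aware context assembly: chunks arrive ordered by retrieval score and are packed until a budget runs out, keeping their source ids so citations can be attached to the answer. The tokenizer choice, budget, and chunk dict shape are assumptions.

```python
# Minimal sketch of token-budgeted context assembly; the tokenizer choice,
# budget, and chunk dict shape are assumptions for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def assemble_context(chunks: list[dict], budget: int = 3000) -> tuple[str, list[str]]:
    """Pack chunks (ordered by relevance) into a token budget and return the
    prompt context plus the source ids needed for citations."""
    parts, sources, used = [], [], 0
    for chunk in chunks:
        cost = len(enc.encode(chunk["text"]))
        if used + cost > budget:
            break
        parts.append(f"[{chunk['source_id']}]\n{chunk['text']}")
        if chunk["source_id"] not in sources:
            sources.append(chunk["source_id"])
        used += cost
    return "\n\n".join(parts), sources
```

The stack below is what these stages run on: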
- FastAPI (API layer)
- PostgreSQL (source of truth, metadata)
- Vector database (Qdrant / FAISS)
- Background workers for ingestion & embedding
- LLMs (OpenAI-compatible APIs or local models)
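As one example of how the vector layer is used, the sketch below runs metadata-filtered semantic search against Qdrant with `qdrant-client`; the collection name, payload key, and the precomputed query vector are assumptions.

```python
# Sketch of metadata-filtered semantic search with qdrant-client; the
# collection name and payload key are assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")


def retrieve(query_vector: list[float], doc_type: str, top_k: int = 8):
    """Return the top_k chunks of the given doc_type closest to the query vector."""
    return client.search(
        collection_name="chunks",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="doc_type", match=MatchValue(value=doc_type))]
        ),
        limit=top_k,
    )
```

Because the filter is evaluated inside the vector store, retrieval can be narrowed to a document type or tenant without a second pass over the results. A few principles guide these choices: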
- RAG is a system, not a prompt
- Data quality > prompt engineering
- Retrieval quality is iterative
- Re-indexing should be cheap
- Observability matters
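To illustrate "re-indexing should be cheap": assuming normalized chunk text and metadata persist alongside the source-of-truth records in PostgreSQL, swapping embedding models only means re-embedding stored chunks into a fresh collection, never re-parsing source files. The collection name, vector size, and the `embed()` stub below are assumptions.

```python
# Sketch of re-indexing without re-ingestion: already-stored chunks are
# re-embedded into a fresh collection when the embedding model changes.
# Collection name, vector size, and the embed() stub are assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams


def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in the new embedding model here")


def reindex(chunks: list[dict], client: QdrantClient, collection: str = "chunks_v2") -> None:
    """Re-embed previously ingested chunks into a new collection."""
    client.recreate_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )
    client.upsert(
        collection_name=collection,
        points=[
            PointStruct(id=c["id"], vector=embed(c["text"]), payload=c["metadata"])
            for c in chunks
        ],
    )
```

Once the new collection is validated, queries can switch over to it; the original source documents are never touched.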
This is an actively evolving project used to explore and validate production-grade RAG patterns.
Built and maintained by Andrey Keske, an Applied AI Engineer focused on RAG, embeddings, and semantic search.