Tiny, Modular, Agentic RAG System Built From Scratch.
TinyRAG is a Retrieval-Augmented Generation (RAG) framework designed to perform query analysis and intelligent retrieval. While the architecture is language-agnostic, this reference implementation is optimized for Polish corpora: it uses morphological analysis in Elasticsearch, semantic search in Qdrant, and specialized Polish prompts.
Unlike simple RAG implementations that blindly feed retrieved chunks to an LLM, TinyRAG employs a multi-agentic approach involving Query Decomposition, Adaptive Routing, Smart Filtering, and Hallucination Validation.
The entire system runs locally using Ollama for inference, making it private, secure, and cost-effective. It is designed to work with local models such as Qwen 2.5, Llama 3.1, or Bielik.
- Hybrid Search: Combines Elasticsearch (Lexical/BM25 with Polish morphological analysis) and Qdrant (Semantic/Vector search) using Reciprocal Rank Fusion (RRF).
- Agentic Reasoning Pipeline:
- Decomposer Agent: Breaks down complex user queries into sub-questions.
- Router Agent: Dynamically assigns weights to Lexical vs. Semantic search based on query type (e.g., factual vs. abstract).
- Relevance Filter Agent: Analyzes retrieved documents before context construction to discard irrelevant noise.
- Validator Agent: Verifies the final answer against the context to prevent hallucinations.
- Memory System: Logs unresolved queries or hallucinations into `pending.json` for human-in-the-loop review.
- Client-Server Architecture: Decouples heavy inference logic (API) from the lightweight Terminal UI.
- Three Operation Modes: Interactive TUI, Python Library, and REST API.
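For illustration, a rejected answer might be logged to the memory file as one JSON object per line. The exact schema of `pending.json` is not documented here, so the keys below are assumptions, not the real format:

```python
import json

# Hypothetical entry appended to memory/pending.json when validation fails.
# The actual keys used by TinyRAG are an assumption for illustration only.
entry = {
    "query": "Jakie zmiany w bagażu wprowadza Ryanair?",
    "reason": "hallucination",   # or e.g. "no_relevant_documents"
    "answer": "...",             # the rejected draft answer
}
line = json.dumps(entry, ensure_ascii=False)
print(line)
```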
TinyRAG is not just a wrapper around a vector database; it is a fully orchestrated pipeline where multiple AI agents collaborate to solve a user's query.
The system is designed to handle diverse datasets stored in .jsonl format. In this reference implementation, we utilize two distinct corpora to demonstrate scalability:
- Small Corpus (`articles_30.jsonl`): A curated set of 30 news articles covering mixed topics (migration, aviation, local news). Perfect for debugging and quick validation.
- Large Corpus (`culturax_pl_clean...`): A larger dataset containing over 10,000 documents from the CulturaX Polish subset. This tests the retrieval system's ability to find a needle in a haystack.
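A `.jsonl` corpus is simply one JSON object per line. As a minimal loading sketch (the exact field names in the corpus files are an assumption here, not the documented schema):

```python
import io
import json

# Two hypothetical records in the spirit of articles_30.jsonl; the real
# corpus uses one JSON object per line, but these field names are illustrative.
raw = io.StringIO(
    '{"id": 1, "title": "Ryanair", "text": "Nowe zasady bagażu..."}\n'
    '{"id": 2, "title": "Lublin", "text": "Policja znalazła..."}\n'
)
docs = [json.loads(line) for line in raw if line.strip()]
print(len(docs))  # 2
```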
TinyRAG employs a Hybrid Search strategy to capture both exact matches and semantic meaning.
- Elasticsearch (The Lexical Engine):
  - Configured with the `morfologik` plugin for Polish language stemming and lemmatization.
  - Responsible for finding exact keywords, acronyms (e.g., "PZERiI"), identifiers (e.g., "G3440"), and proper names.
  - Implemented in `rag/retrieval/elastic.py`.
- Qdrant (The Semantic Engine):
  - Stores dense vector embeddings generated by `sentence-transformers` (model: `all-MiniLM-L6-v2`).
  - Responsible for understanding concepts, context, and intent, even if keywords don't match exactly.
  - Implemented in `rag/retrieval/qdrant.py`.
- Reciprocal Rank Fusion (RRF):
  - The results from both engines are merged using a weighted RRF algorithm. The weights are not static; they are dynamically adjusted per query by the Router Agent.
  - Implemented in `rag/retrieval/fusion.py`.
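The fusion step can be sketched as follows: each engine contributes `weight / (k + rank)` per document, the standard RRF formula with the conventional constant `k = 60`. This is an illustrative reimplementation, not the exact code in `rag/retrieval/fusion.py`:

```python
def weighted_rrf(es_hits, qdrant_hits, w_es=0.5, w_qdrant=0.5, k=60):
    """Merge two ranked lists of document ids with weighted Reciprocal Rank Fusion.

    Each engine adds weight / (k + rank) to a document's score, so documents
    ranked highly by both engines float to the top of the fused list.
    """
    scores = {}
    for weight, hits in ((w_es, es_hits), (w_qdrant, qdrant_hits)):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A factual query routed towards Elasticsearch (hypothetical weights):
fused = weighted_rrf(["a", "b", "c"], ["b", "c", "d"], w_es=0.8, w_qdrant=0.2)
print(fused)  # "b" ranks first: it appears high in both lists
```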
TinyRAG orchestrates four specialized LLM agents defined in rag/reasoning/.
Complex questions often fail in vector search because the query vector is averaged over too many topics.
- Role: Analyzes the user's input and breaks it down into granular, atomic sub-questions.
- Example:
- User: "Jakie zmiany w bagażu wprowadza Ryanair?" ("What baggage changes is Ryanair introducing?")
- Decomposer: "1. Jakie są nowe zasady bagażu Ryanair? 2. Czy zmieniły się opłaty za bagaż?" ("1. What are Ryanair's new baggage rules? 2. Have the baggage fees changed?")
Not all questions are equal. Some need exact keyword matches, others need conceptual understanding.
- Role: Analyzes each sub-question and assigns weights (`es` vs `qdrant`).
- Logic:
  - Factual (IDs, acronyms) -> Boost Elasticsearch (e.g., ES=0.8, Qdrant=0.2).
  - Abstract (concepts, "how to") -> Boost Qdrant (e.g., ES=0.3, Qdrant=0.7).
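The routing decision can be pictured as a small weight record attached to each sub-question. The exact schema produced by the `RouterAgent` in `rag/retrieval/router.py` is an assumption; the second sub-question below is invented for illustration:

```python
# Hypothetical routing decisions; the real RouterAgent may use different keys.
routes = [
    {"sub_question": "Jakie są nowe zasady bagażu Ryanair?",      # factual
     "weights": {"es": 0.8, "qdrant": 0.2}},
    {"sub_question": "Jak działa polityka bagażowa tanich linii?", # abstract
     "weights": {"es": 0.3, "qdrant": 0.7}},
]
# Weights are complementary, so fused RRF scores stay comparable across queries.
for route in routes:
    assert abs(sum(route["weights"].values()) - 1.0) < 1e-9
```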
Standard RAGs often feed the top-k documents directly to the LLM, polluting the context with irrelevant data that happens to share keywords.
- Role: Reads the content of the top retrieved candidates.
- Action: Decides `is_relevant: true/false` for each document relative to the query.
- Result: Only high-quality documents enter the final context window. Irrelevant ones are discarded and logged.
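Conceptually the filter is a per-document yes/no judgment. A minimal sketch follows; the real `SmartFilter` asks the LLM, whereas here the judge is injected as a plain callable so the shape of the loop is visible:

```python
def filter_relevant(docs, judge, limit=5):
    """Keep at most `limit` documents the judge marks relevant; return the rest.

    `judge(doc)` stands in for the LLM call that answers is_relevant: true/false.
    """
    kept, rejected = [], []
    for doc in docs:
        (kept if judge(doc) else rejected).append(doc)
    return kept[:limit], rejected

# Toy judge: only documents mentioning the keyword survive.
docs = ["Ryanair zmienia zasady bagażu", "Pogoda w Lublinie", "Bagaż podręczny Ryanair"]
kept, rejected = filter_relevant(docs, judge=lambda d: "Ryanair" in d)
print(kept)      # the two Ryanair documents
print(rejected)  # the weather article
```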
The final line of defense against hallucinations.
- Role: After the Generator produces an answer, the Validator cross-checks it against the provided context.
- Action: If the answer contains facts not present in the source text, it flags the response as a Hallucination.
- Safe Mode: In "Safe Mode", a failed validation triggers a retry loop with a stricter prompt before giving up.
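The Safe Mode loop can be sketched generically. Here `generate` and `validate` are stand-ins for the Generator and Validator agents, and the retry count and stricter-prompt mechanism are assumptions about the implementation:

```python
def answer_safely(question, context, generate, validate, max_retries=2):
    """Generate an answer; retry with a stricter instruction if validation fails.

    `generate(question, context, strict)` and `validate(answer, context)` are
    stand-ins for the LLM-backed Generator and Validator agents.
    """
    for attempt in range(max_retries + 1):
        answer = generate(question, context, strict=attempt > 0)
        if validate(answer, context):
            return answer, True
    return answer, False  # give up; the caller can log to memory/pending.json

# Toy agents: the first draft "hallucinates"; the strict retry sticks to context.
context = "Ryanair pozwala na bagaż 40x20x25 cm."
gen = lambda q, c, strict: c if strict else "Ryanair pozwala na 2 sztuki bagażu."
val = lambda a, c: a in c
answer, ok = answer_safely("Jaki bagaż?", context, gen, val)
print(ok)  # True after one retry
```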
```
tiny_rag/
├── config/
│   ├── config.yaml              # Main system configuration (URLs, models, limits)
│   └── prompts.yaml             # System prompts for all AI agents (in Polish)
├── corpuses/                    # Data directory
│   ├── articles_30.jsonl
│   └── culturax_pl_clean_10k_reach.jsonl
├── images/                      # Assets for README
├── interfaces/
│   ├── api.py                   # FastAPI backend implementation
│   └── tui.py                   # Textual-based Terminal User Interface
├── memory/
│   └── pending.json             # Log for failed/unresolved queries
├── rag/
│   ├── core.py                  # TinyRAG class: The main orchestrator
│   ├── indexing.py              # Indexer class: Handles data ingestion
│   ├── llm.py                   # LLMClient: Wrapper for Ollama API
│   ├── reasoning/
│   │   ├── decomposition.py     # Decomposer class
│   │   ├── filtering.py         # SmartFilter class
│   │   └── validation.py        # Validator class
│   └── retrieval/
│       ├── elastic.py           # ElasticRetriever class
│       ├── qdrant.py            # QdrantRetriever class
│       ├── fusion.py            # weighted_rrf function
│       └── router.py            # RouterAgent class
├── scripts/
│   ├── index_data.py            # Script to populate vector stores
│   ├── run_api.sh               # Launch API only
│   ├── run_app.sh               # Launch API + TUI (Recommended)
│   └── setup.sh                 # Initial environment setup
├── docker-compose.yaml          # Vector DBs orchestration
├── Dockerfile                   # Custom Elasticsearch image
├── pyproject.toml               # Dependency management
└── main.py                      # Simple CLI entry point
```
Before running TinyRAG, ensure you have the following installed:
- Docker & Docker Compose: For running Elasticsearch and Qdrant.
- Python 3.11+: The project uses modern Python features.
- uv: An extremely fast Python package installer and resolver.
- Ollama: For running the Local LLM.
TinyRAG defaults to `qwen2.5:14b`, which offers an excellent balance of reasoning capabilities and Polish language support.
```bash
ollama pull qwen2.5:14b
ollama pull all-minilm
```

Note: You can change the model in `config/config.yaml`.
Use the provided setup script to create the virtual environment, install dependencies, and start the necessary Docker containers.
```bash
chmod +x scripts/*.sh
./scripts/setup.sh
```

This script will:
- Create a `.venv` using `uv`.
- Install the project in editable mode.
- Build and start the Docker containers (Elasticsearch with Morfologik plugin and Qdrant).
- Wait for the databases to initialize.
Once the environment is up, index the provided corpora. This process generates embeddings and pushes data to both Elasticsearch and Qdrant.
```bash
source .venv/bin/activate
python scripts/index_data.py
```

TinyRAG provides three distinct ways to interact with the system.
This is the recommended mode. It launches the backend API in the background and connects a beautiful, responsive terminal interface to it. It visualizes the entire reasoning process, including decomposition, routing decisions, and document validation stats.
To run:
```bash
./scripts/run_app.sh
```

Interface Overview:
The interface is built with Textual and supports themes (Dracula by default).
- Chat View: Displays the conversation history.
- Thought Process: Shows how the query was decomposed and how weights were assigned.
- Evidence: Lists kept and rejected documents with reasons.
- Validation: Indicates if the answer passed the fact-check.
Keyboard Shortcuts:
- `PageUp`/`PageDown`: Scroll the chat history.
- `Home`/`End`: Jump to the top/bottom.
You can use TinyRAG directly in your Python scripts or Jupyter Notebooks. This is useful for batch processing or debugging.
```python
import yaml
from rag.core import TinyRAG

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)
with open("config/prompts.yaml") as f:
    prm = yaml.safe_load(f)

rag = TinyRAG(cfg, prm)

response = rag.query(
    user_input="Co znaleziono w samochodzie w Lublinie?",
    query_type="factual",
    corpus="small",
    mode="safe",
)

print(f"Answer: {response['answer']}")
print(f"Validation: {response['validation']}")
```

For integration with other applications, you can run the standalone API server.
Start the server:
```bash
./scripts/run_api.sh
```

Query the API:

```bash
curl -X POST "http://127.0.0.1:8000/rag" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Jakie zmiany w bagażu wprowadza Ryanair?",
    "corpus": "small",
    "mode": "safe"
  }'
```

Define your infrastructure endpoints, model selection, and search parameters here.
```yaml
system:
  es_url: "http://localhost:9200"
  ollama_url: "http://localhost:11434/api/generate"
  llm_model: "qwen2.5:14b"   # Change to llama3.1:8b or bielik if needed

search:
  retrieval_limit: 15        # Docs fetched per search engine
  chunk_size: 500            # Context window chunking
  final_context_limit: 5     # Max docs passed to LLM after filtering
```

macOS / Apple Silicon Issues:
If you encounter `ValueError: bad value(s) in fds_to_keep` or process crashes, it is due to a conflict between multiprocessing (used by tokenizers/torch) and the asyncio loop of the TUI.
The provided scripts (`run_app.sh`, `run_api.sh`) automatically apply the necessary fixes:
```bash
export JOBLIB_MULTIPROCESSING=0
export LOKY_MAX_CPU_COUNT=1
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
export TOKENIZERS_PARALLELISM=false
```

If you are on macOS, always run the application via these scripts rather than calling Python directly.
Elasticsearch Connection Refused:
The ES container takes about 30-60 seconds to fully start because it loads the Morfologik plugin. Ensure `curl http://localhost:9200` returns `200 OK` before running the indexer.
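The manual `curl` check can be automated with a small polling helper. This is a generic sketch, not part of TinyRAG; `probe` stands in for an HTTP GET against `http://localhost:9200` and is injected as a callable so the loop itself is testable without a running cluster:

```python
import time

def wait_until_ready(probe, timeout=90.0, interval=2.0, sleep=time.sleep):
    """Poll `probe()` until it returns True or `timeout` seconds elapse.

    In practice `probe` would wrap an HTTP GET to http://localhost:9200
    and return True on a 200 response.
    """
    waited = 0.0
    while waited < timeout:
        if probe():
            return True
        sleep(interval)
        waited += interval
    return False

# Toy probe: Elasticsearch "comes up" on the third poll.
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), sleep=lambda s: None)
print(ready)  # True
```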


