Kacper0199/Tiny-RAG-PL
TinyRAG

Tiny, Modular, Agentic RAG System Built From Scratch.

TinyRAG Interface

Overview

TinyRAG is a Retrieval-Augmented Generation (RAG) framework designed to perform query analysis and intelligent retrieval. While the architecture is language-agnostic, this implementation is optimized for Polish-language corpora: it uses Polish morphological analysis in Elasticsearch, dense-vector search in Qdrant, and Polish-specific agent prompts.

Unlike simple RAG implementations that blindly feed retrieved chunks to an LLM, TinyRAG employs a multi-agentic approach involving Query Decomposition, Adaptive Routing, Smart Filtering, and Hallucination Validation.

The entire system runs locally using Ollama for inference, making it private, secure, and cost-effective. It is designed to work with local models such as Qwen 2.5, Llama 3.1, or Bielik.

Key Features

  • Hybrid Search: Combines Elasticsearch (Lexical/BM25 with Polish morphological analysis) and Qdrant (Semantic/Vector search) using Reciprocal Rank Fusion (RRF).
  • Agentic Reasoning Pipeline:
    • Decomposer Agent: Breaks down complex user queries into sub-questions.
    • Router Agent: Dynamically assigns weights to Lexical vs. Semantic search based on query type (e.g., factual vs. abstract).
    • Relevance Filter Agent: Analyzes retrieved documents before context construction to discard irrelevant noise.
    • Validator Agent: Verifies the final answer against the context to prevent hallucinations.
  • Memory System: Logs unresolved queries or hallucinations into pending.json for human-in-the-loop review.
  • Client-Server Architecture: Decouples heavy inference logic (API) from the lightweight Terminal UI.
  • Three Operation Modes: Interactive TUI, Python Library, and REST API.

System Architecture & Methodology

TinyRAG is not just a wrapper around a vector database; it is a fully orchestrated pipeline in which multiple AI agents collaborate to answer a user's query.

1. The Corpus Structure

The system is designed to handle diverse datasets stored in .jsonl format. In this reference implementation, we utilize two distinct corpora to demonstrate scalability:

  • Small Corpus (articles_30.jsonl): A curated set of 30 news articles covering mixed topics (migration, aviation, local news). Perfect for debugging and quick validation.
  • Large Corpus (culturax_pl_clean...): A larger dataset containing over 10,000 documents from the CulturaX Polish subset. It tests the retrieval system's ability to find a needle in a haystack.
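The corpora are plain .jsonl files: one standalone JSON object per line. As a minimal sketch (the exact field names used by articles_30.jsonl are an assumption here, not confirmed by the repo):

```python
import json

# Hypothetical record shape for one line of a .jsonl corpus file; the actual
# field names in articles_30.jsonl may differ.
record = {
    "id": "art-001",
    "title": "Ryanair zmienia zasady bagażu",
    "text": "Pełna treść artykułu...",
}

# Serialize one record to a single line, then read it back:
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```

The ensure_ascii=False flag keeps Polish diacritics readable in the file instead of escaping them to \uXXXX sequences.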

2. Search Engines

TinyRAG employs a Hybrid Search strategy to capture both exact matches and semantic meaning.

  • Elasticsearch (The Lexical Engine):

    • Configured with the morfologik plugin for Polish language stemming and lemmatization.
    • Responsible for finding exact keywords, acronyms (e.g., "PZERiI"), identifiers (e.g., "G3440"), and proper names.
    • Implemented in rag/retrieval/elastic.py.
  • Qdrant (The Semantic Engine):

    • Stores dense vector embeddings generated by sentence-transformers (model: all-MiniLM-L6-v2).
    • Responsible for understanding concepts, context, and intent, even if keywords don't match exactly.
    • Implemented in rag/retrieval/qdrant.py.
  • Reciprocal Rank Fusion (RRF):

    • The results from both engines are merged using a weighted RRF algorithm. The weights are not static; they are dynamically adjusted per query by the Router Agent.
    • Implemented in rag/retrieval/fusion.py.
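The core of weighted RRF is small enough to sketch in a few lines. This is an illustration of the general algorithm, not the exact code in rag/retrieval/fusion.py; the constant k=60 is the value commonly used in the RRF literature, and the weight parameters stand in for the Router Agent's per-query decision:

```python
def weighted_rrf(es_hits, qdrant_hits, w_es=0.5, w_qdrant=0.5, k=60):
    """Merge two ranked lists of doc IDs with weighted Reciprocal Rank Fusion.

    Each document scores w / (k + rank) per engine; scores are summed, so a
    document ranked well by both engines outranks one favored by only one.
    """
    scores = {}
    for rank, doc_id in enumerate(es_hits, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + w_es / (k + rank)
    for rank, doc_id in enumerate(qdrant_hits, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + w_qdrant / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# With ES boosted (a "factual" query), the lexical ranking dominates the merge:
merged = weighted_rrf(["a", "b", "c"], ["b", "c", "a"], w_es=0.8, w_qdrant=0.2)
```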

3. The Agentic Workforce

TinyRAG orchestrates four specialized LLM agents defined in rag/reasoning/.

A. Decomposer Agent (rag/reasoning/decomposition.py)

Complex questions often fail in vector search because the query vector is averaged over too many topics.

  • Role: Analyzes the user's input and breaks it down into granular, atomic sub-questions.
  • Example:
    • User: "Jakie zmiany w bagażu wprowadza Ryanair?"
    • Decomposer: "1. Jakie są nowe zasady bagażu Ryanair? 2. Czy zmieniły się opłaty za bagaż?"
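If the Decomposer returns its sub-questions as a numbered list (as in the example above), splitting them apart is straightforward. A minimal sketch, assuming that output format; the real prompt and parsing live in config/prompts.yaml and rag/reasoning/decomposition.py:

```python
import re

def parse_subquestions(llm_output: str) -> list[str]:
    """Split a numbered-list LLM reply ('1. ... 2. ...') into sub-questions."""
    parts = re.split(r"\s*\d+\.\s*", llm_output.strip())
    return [p.strip() for p in parts if p.strip()]

subs = parse_subquestions(
    "1. Jakie są nowe zasady bagażu Ryanair? 2. Czy zmieniły się opłaty za bagaż?"
)
```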

B. Router Agent (rag/retrieval/router.py)

Not all questions are equal. Some need exact keyword matches, others need conceptual understanding.

  • Role: Analyzes each sub-question and assigns weights (es vs qdrant).
  • Logic:
    • Factual (IDs, acronyms) -> Boost Elasticsearch (e.g., ES=0.8, Qdrant=0.2).
    • Abstract (concepts, "how to") -> Boost Qdrant (e.g., ES=0.3, Qdrant=0.7).
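The routing decision can be pictured with a toy heuristic. Note the real RouterAgent asks the LLM to classify the query; the regex below is only a stand-in that treats ID-like tokens (acronyms, alphanumeric codes such as "G3440") as a signal for a factual query:

```python
import re

def route_weights(question: str) -> dict:
    """Toy heuristic mirroring the Router Agent's output shape.

    ID-like tokens suggest a factual query that needs exact keyword matching;
    everything else is treated as abstract/conceptual.
    """
    factual = bool(re.search(r"\b[A-Z]{2,}\d*\b|\b[A-Z]\d{3,}\b", question))
    if factual:
        return {"es": 0.8, "qdrant": 0.2}   # exact keywords matter
    return {"es": 0.3, "qdrant": 0.7}       # conceptual understanding matters

w = route_weights("Co oznacza kod G3440?")
```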

C. Smart Filter Agent (rag/reasoning/filtering.py)

Standard RAGs often feed the top-k documents directly to the LLM, polluting the context with irrelevant data that happens to share keywords.

  • Role: Reads the content of the top retrieved candidates.
  • Action: Decides is_relevant: true/false for each document relative to the query.
  • Result: Only high-quality documents enter the final context window. Irrelevant ones are discarded and logged.
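The filtering step boils down to parsing a per-document verdict out of the LLM's reply. A sketch, assuming the filter returns JSON with an is_relevant key (the actual prompt and schema are defined in config/prompts.yaml and rag/reasoning/filtering.py); defaulting to "keep" on a malformed reply is a deliberately conservative choice:

```python
import json

def parse_verdict(raw: str) -> bool:
    """Parse the filter LLM's JSON verdict, keeping the document on failure."""
    try:
        return bool(json.loads(raw).get("is_relevant", True))
    except (json.JSONDecodeError, AttributeError):
        return True  # when in doubt, keep the document

verdicts = [("doc1", '{"is_relevant": true}'),
            ("doc2", '{"is_relevant": false}')]
kept = [doc for doc, raw in verdicts if parse_verdict(raw)]
```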

D. Validator Agent (rag/reasoning/validation.py)

The final line of defense against hallucinations.

  • Role: After the Generator produces an answer, the Validator cross-checks it against the provided context.
  • Action: If the answer contains facts not present in the source text, it flags the response as a Hallucination.
  • Safe Mode: In "Safe Mode", a failed validation triggers a retry loop with a stricter prompt before giving up.
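The Safe Mode retry loop can be sketched as below. The function signatures for generate and validate are assumptions for illustration; the actual orchestration lives in rag/core.py and rag/reasoning/validation.py:

```python
def answer_with_validation(generate, validate, question, context, max_retries=1):
    """Generate an answer and re-generate with a stricter prompt if the
    Validator flags it, giving up after max_retries extra attempts."""
    strict = False
    for _ in range(max_retries + 1):
        answer = generate(question, context, strict=strict)
        if validate(answer, context):
            return {"answer": answer, "validation": "passed"}
        strict = True  # retry with a stricter, context-only prompt
    return {"answer": None, "validation": "hallucination"}
```

Keeping the loop bounded means a persistently hallucinating model surfaces as an explicit failure (logged to pending.json) rather than an endless retry.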

Project Structure (Files & Classes)

tiny_rag/
├── config/
│   ├── config.yaml              # Main system configuration (URLs, models, limits)
│   └── prompts.yaml             # System prompts for all AI agents (in Polish)
├── corpuses/                  # Data directory
│   ├── articles_30.jsonl
│   └── culturax_pl_clean_10k_reach.jsonl
├── images/                      # Assets for README
├── interfaces/
│   ├── api.py                   # FastAPI backend implementation
│   └── tui.py                   # Textual-based Terminal User Interface
├── memory/
│   └── pending.json             # Log for failed/unresolved queries
├── rag/
│   ├── core.py                  # TinyRAG class: The main orchestrator
│   ├── indexing.py              # Indexer class: Handles data ingestion
│   ├── llm.py                   # LLMClient: Wrapper for Ollama API
│   ├── reasoning/
│   │   ├── decomposition.py     # Decomposer class
│   │   ├── filtering.py         # SmartFilter class
│   │   └── validation.py        # Validator class
│   └── retrieval/
│       ├── elastic.py           # ElasticRetriever class
│       ├── qdrant.py            # QdrantRetriever class
│       ├── fusion.py            # weighted_rrf function
│       └── router.py            # RouterAgent class
├── scripts/
│   ├── index_data.py            # Script to populate vector stores
│   ├── run_api.sh               # Launch API only
│   ├── run_app.sh               # Launch API + TUI (Recommended)
│   └── setup.sh                 # Initial environment setup
├── docker-compose.yaml          # Vector DBs orchestration
├── Dockerfile                   # Custom Elasticsearch image
├── pyproject.toml               # Dependency management
└── main.py                      # Simple CLI entry point

Prerequisites

Before running TinyRAG, ensure you have the following installed:

  1. Docker & Docker Compose: For running Elasticsearch and Qdrant.
  2. Python 3.11+: The project uses modern Python features.
  3. uv: An extremely fast Python package installer and resolver.
  4. Ollama: For running the Local LLM.

1. Model Setup (Ollama)

TinyRAG defaults to qwen2.5:14b, which offers an excellent balance of reasoning capabilities and Polish language support.

ollama pull qwen2.5:14b
ollama pull all-minilm

Note: You can change the model in config/config.yaml.

2. Environment Installation

Use the provided setup script to create the virtual environment, install dependencies, and start the necessary Docker containers.

chmod +x scripts/*.sh
./scripts/setup.sh

This script will:

  1. Create a .venv using uv.
  2. Install the project in editable mode.
  3. Build and start the Docker containers (Elasticsearch with Morfologik plugin and Qdrant).
  4. Wait for the databases to initialize.

3. Data Indexing

Once the environment is up, index the provided corpora. This process generates embeddings and pushes data to both Elasticsearch and Qdrant.

source .venv/bin/activate
python scripts/index_data.py

Usage Modes

TinyRAG provides three distinct ways to interact with the system.

1. Terminal User Interface (TUI)

This is the recommended mode. It launches the backend API in the background and connects a beautiful, responsive terminal interface to it. It visualizes the entire reasoning process, including decomposition, routing decisions, and document validation stats.

To run:

./scripts/run_app.sh

Interface Overview:

The interface is built with Textual and supports themes (Dracula by default).

  • Chat View: Displays the conversation history.
  • Thought Process: Shows how the query was decomposed and how weights were assigned.
  • Evidence: Lists kept and rejected documents with reasons.
  • Validation: Indicates if the answer passed the fact-check.

TUI Menu

Keyboard Shortcuts:

  • PageUp / PageDown: Scroll the chat history.
  • Home / End: Jump to the top/bottom.

TUI Shortcuts

2. Python API (Programmatic)

You can use TinyRAG directly in your Python scripts or Jupyter Notebooks. This is useful for batch processing or debugging.

import yaml
from rag.core import TinyRAG

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)
with open("config/prompts.yaml") as f:
    prm = yaml.safe_load(f)

rag = TinyRAG(cfg, prm)

response = rag.query(
    user_input="Co znaleziono w samochodzie w Lublinie?",
    query_type="factual",
    corpus="small",
    mode="safe"
)

print(f"Answer: {response['answer']}")
print(f"Validation: {response['validation']}")

3. REST API

For integration with other applications, you can run the standalone API server.

Start the server:

./scripts/run_api.sh

Query the API:

curl -X POST "http://127.0.0.1:8000/rag" \
     -H "Content-Type: application/json" \
     -d '{
           "query": "Jakie zmiany w bagażu wprowadza Ryanair?",
           "corpus": "small",
           "mode": "safe"
         }'
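The same request from Python, using only the standard library. This mirrors the curl call above; the response schema is assumed to match the answer/validation keys shown in the Python API section:

```python
import json
import urllib.request

def ask(query, corpus="small", mode="safe", base_url="http://127.0.0.1:8000"):
    """POST the same JSON body as the curl example to the /rag endpoint."""
    body = json.dumps({"query": query, "corpus": corpus, "mode": mode}).encode()
    req = urllib.request.Request(
        f"{base_url}/rag",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Agentic pipelines are slow; allow a generous timeout.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())

# answer = ask("Jakie zmiany w bagażu wprowadza Ryanair?")
```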

Configuration

Core Settings (config/config.yaml)

Define your infrastructure endpoints, model selection, and search parameters here.

system:
  es_url: "http://localhost:9200"
  ollama_url: "http://localhost:11434/api/generate"
  llm_model: "qwen2.5:14b"  # Change to llama3.1:8b or bielik if needed

search:
  retrieval_limit: 15     # Docs fetched per search engine
  chunk_size: 500         # Context window chunking
  final_context_limit: 5  # Max docs passed to LLM after filtering

Troubleshooting

macOS / Apple Silicon Issues: If you encounter ValueError: bad value(s) in fds_to_keep or the process crashes, this is due to a conflict between multiprocessing (used by tokenizers/torch) and the asyncio event loop of the TUI.

The provided scripts (run_app.sh, run_api.sh) automatically apply the necessary fixes:

export JOBLIB_MULTIPROCESSING=0
export LOKY_MAX_CPU_COUNT=1
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
export TOKENIZERS_PARALLELISM=false

On macOS, always run the application via these scripts rather than invoking Python directly.

Elasticsearch Connection Refused: The ES container takes about 30-60 seconds to fully start because it loads the Morfologik plugin. Ensure curl http://localhost:9200 returns 200 OK before running the indexer.
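If you want to wait programmatically instead of polling with curl by hand, a small stdlib helper does the job. A sketch, assuming the default ES URL from config/config.yaml:

```python
import time
import urllib.request
import urllib.error

def wait_for_es(url="http://localhost:9200", timeout_s=90, interval_s=3):
    """Poll Elasticsearch until it answers with HTTP 200, or give up.

    The Morfologik-enabled container can need 30-60 s on first start, so the
    default deadline leaves some headroom.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```

Call wait_for_es() before launching scripts/index_data.py to avoid the connection-refused race.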
