
RAG SYSTEM - README

Author: Roberto Brunialti

Copyright: (c) 2025


RAG-system is a Python package implementing a self-contained RAG system, usable for a variety of purposes: document retrieval, LLM context feeding, etc. It is quite simple to use and integrate into your own applications. It uses some state-of-the-art algorithms and is easy to tweak in order to add new ones. A simple but powerful UI app is also provided, to let you fine-tune the parameters on your real document store.

WARNING: This code is a Beta release. It may or may not fully meet your needs. Use it "as is" at your own risk. Consider rag_system a workbench to better understand RAG systems and to practice with. No extensive stress or scalability tests have been performed; nevertheless, it should work effectively with stores of up to many thousands of chunks at least.

The document ingestion module manages documents of the following types: .txt, .doc (with some limitations), .docx, and .pdf. It can also process Excel files (.xls and .xlsx) with some dimensional and format constraints. Other document types can be added by modifying document_manager.py, provided that your tools can extract plain text from the desired format.


1) GENERAL DESCRIPTION

This program implements a Retrieval-Augmented Generation (RAG) system that indexes a document collection using multiple retrieval methods: a dense approach (FAISS), a sparse approach (BM25), and a persistent TF-IDF index for multi-representation retrieval. The system is designed to efficiently handle very heterogeneous documents—from very short to very long—and return relevant results for queries, optionally enriched by a cross-encoder re-ranking phase. The system returns raw text chunks that could be used as “context” for a user-defined LLM (integration with an LLM is out of scope).

In practice:

  • Long documents are split into coherent chunks using strategies based on sentence boundaries and word counts. The splitting is delegated to the dedicated module chunker.py, while the embedding module (embedding_manager.py) computes dense representations on these chunks (a chunking sketch follows this list).
  • A dense index is built using FAISS (saved as index.faiss), and a BM25 sparse index is built (stored in metadata.pkl) that aggregates full document text.
  • A persistent TF-IDF index is maintained and saved (in tfidf.pkl) to support multi-representation retrieval.
  • When a query is made, the user can select:
    • The search strategy ("faiss", "bm25", "ibrido", "multi", or "rrf"). A naive strategy auto-select option is also provided.
    • The retrieval mode ("chunk" or "document"). Note that for some strategies (e.g. BM25) only document-level retrieval is supported.
    • The preferred cross-encoder (ms-marco-MiniLM-L-6-v2, jina-reranker-v2-base-multilingual, minilm-l6-h384-italian-cross-encoder) to refine the rank of the retrieved documents. Other cross-encoders can be easily added to the package.
  • Optionally, a cross-encoder re-ranks the top candidates, applying min–max normalization to ensure consistency across signals.
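
For illustration, here is a minimal sentence-boundary chunker in the spirit of chunker.py (the function name and parameters are hypothetical, not the module's actual API): whole sentences are packed into chunks up to a word budget, with a one-sentence overlap between consecutive chunks.

```python
# Hypothetical sentence-boundary chunker; not the actual chunker.py API.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

def chunk_text(text: str, max_words: int = 200, overlap_sentences: int = 1) -> list[str]:
    sentences = sent_tokenize(text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            # carry the last sentence(s) over so consecutive chunks overlap
            current = current[-overlap_sentences:]
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```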

New Functionality:

  • New Cross Encoder Options: The system now supports multiple cross-encoder models for re-ranking:
    • cross-encoder/ms-marco-MiniLM-L-6-v2: The standard cross encoder fine-tuned on the MS MARCO passage ranking dataset, optimized for English re-ranking.
    • jinaai/jina-reranker-v2-base-multilingual: A state-of-the-art multilingual cross encoder fine-tuned on diverse query–document pairs. This model is optimized for cross-lingual re-ranking and supports multiple languages beyond English. Note: This model requires setting trust_remote_code=True.
    • osiria/minilm-l6-h384-italian-cross-encoder: An Italian-specific cross encoder based on the MiniLM-L6 architecture with a hidden size of 384. Fully trained on Italian text, it is designed to optimize ranking performance on Italian documents.
  • Enhanced Re-ranking Options: Users can now select, through the graphical interface, the cross encoder that best fits their use case (for example, choosing the multilingual model for queries in different languages, or the Italian model for Italian documents).
  • Optional Knee Detection & MMR Re-ranking: As before, the system includes a knee detection algorithm to automatically determine a cutoff threshold when fusing multiple retrieval signals and an MMR-based re-ranking mechanism to balance relevance with diversity.
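
As a reference point, the sketch below shows the standard MMR formulation, score = λ·rel(d) − (1 − λ)·max_sim(d, selected); the actual implementation in search_engine.py may differ in its details.

```python
# Standard MMR re-ranking sketch (illustrative, not necessarily identical
# to search_engine.py). q_sims holds query-candidate similarities;
# cand_embs are L2-normalized candidate embeddings.
import numpy as np

def mmr(q_sims: np.ndarray, cand_embs: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    selected: list[int] = []
    remaining = list(range(len(q_sims)))
    while remaining and len(selected) < k:
        if selected:
            # redundancy = max similarity to anything already selected
            red = cand_embs[remaining] @ cand_embs[selected].T
            scores = lam * q_sims[remaining] - (1 - lam) * red.max(axis=1)
        else:
            scores = q_sims[remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected  # indices in selection order, balancing relevance and diversity
```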

2) DESIGN AND TECHNICAL CHOICES

Dual/Multi Index:

  • FAISS: Captures semantic similarity via dense embeddings (using SentenceTransformer).
  • BM25: A classical sparse retrieval method using a bag-of-words approach (after tokenization with NLTK).
  • Persistent TF-IDF: Built using scikit-learn’s TfidfVectorizer over aggregated document text; stored to disk to avoid recalculation.
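
A minimal sketch of such a persistent TF-IDF index, assuming a simple pickle layout (the actual contents of tfidf.pkl may differ):

```python
# Sketch of a persistent TF-IDF index; the real tfidf.pkl layout may differ.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first aggregated document text ...", "second aggregated document text ..."]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)       # (n_docs, n_terms) sparse matrix

with open("tfidf.pkl", "wb") as f:            # persist to avoid recalculation
    pickle.dump({"vectorizer": vectorizer, "matrix": matrix}, f)

with open("tfidf.pkl", "rb") as f:            # reload and score a query later
    store = pickle.load(f)
# rows are L2-normalized by default, so this dot product is cosine similarity
scores = (store["vectorizer"].transform(["my query"]) @ store["matrix"].T).toarray()[0]
```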

Retrieval Strategies:

  • faiss: Performs dense search via FAISS. Returns individual chunks or aggregated documents based on the retrieval mode.
  • bm25: Retrieves documents using BM25. (Only document-level retrieval is available.)
  • ibrido (hybrid): Combines FAISS and BM25 scores via a linear weighted sum. In "chunk" mode, returns the best chunk per document; in "document" mode, aggregates all chunks.
  • multi: Extends hybrid by fusing FAISS, BM25, and TF-IDF signals (with normalization) for robust ranking.
  • rrf: Applies Reciprocal Rank Fusion (RRF) to combine rankings from FAISS, BM25, and TF-IDF. RRF assigns each document a score of 1/(k+rank) and sums these scores across rankings. (See Fox & Shaw, 1994; Cormack et al., 2009.)
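
The RRF scoring described above is compact enough to show in full; this sketch fuses any number of ranked lists using the customary k = 60 constant:

```python
# Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) per document.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# e.g., fuse the FAISS, BM25 and TF-IDF result lists
fused = rrf([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2"]])
```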

Retrieval Mode & Knee Detection:

  • Depending on the selected strategy, the retrieval mode can be "chunk" or "document". For example, BM25 forces document-level retrieval.
  • An optional knee detection algorithm is provided to automatically determine a cutoff threshold when fusing signals.
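
As an illustration, one common knee heuristic (not necessarily the one implemented here) sorts the fused scores and cuts at the point farthest from the straight line joining the first and last score:

```python
# Illustrative knee heuristic: maximum distance from the chord of the
# sorted score curve. The system's actual algorithm may differ.
import numpy as np

def knee_cutoff(scores: list[float]) -> int:
    y = np.sort(np.asarray(scores, dtype=float))[::-1]   # descending scores
    n = len(y)
    if n < 3:
        return n
    x = np.arange(n, dtype=float)
    # distance of each point from the line through (0, y[0]) and (n-1, y[-1])
    num = np.abs((y[-1] - y[0]) * x - (n - 1) * (y - y[0]))
    den = np.hypot(y[-1] - y[0], n - 1)
    return int(np.argmax(num / den)) + 1                 # number of items to keep
```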

Re-ranking:

  • An optional cross-encoder re-ranking phase is applied to the top candidates.
  • With the new cross encoder options available, users have more flexibility in choosing the re-ranking model best suited for their language or domain requirements.
  • Detailed debugging information and min–max normalization are applied to ensure consistency across re-ranking scores.
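
A minimal re-ranking sketch using the CrossEncoder class from sentence-transformers with one of the models listed above, followed by the min–max normalization step (function name and defaults are illustrative):

```python
# Cross-encoder re-ranking with min-max normalized scores (illustrative).
from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str],
           model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2") -> list[tuple[str, float]]:
    # NOTE: jinaai/jina-reranker-v2-base-multilingual requires trust_remote_code=True
    model = CrossEncoder(model_name)
    raw = model.predict([(query, p) for p in passages])
    lo, hi = float(min(raw)), float(max(raw))
    norm = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in raw]
    return sorted(zip(passages, norm), key=lambda item: item[1], reverse=True)
```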

3) MODULE STRUCTURE

your_app_root_directory/
├── rag_system/
│   ├── persistent_stores/         # Vector store directory (created automatically)
│   │   └── TEST/                  # Example Vector Store
│   │       ├── index.faiss        # FAISS index
│   │       ├── index_bm25.pkl     # BM25 index
│   │       ├── tfidf.pkl          # TF-IDF index
│   │       └── metadata.pkl       # Metadata index
│   ├── models/                    # Local models (installed by setup_models.py)
│   │   ├── all-distilroberta-v1/
│   │   ├── all-MiniLM-L6-v2/
│   │   ├── all-mpnet-base-v2/
│   │   ├── paraphrase-multilingual-MiniLM-L12-v2/
│   │   ├── cross-encoder/
│   │   │   └── ms-marco-MiniLM-L-6-v2/
│   │   ├── osiria/
│   │   │   └── minilm-l6-h384-italian-cross-encoder/
│   │   └── jinaai/
│   │       └── jina-reranker-v2-base-multilingual/
│   ├── __init__.py                # Initializes the rag_system package
│   ├── bm25_retriever.py          # BM25 retrieval logic
│   ├── chunker.py                 # Dynamic text chunking functions
│   ├── README.md                  # This file
│   ├── config.py                  # Global configuration (rag_system.json)
│   ├── document_manager.py        # Document loading and duplicate detection
│   ├── embedding_manager.py       # Computes dense embeddings (delegates chunking to chunker.py)
│   ├── search_engine.py           # Search strategies: faiss, bm25, ibrido, multi, rrf; includes 
│   │                              # cross-encoder re-ranking and optional knee detection
│   ├── utils.py                   # Helper functions (e.g., MD5 computation)
│   └── vector_storage.py          # Manages indices (FAISS, BM25, persistent TF-IDF) and metadata
├── UI_manager.py                  # Graphical UI for managing the system
├── UI_manager.json                # UI configuration file
├── setup_models.py                # Downloads local models from HuggingFace and configures the system
├── requirements.txt               # Required libraries
└── config.json                    # Configuration file for UI_manager.py
  

4) INSTALLATION

4.1.

Drop the rag_system directory "as-is" where you want to use it.

4.2.

Install required libraries:

pip install -r requirements.txt

4.3.

The models directory is created by setup_models.py (if it already exists, delete it first). Ensure setup_models.py sits one level above rag_system, then run:

python setup_models.py

Note: The persistent_stores directory will be created automatically when you create your first store.


5) HOW TO START

Once installed, the system is not fully functional until a vector store is created. To start:

  1. Launch the UI (python UI_manager.py)
  2. Create a vector store
  3. Insert some test documents

These operations can also be done via a Python script; see provided samples if available.


6) REFERENCES

Introductory papers:

  1. https://medium.com/@alexrodriguesj/retrieval-augmented-generation-rag-with-langchain-and-faiss-a3997f95b551
  2. https://div.beehiiv.com/p/advanced-rag-series-indexing

Additional references:

  1. Fox, E. A., & Shaw, J. A. (1994). Combination of Multiple Searches. Proceedings of TREC-2.
  2. Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. Proceedings of SIGIR 2009.

7) FILE DESCRIPTIONS

  • rag_system.json: Configuration file containing global parameters for the RAG system, including paths, embedding model key, chunk size, overlap, and other settings.
  • search_engine.py: Implements various search strategies (FAISS, BM25, hybrid, multi-representation, and RRF) and includes optional cross-encoder re-ranking and knee detection functionalities.
  • utils.py: Provides utility functions such as computing MD5 checksums for strings and files, used for duplicate detection and file integrity (see the sketch after this list).
  • vector_storage.py: Manages persistent storage and retrieval of vectors and metadata for FAISS, BM25, and TF-IDF indices. It handles adding, removing, and rebuilding indices.
  • config.py: Handles loading and saving of configuration settings from the rag_system.json file, and defines default values for various system parameters.
  • document_manager.py: Manages document ingestion, text extraction, normalization, and duplicate detection from various file formats (e.g., txt, pdf, docx, Excel).
  • embedding_manager.py: Generates dense and sparse embeddings using SentenceTransformer models, handles text chunking, and normalizes embeddings for similarity calculations.
  • UI_manager.py: Provides a graphical user interface built with Tkinter, allowing users to create and manage vector stores, add documents, perform queries, and modify system parameters.
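
For instance, the MD5 helpers in utils.py could look like the following (names are hypothetical; the point is streaming file hashing so large files never load fully into memory):

```python
# Hypothetical MD5 helpers of the kind utils.py provides for duplicate detection.
import hashlib

def md5_of_string(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)   # hash in 1 MiB blocks
    return h.hexdigest()
```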
