Home
- Theoretical Introduction
- Architecture: The Two Pipelines
- Project Structure
- Package preprocessing/ - Text Transformation
- Package storage/ - Storage Pipeline
- Package search/ - Search Pipeline
- Ranking Modes
- Entry Point: main.py
- Complete Execution Flow
- Usage Instructions
- Logging
- Testing
Information Retrieval (IR) is the discipline that studies how to retrieve relevant information from a collection of documents in response to a user's information need, expressed as a query.
A search engine is an IR system composed of two main flows:
- Storage Pipeline: prepares documents for search (collects them, cleans them, indexes them, and saves the index to disk).
- Search Pipeline: receives a query from the user, processes it, and compares it against the index to return the most relevant documents.
The central data structure of any search engine is the inverted index.
While a forward index maps document -> list of terms, the inverted index does the
opposite: it maps term -> list of documents that contain it.
Forward Index:
doc1 -> [cat, black, house]
doc2 -> [house, white]
doc3 -> [cat, white]

Inverted Index:
cat -> {doc1: 2, doc3: 1}
black -> {doc1: 1}
house -> {doc1: 1, doc2: 1}
white -> {doc2: 1, doc3: 1}
Each entry of the inverted index is called a posting list: for a given term, it contains the list of documents in which it appears along with the Term Frequency (TF), i.e., how many times the term occurs in that document. This allows answering a query in time proportional to the number of relevant documents, not the total number of documents in the collection.
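As a minimal illustration of why this matters, a toy AND query can be answered by intersecting the posting lists of its terms (the index below is the example above, not the project's real data):

```python
# Toy inverted index: term -> {doc_id: term frequency}
inverted_index = {
    "cat":   {"doc1": 2, "doc3": 1},
    "black": {"doc1": 1},
    "house": {"doc1": 1, "doc2": 1},
    "white": {"doc2": 1, "doc3": 1},
}

def and_query(terms, index):
    """Return the doc_ids that contain every term (boolean AND)."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

print(and_query(["cat", "white"], inverted_index))  # {'doc3'}
```

Only the posting lists of the queried terms are touched, never the documents themselves.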
Documents and queries cannot be compared as-is. A preprocessing phase is needed to transform raw text into a normalized and comparable form. The typical operations are:
- Tokenization: splitting the text into minimal units (tokens/words).
- Normalization: standardizing token form (e.g., lowercase conversion).
- Stop word removal: eliminating words that are too common to carry useful information (e.g., "the", "is", "il", "di").
- Stemming: reducing each word to its morphological root, so that inflected variants like "running" and "runs" collapse to the same stem "run".
In keyword search, matching is boolean: a document either matches or does not. With multiple terms, logical operators can be used:
- AND: the document must contain all the terms.
- OR: the document must contain at least one of the terms.
Even in a boolean model, results can be ranked by relevance.
Term Frequency (TF) -- The simplest method: a document's score is the sum of occurrences of all searched terms. A document where the term appears 7 times is considered more relevant than one where it appears once.
Score_TF(doc, query) = SUM( TF(term, doc) )
TF-IDF (Term Frequency - Inverse Document Frequency) -- An evolution of TF that accounts for the rarity of a term across the collection. The intuition is: a term that appears in few documents is more discriminating than one that appears everywhere.
IDF(term) = log( N / DF(term) )
Score_TFIDF(doc, query) = SUM( TF(term, doc) * IDF(term) )
Where:
- N = total number of documents in the collection
- DF(t) = number of documents containing term t (Document Frequency)
- IDF(t) = log(N/DF) -- high for rare terms, low for common terms
Example: in a collection of 15 documents, the term "planet" appears in 4 documents (DF=4, IDF=log(15/4)=1.32), while "mercury" appears in only 1 document (DF=1, IDF=log(15/1)=2.71). In TF-IDF ranking, a match on "mercury" is worth roughly twice a match on "planet" at equal TF.
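The arithmetic of this example can be reproduced in a few lines of Python (a sketch using the natural logarithm, as in the formula above; the TF values are invented for illustration):

```python
import math

N = 15                               # documents in the collection
df = {"planet": 4, "mercury": 1}     # document frequencies from the example
tf = {"planet": 3, "mercury": 3}     # hypothetical TFs in one document

def idf(term):
    return math.log(N / df[term])

print(round(idf("planet"), 2))    # 1.32
print(round(idf("mercury"), 2))   # 2.71
print(round(sum(tf[t] * idf(t) for t in tf), 2))  # TF-IDF score ≈ 12.09
```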
BM25 (Okapi BM25) -- The most widely used probabilistic ranking function in modern search engines. Compared to TF-IDF, BM25 introduces two fundamental improvements:
- TF Saturation: in TF-IDF, a term's contribution grows linearly with frequency (TF=10 is worth twice TF=5). BM25 introduces a saturation curve: beyond a certain point, additional occurrences of the term yield diminishing returns. This is controlled by the parameter k1 (default 1.5).
- Document Length Normalization: longer documents naturally tend to have higher TF. BM25 penalizes long documents and rewards short ones, relative to the average length of the collection. This is controlled by the parameter b (default 0.75).
IDF_BM25(t) = log( (N - DF(t) + 0.5) / (DF(t) + 0.5) + 1 )
Score_BM25(doc, query) = SUM(
IDF_BM25(t) * TF(t,d) * (k1 + 1) / (TF(t,d) + k1 * (1 - b + b * |d| / avgdl))
)
Where:
- |d| = length of document d (number of tokens after preprocessing)
- avgdl = average length of all documents in the collection
- k1 = 1.5 -- controls TF saturation. With k1=0, TF is ignored entirely (pure boolean); with very high k1, it behaves like TF-IDF.
- b = 0.75 -- controls length normalization. With b=0, length has no effect; with b=1, normalization is maximal.
The BM25 IDF formula is slightly different from the standard one: it adds +1 to avoid negative values when a term appears in more than half the documents.
These parameter values (k1=1.5, b=0.75) originate from Robertson et al. (1994, TREC-3) and are the de facto standard in IR literature, performing well across diverse collections without per-collection tuning.
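Putting the two improvements together, here is a sketch of the per-term BM25 contribution with the default parameters (the function name and sample values are illustrative, not the project's bm25.py):

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avgdl, k1=1.5, b=0.75):
    """BM25 contribution of one term to one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = 1 - b + b * doc_len / avgdl            # document length normalization
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Saturation: going from TF=1 to TF=10 gains far less than 10x.
for tf in (1, 5, 10):
    print(tf, round(bm25_term_score(tf, df=4, n_docs=15, doc_len=150, avgdl=161.6), 3))
```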
The Vector Space Model (VSM) represents both documents and queries as vectors in a high-dimensional term space. Each dimension corresponds to a term in the vocabulary, and the weight along each dimension is the term's TF-IDF value.
The relevance of a document to a query is measured by the cosine similarity between their vectors:
Score_VSM(doc, query) = cos(theta) = (q . d) / (|q| * |d|)
Where q is the TF-IDF vector of the query (TF=1 for each present term) and d
is the TF-IDF vector of the document.
Unlike the plain TF-IDF mode (which sums TF*IDF), the VSM normalizes by vector magnitude, rewarding direction over magnitude. A short but focused document can score higher than a long one that mentions the terms many times but also contains many unrelated terms.
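A minimal sketch of cosine similarity over sparse TF-IDF vectors stored as dictionaries (the weights are illustrative; the project's vsm.py may organize this differently):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query_vec = {"planet": 1.32, "sun": 0.90}                # TF=1 * IDF weights
doc_vec   = {"planet": 9.25, "sun": 1.80, "moon": 2.64}  # TF*IDF weights
print(round(cosine(query_vec, doc_vec), 3))
```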
While keyword-based methods (boolean, TF-IDF, BM25, VSM) rely on exact term matching, dense embeddings take an entirely different approach: both documents and queries are encoded into fixed-size real-valued vectors (embeddings) by a pre-trained neural language model. Similarity is then computed via cosine similarity between the query vector and each document vector.
This approach captures semantic similarity learned from large text corpora, without requiring exact term overlap. For example, a query for "planets in space" can match a document about "celestial bodies in the solar system" even if none of the query words appear literally in the document.
This engine uses the paraphrase-multilingual-MiniLM-L12-v2 model from sentence-transformers, producing 384-dimensional embeddings with multilingual support (50+ languages, including Italian and English).
Dense mode is fundamentally different from keyword modes:
- It does not use the inverted index, stop word removal, stemming, or thesaurus.
- It operates on the raw document text and query string.
- Embeddings are pre-computed for all documents and cached on disk.
- Results are filtered by a configurable cosine similarity threshold (default 0.0) and limited by top-k (default 10).
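The essence of dense retrieval fits in a few lines with the sentence-transformers library (a sketch, not the project's dense.py; the model is downloaded on first use):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # 384-dim

docs = ["The solar system contains eight planets.",
        "Pythons are large non-venomous snakes."]
doc_emb = model.encode(docs)                  # one 384-dim vector per document

query_emb = model.encode("celestial bodies in space")
scores = util.cos_sim(query_emb, doc_emb)[0]  # cosine similarity against each doc
print(scores)                                 # the astronomy document scores higher
```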
Polysemy is the phenomenon where a single word has multiple meanings. "Bank" can mean a financial institution or a river bank; "python" can be a programming language or a snake; "mercury" can be the planet, the Roman god, or the chemical element. Polysemy is a fundamental challenge in IR because the engine cannot distinguish between meanings.
Query expansion is a technique to improve recall (the fraction of relevant documents that are found): synonyms and related terms are added to the query via a thesaurus. When searching for "snake", the thesaurus adds "serpent" and "reptile", thereby finding documents that use these alternative words.
Note: query expansion increases recall but can lower precision (the fraction of found documents that are actually relevant). For this reason, the thesaurus is toggleable by the user in the engine.
==================== STORAGE PIPELINE ====================

.txt files (documents/)
  --> Data Collection (reading)
  --> Preprocessing (tokenize, normalize, stopwords, stemming)
  --> Indexing (inverted index with TF)
  --> Storage (save index to disk as JSON)

==================== SEARCH PIPELINE ====================

User Query ("planet mars")
  --> Parsing (extract terms + AND/OR)
  --> [Thesaurus] (synonym expansion, optional)
  --> Preprocessing (same pipeline as documents)
  --> Representation (structured dataclass)
  --> Comparison (match vs inverted index)
  --> Ranking (TF, TF-IDF, BM25, VSM, or Dense)
  --> Output (rich.Table formatted)
The fundamental principle is that the query must undergo the same preprocessing as the documents. If during indexing "planets" becomes the stem "planet", then the query "planets" must also become "planet" in order to match.
For dense mode, the pipeline is different: the raw query string is encoded directly by the sentence-transformers model and compared against pre-computed document embeddings via cosine similarity, bypassing the keyword pipeline entirely.
didactic_search_engine/
|
|-- documents/ 15 .txt files (5 IT, 5 EN, 5 polysemic)
|-- index/ Persistent index (inverted_index.json,
| embeddings.json)
|
|-- preprocessing/ Package: text transformation
| |-- __init__.py
| |-- tokenizer.py Stage 2: tokenization
| |-- normalizer.py Stage 3: normalization
| |-- stopword_remover.py Stage 4: stop word removal
| |-- stemmer.py Stage 5: stemming + language detection
| |-- pipeline.py Orchestrator for stages 2-5
| +-- thesaurus.py Bilingual query expansion with synonyms
|
|-- storage/ Package: storage pipeline
| |-- __init__.py
| |-- data_collection.py Stage 1: document collection from filesystem
| |-- indexing.py Stage 6: inverted index construction
| |-- persistence.py Stage 7: save/load JSON (index + embeddings)
| +-- pipeline.py Orchestrator (with rich spinner)
|
|-- search/ Package: search pipeline
| |-- __init__.py
| |-- query_parser.py Stage 1: query parsing
| |-- query_preprocessor.py Stage 2: preprocessing + thesaurus
| |-- query_representation.py Stage 3: internal representation
| |-- comparison.py Stage 4: query vs index comparison
| |-- ranking.py Stage 5: dispatcher to ranking modes
| |-- result_formatter.py Stage 6: formatting with rich.Table
| |-- pipeline.py Orchestrator (mode + thesaurus aware)
| +-- modes/ Sub-package: ranking algorithms
| |-- __init__.py Registry of available modes
| |-- boolean.py Boolean mode: ranking by pure TF
| |-- tfidf.py TF-IDF mode: ranking by TF-IDF
| |-- bm25.py BM25 mode: ranking by Okapi BM25
| |-- vsm.py VSM mode: Vector Space Model + cosine
| +-- dense.py Dense mode: sentence-transformers embeddings
|
|-- main.py Entry point: interactive CLI with rich
|-- logging_config.py Centralized logging configuration
|-- test.py Test suite (32 tests)
|-- requirements.txt Dependencies
|-- DOCUMENTAZIONE.md Documentation (Italian)
+-- documentation.md Documentation (English, this file)
This package implements stages 2-5 of the storage pipeline. It is shared: it is used both to preprocess documents (during indexing) and to preprocess queries (during search).
Theory: Tokenization is the first step of any NLP pipeline. It consists of splitting a continuous stream of characters into discrete units called tokens. In a search engine, each token typically corresponds to a word.
Implementation: Uses a regex [a-zA-ZÀ-ÿ0-9]+ that extracts sequences of alphanumeric characters, correctly handling the accented characters (è, à, ù) typical of Italian. The apostrophe is treated as a separator, so "l'intelligenza" becomes ["l", "intelligenza"].
Function: tokenize(text) -> list[str]
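A sketch of such a regex tokenizer, assuming the character class described above (details may differ from the project's tokenizer.py):

```python
import re

TOKEN_RE = re.compile(r"[a-zA-ZÀ-ÿ0-9]+")

def tokenize(text: str) -> list[str]:
    """Split text into alphanumeric tokens; apostrophes act as separators."""
    return TOKEN_RE.findall(text)

print(tokenize("L'intelligenza artificiale è qui"))
# ['L', 'intelligenza', 'artificiale', 'è', 'qui']
```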
Theory: Normalization reduces superficial variants of tokens to a canonical form. The most common case is lowercase conversion: without normalization, "Rome", "rome", and "ROME" would be three different tokens and a query for "rome" would not find a document containing "Rome".
Implementation: Converts to lowercase and discards single-character tokens (truncated articles like "l", "d", isolated letters), which carry no informational value in search.
Function: normalize(tokens) -> list[str]
Theory: Stop words are extremely frequent words in a language that carry no informational content: articles ("the", "il"), prepositions ("of", "di"), conjunctions ("and", "e"), auxiliary verbs ("is", "e'"). Removing them reduces the index size and improves precision: without this step, almost every document would contain "the" and "il", making these terms useless for distinguishing documents from each other.
Implementation: Loads stop word lists from NLTK for Italian and English (merging them into a single set) and filters the tokens. The combined IT+EN set correctly handles documents in both languages.
Functions: get_stopwords(languages) -> set, remove_stopwords(tokens) -> list[str]
Theory: Stemming reduces each word to its morphological root (stem). The goal is that inflected forms of the same word ("running", "runs") collapse to the same stem ("run"), increasing recall (the ability to find relevant documents even when they use different forms of the same word).
Several stemming algorithms exist. This project uses Snowball Stemmer (also known as Porter2), a rule-based suffix-stripping algorithm available in NLTK for many languages including Italian and English.
Stemming is an approximation: it does not produce real words (e.g., "intelligence" becomes "intellig"), but it is computationally lightweight and sufficient for matching. The more sophisticated alternative would be lemmatization, which produces real words ("intelligence" -> "intelligence") but requires full grammatical analysis.
Language detection: A heuristic function counts how many Italian vs English stop words appear in the text and chooses the language with more matches. This determines which stemmer to apply, since suffix-stripping rules are language-specific.
Functions: detect_language(text) -> str, stem(tokens, language) -> list[str]
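A minimal sketch of the stop-word-counting heuristic and Snowball stemming with NLTK (the names mirror the functions above, but the bodies are illustrative):

```python
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def detect_language(text: str) -> str:
    """Heuristic: the language whose stop words appear more often wins."""
    tokens = set(text.lower().split())
    it_hits = len(tokens & set(stopwords.words("italian")))
    en_hits = len(tokens & set(stopwords.words("english")))
    return "italian" if it_hits >= en_hits else "english"

def stem(tokens: list[str], language: str) -> list[str]:
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(t) for t in tokens]

print(stem(["running", "runs"], "english"))  # ['run', 'run']
```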
Theory: The thesaurus is a synonym dictionary used for query expansion. When the user searches for "car", the thesaurus adds "automobile" and "vehicle" to the query, increasing the probability of finding relevant documents that use different words for the same concept. Expansion happens BEFORE preprocessing, so the added synonyms pass through the same pipeline (tokenization, stemming, etc.).
Implementation: Bilingual dictionary (IT + EN) with approximately 40 entries organized by thematic area: transport, computing, space, science, nature, etc. Each term maps to a list of synonyms. "Polysemy bridges" connect ambiguous terms to their alternative meanings (e.g., "python" -> "serpent", "mercury" -> "quicksilver").
Function: expand_terms(terms, enabled) -> (expanded_terms, expansion_log)
The expansion_log records which expansions were applied, useful for showing
the user how the query was modified.
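A sketch of dictionary-based expansion; the entries shown are the examples mentioned above, not the full ~40-entry thesaurus:

```python
THESAURUS = {
    "snake":   ["serpent", "reptile"],
    "python":  ["serpent"],
    "mercury": ["quicksilver"],
}

def expand_terms(terms, enabled=True):
    """Return (expanded_terms, expansion_log); the original terms always come first."""
    if not enabled:
        return list(terms), []
    expanded, log = list(terms), []
    for term in terms:
        for synonym in THESAURUS.get(term.lower(), []):
            if synonym not in expanded:
                expanded.append(synonym)
                log.append((term, synonym))
    return expanded, log

print(expand_terms(["snake"]))
# (['snake', 'serpent', 'reptile'], [('snake', 'serpent'), ('snake', 'reptile')])
```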
Theory: In a real IR system, the preprocessing stages form a pipeline: the output of each stage becomes the input of the next. The order matters: normalization must precede stop word removal (because stop word lists are in lowercase), and stemming must be applied last (because it operates on already-cleaned words).
Implementation: Calls in sequence tokenize -> normalize -> remove_stopwords -> stem.
Also includes preprocess_with_details() for debugging, which shows the intermediate
output of each stage.
Functions: preprocess(text) -> list[str], preprocess_with_details(text) -> dict
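Conceptually, the whole chain can be sketched as follows (a simplified, self-contained version; the real pipeline.py delegates to the stage modules described above):

```python
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = set(stopwords.words("italian")) | set(stopwords.words("english"))

def preprocess(text: str, language: str = "english") -> list[str]:
    """Tokenize -> normalize -> remove stop words -> stem, in that order."""
    tokens = re.findall(r"[a-zA-ZÀ-ÿ0-9]+", text)        # stage 2: tokenization
    tokens = [t.lower() for t in tokens if len(t) > 1]    # stage 3: normalization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stage 4: stop word removal
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(t) for t in tokens]              # stage 5: stemming

print(preprocess("The planets orbit the Sun"))  # e.g. ['planet', 'orbit', 'sun']
```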
Theory: The first stage of any search engine is document collection (crawling/collection). In web search engines this happens via crawlers that navigate the web; in this educational project, it amounts to reading files from a local directory.
Implementation: Scans the documents/ directory, reads each .txt file
in alphabetical order, and returns a list of dictionaries, each with:
- doc_id: unique identifier derived from the filename (e.g., "01_solar_system")
- filename: original filename
- content: full text content
The 15 documents include 5 Italian texts, 5 English texts, and 5 polysemic texts that exploit word ambiguity (bank, python, mars, cell, mercury).
Function: collect_documents(directory_path) -> list[dict]
Theory: Indexing is the core of the storage pipeline. For each document, the text is preprocessed and then each term is inserted into the inverted index along with its Term Frequency (TF), i.e., how many times it appears in that document.
The resulting data structure is:
{
"stemmed_term": {
"doc_id_1": occurrence_count,
"doc_id_2": occurrence_count,
...
},
...
}

The function also returns doc_lengths, a dictionary {doc_id: num_tokens} with the length of each document (needed for BM25 computation).
Functions: build_inverted_index(documents) -> (dict, dict), get_index_stats(index) -> dict
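A sketch of the index construction step, using a trivial split in place of the real preprocess() (the actual indexing.py may differ):

```python
from collections import Counter

def build_inverted_index(documents):
    """Return (inverted_index, doc_lengths) from a list of {doc_id, content} dicts."""
    inverted_index, doc_lengths = {}, {}
    for doc in documents:
        tokens = doc["content"].lower().split()   # stand-in for the real preprocess()
        doc_lengths[doc["doc_id"]] = len(tokens)
        for term, tf in Counter(tokens).items():
            inverted_index.setdefault(term, {})[doc["doc_id"]] = tf
    return inverted_index, doc_lengths

docs = [{"doc_id": "doc1", "content": "cat black cat house"},
        {"doc_id": "doc2", "content": "house white"}]
index, lengths = build_inverted_index(docs)
print(index["cat"])   # {'doc1': 2}
print(lengths)        # {'doc1': 4, 'doc2': 2}
```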
Theory: In a real system, the index is not rebuilt on every startup: it is
serialized to disk and reloaded. JSON format was chosen for readability
(you can inspect the file index/inverted_index.json with any text editor).
In a production system, more efficient binary formats would be used.
The JSON index file contains three sections: metadata, inverted_index, and
doc_lengths (document lengths for BM25).
Embeddings for dense mode are stored separately in index/embeddings.json.
Functions: save_index(), load_index(), index_exists(), build_metadata(),
save_embeddings(), load_embeddings()
Combines all stages into a single executable pipeline with visual feedback
via rich.console.status() (animated spinner). Handles caching logic:
if the index already exists on disk and a forced rebuild has not been requested,
it loads it directly.
Also provides build_and_save_embeddings() for generating dense embeddings
on demand (called the first time dense mode is used, or via +rebuild dense).
Functions: run_storage_pipeline(docs_dir, index_path, force_rebuild) -> dict,
build_and_save_embeddings(documents) -> dict
Supported syntax:
- "cat black" -- AND implicit (default)
- "cat && black" -- AND explicit (symbol)
- "cat || black" -- OR explicit (symbol)
- "cat OR black" -- OR inline (keyword)
- "AND:cat black" -- AND with explicit prefix
- "OR:cat black" -- OR with explicit prefix
Function: parse_query(raw_input) -> dict
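A sketch of how such a parser could map raw input to terms plus an operator (illustrative only, not the project's query_parser.py):

```python
def parse_query(raw: str) -> dict:
    """Extract search terms and the boolean operator from the raw query string."""
    text, operator = raw.strip(), "AND"                    # implicit AND by default
    if text.upper().startswith(("AND:", "OR:")):
        prefix, text = text.split(":", 1)
        operator = prefix.upper()
    elif "||" in text or " OR " in text.upper():
        operator = "OR"
    terms = [t for t in text.replace("&&", " ").replace("||", " ").split()
             if t.upper() not in ("AND", "OR")]
    return {"terms": terms, "operator": operator}

print(parse_query("cat || black"))   # {'terms': ['cat', 'black'], 'operator': 'OR'}
print(parse_query("OR:cat black"))   # {'terms': ['cat', 'black'], 'operator': 'OR'}
```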
Theory: The query must undergo the same preprocessing as the documents. With the addition of the thesaurus, the preprocessor now has three phases:
- Query Expansion (optional): if the thesaurus is active, expands terms with synonyms BEFORE standard preprocessing.
- Standard Preprocessing: tokenization, normalization, stop word removal.
- Bilingual Stemming: for each term, produces stems in both languages (IT + EN) and groups the variants to handle the bilingual collection.
Function: preprocess_query(parsed_query, thesaurus_enabled) -> dict
Frozen dataclass QueryRepresentation with:
- original_query: the user's original string
- processed_terms: tuple of all unique stems (for OR)
- term_variants: tuple of frozensets, variants per term (for AND)
- operator: "AND" or "OR"
- detected_language: detected language
Function: build_query(preprocessed_query) -> QueryRepresentation
In the boolean model:
- AND with variants: for each original term, the system has a set of stem variants. A document matches a term if it contains at least one variant. Then the intersection across groups is computed.
- OR: union of all posting lists across all stems.
Function: compare(query, inverted_index) -> dict
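A sketch of AND matching with stem variants as described above (toy index; the real comparison.py works on the full inverted index):

```python
def and_match(term_variants, inverted_index):
    """A doc matches if, for every original term, it contains at least one stem variant."""
    per_term_docs = []
    for variants in term_variants:                       # one frozenset per original term
        docs = set()
        for stem in variants:
            docs |= set(inverted_index.get(stem, {}))    # union over the term's variants
        per_term_docs.append(docs)
    return set.intersection(*per_term_docs) if per_term_docs else set()

index = {"pianet": {"doc1": 3}, "planet": {"doc1": 2, "doc2": 1}, "mart": {"doc2": 4}}
print(and_match([frozenset({"planet", "pianet"}), frozenset({"mars", "mart"})], index))
# {'doc2'}
```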
Ranking has been separated into a dispatcher and independent modules.
The dispatcher ranking.py receives the mode parameter and routes the call
to the appropriate module in search/modes/.
For keyword modes (boolean, tfidf, bm25, vsm), it uses the standard interface:
rank_results(matches, inverted_index, num_documents, **kwargs) -> list[dict]
For dense mode, it uses a separate interface:
rank_results(query_string, doc_embeddings, documents, threshold, top_k) -> list[dict]
Function: rank_results(...) -> list[dict], get_available_modes() -> list[str]
Formats results into a colored rich.Table. Adapts to the mode:
- boolean: shows only TF per term
- tfidf: shows TF, DF, IDF, and TF*IDF contribution per term
- bm25: shows TF, DF, IDF, |d|, avgdl, and BM25 contribution per term
- vsm: shows TF, DF, IDF, TF-IDF doc/query values, and cosine score
- dense: shows cosine similarity, embedding dimensions, and model name
Two functions are provided:
- format_results() for keyword modes (receives a QueryRepresentation)
- format_results_dense() for dense mode (receives the raw query string)
Both print directly to the console.
Composes all stages and manages configurable parameters:
- mode: "boolean", "tfidf", "bm25", "vsm", or "dense" (ranking type)
- thesaurus_enabled: True/False (query expansion, keyword modes only)
- doc_lengths: {doc_id: num_tokens} (needed for bm25)
- doc_embeddings: {doc_id: [float, ...]} (needed for dense)
- threshold: minimum cosine similarity for dense (default 0.0)
- top_k: maximum number of results for dense (default 10, 0 = all)
- silent: if True, suppresses console output (used by benchmark)
For dense mode, the pipeline bypasses keyword stages (parsing, preprocessing, comparison) and passes the raw query string directly to the dense ranker.
Returns a dict with expansion_log, found, ranked, timing_ms, and warning.
Function: run_search_pipeline(...) -> dict
All ranking modes are located in the search/modes/ sub-package. Each keyword mode
implements the same interface:
rank_results(matches, inverted_index, num_documents, **kwargs) -> list[dict]
The mode registry in search/modes/__init__.py maintains the AVAILABLE_MODES
dictionary with the name and description of each mode.
Score(doc, query) = SUM( TF(term, doc) )
The simplest ranking. A document's score is the sum of term frequencies for all matched terms. Does not distinguish between rare and common terms.
IDF(t) = log( N / DF(t) )
Score(doc, query) = SUM( TF(t, doc) * IDF(t) )
Penalizes very common terms (low IDF) and rewards rare terms (high IDF).
Each result includes a details field with the per-term breakdown: TF, DF, IDF,
and TF*IDF contribution.
IDF_BM25(t) = log( (N - DF + 0.5) / (DF + 0.5) + 1 )
Score(doc, query) = SUM( IDF(t) * TF * (k1+1) / (TF + k1 * (1 - b + b * |d| / avgdl)) )
Parameters: K1 = 1.5, B = 0.75 (Robertson et al., 1994 - TREC-3).
The most advanced keyword ranking: saturates the TF contribution (diminishing returns)
and normalizes for document length. Requires doc_lengths as an additional parameter.
Each result includes a details field with: TF, DF, IDF, |d| (document length),
avgdl (average length), and BM25 contribution per term.
K1 controls TF saturation:
- k1 = 0 results in pure boolean behavior (TF is ignored)
- k1 = 1.5 is the standard value, good balance for most collections
- very high k1 approaches linear TF-IDF behavior (no saturation)
B controls document length normalization:
- b = 0 means length has no effect
- b = 0.75 is the standard value with moderate normalization
- b = 1 means full normalization relative to average document length
Score(doc, query) = cos(theta) = (q . d) / (|q| * |d|)
Where q is the query's TF-IDF vector (TF=1 for each present term, weighted by IDF)
and d is the document's TF-IDF vector.
Unlike the plain TF-IDF mode (which sums TF*IDF), VSM normalizes by vector magnitude, rewarding direction over magnitude. A short but focused document can score higher than a long one that mentions the terms many times but contains many other terms.
Results include query_vector, doc_vector, and terms_order fields for inspection,
as well as per-term details with TF, DF, IDF, and TF-IDF values for both document
and query.
Features an IDF cache keyed on (id(inverted_index), num_documents) to avoid
recomputing IDF values across successive searches when the index has not changed.
Cosine similarity scores are always in the range [0, 1] for non-negative TF-IDF vectors.
Dense mode uses sentence-transformers to encode documents and queries into 384-dimensional dense vectors and computes cosine similarity.
Model: paraphrase-multilingual-MiniLM-L12-v2
- 384 dimensions
- Multilingual: supports 50+ languages (including Italian and English)
- Pre-trained on paraphrase data for semantic similarity
Key characteristics:
- The model is lazy-loaded and thread-safe: it is loaded only on first use, with a double-checked locking pattern to avoid concurrent loading.
- Document embeddings are pre-computed via build_embeddings() and cached on disk in index/embeddings.json.
- Results are filtered by a threshold (minimum cosine similarity, default 0.0) and limited by top-k (maximum number of results, default 10, 0 = all).
- This mode does not use the inverted index, stop words, stemming, or thesaurus. It operates on the raw document text and query string.
Functions: build_embeddings(documents) -> dict, rank_results(query_string, doc_embeddings, documents, threshold, top_k) -> list[dict]
The main.py file is the program's entry point. It uses the rich module
for a colorful CLI with banner, tables, and spinners. It handles:
- Logging setup: calls setup_logging() from logging_config.py.
- Initial banner: rich.Panel with title and available modes.
- NLTK setup: automatically downloads required resources if not present.
- Storage Pipeline execution: loads the index from disk (with spinner) or builds it from scratch.
- Interactive CLI loop with mode-aware prompt:
  - [boolean|T] search> -- boolean mode with thesaurus active (default)
  - [tfidf] search> -- tfidf mode without thesaurus
  - [bm25|T] search> -- bm25 mode with thesaurus active
  - [vsm] search> -- vsm mode
  - [dense] search> -- dense mode
- Available commands (prefix +):
| Command | Description |
|---|---|
| +help | Show all available commands |
| +mode boolean\|tfidf\|bm25\|vsm\|dense | Change ranking algorithm |
| +thesaurus | Toggle query expansion on/off |
| +thesaurus on\|off | Explicitly enable/disable query expansion |
| +d<N> | Show the full content of document N from the last search |
| +d<N> <term> | Show document N with occurrences of <term> highlighted |
| +index | Show the complete inverted index |
| +index <term> | Filter the inverted index by term (e.g., +index planet) |
| +threshold <value> | Set minimum cosine threshold for dense (default 0.0, range -1..1) |
| +topk <N> | Set maximum results for dense (default 10, 0 = all) |
| +bench | Benchmark: 100 queries x all modes, table sorted by speed |
| +plot | 2D vector proximity chart (matplotlib, external window) |
| +plot3d | 3D vector proximity chart (matplotlib, rotatable) |
| +plotext | 2D vector proximity chart (in terminal, via plotext) |
| +info | Index statistics + current configuration |
| +rebuild | Rebuild the index from scratch |
| +rebuild dense | Rebuild/generate dense embeddings |
| +clear | Clear the screen |
| +exit / +quit | Exit the program |
- Search operators:

| Syntax | Description |
|---|---|
| cat black | AND implicit (default) |
| cat && black | AND explicit (symbol) |
| cat \|\| black | OR explicit (symbol) |
| cat OR black | OR inline (keyword) |
| OR:cat black sun | OR with prefix (all terms) |
| AND:cat black | AND with prefix |
[boolean] search> planet
Score = SUM(TF)
doc1: TF(planet)=7 -> score=7
doc2: TF(planet)=3 -> score=3
doc3: TF(planet)=1 -> score=1
[tfidf] search> planet
IDF(planet) = log(15/4) = 1.322
Score = TF * IDF
doc1: 7 * 1.322 = 9.254
doc2: 3 * 1.322 = 3.966
doc3: 1 * 1.322 = 1.322
[bm25] search> planet
IDF_BM25(planet) = log((15-4+0.5)/(4+0.5)+1) = 1.138
|d1|=141, |d2|=158, avgdl=161.6
doc1: IDF * 7*2.5 / (7 + 1.5*(1-0.75+0.75*141/161.6)) = 3.887
doc2: IDF * 1*2.5 / (1 + 1.5*(1-0.75+0.75*158/161.6)) = 1.875
[vsm] search> planet
Both query and document are represented as TF-IDF vectors.
Score = cosine similarity between vectors.
doc1: cos=0.65 (high TF for "planet", focused content)
doc2: cos=0.32 (lower TF, more diverse vocabulary)
[dense] search> planet
Query encoded by paraphrase-multilingual-MiniLM-L12-v2 (384-dim).
Score = cosine similarity between query embedding and document embedding.
Captures semantic meaning: finds documents about celestial bodies
even without exact keyword match.
Note how BM25 "compresses" score differences compared to TF-IDF: doc1 with TF=7 does not have a 7x score over doc2 with TF=1 (saturation).
Note how VSM normalizes by vector magnitude: a short, focused document can outrank a longer one with higher raw TF.
[boolean|T] search> snake
Thesaurus: snake -> serpent, reptile
The query becomes: ["snake", "serpent", "reptile"]
Finds documents that use "serpent" even if they do not contain "snake"
[tfidf] search> python
Finds both the document about Python programming and the one about
the python snake. TF-IDF ranks by relevance based on term frequency
and rarity in the collection.
[dense] search> python programming
Dense mode captures semantic meaning: the programming document scores
significantly higher because the embedding understands the semantic
association between "python" and "programming".
[bm25|T] search> serpente
Thesaurus: serpente -> snake, serpent, reptile
Cross-language expansion finds the English document about the python snake
even though the query is in Italian.
pip install -r requirements.txt

Dependencies:

- nltk>=3.8 -- natural language processing (tokenization, stemming, stop words)
- rich>=13.0 -- colored CLI output (tables, panels, spinners)
- matplotlib>=3.7 -- 2D/3D vector proximity plots
- scikit-learn>=1.3 -- PCA dimensionality reduction for plots
- sentence-transformers>=2.2 -- dense embeddings (for dense mode)
- plotext>=5.2 -- in-terminal 2D plots
python main.py

On first run, the system downloads the required NLTK resources and builds the index. Subsequent runs load the index from disk.
[boolean|T] search> planet # search "planet" (thesaurus ON by default)
[boolean|T] search> artificial intelligence # AND: both terms required
[boolean|T] search> planet || mars # OR: at least one term
[boolean|T] search> snake && python # AND explicit
[boolean|T] search> +thesaurus # toggle -> OFF
[boolean] search> python OR learning # OR alternative without expansion
[boolean] search> +mode tfidf # switch to TF-IDF
[tfidf] search> planet # same query, different ranking
[tfidf] search> +mode bm25 # switch to BM25
[bm25] search> planet # BM25: saturates TF, normalizes length
[bm25] search> +thesaurus # toggle -> ON
[bm25|T] search> serpente # cross-language: finds python_the_snake!
[bm25|T] search> +mode vsm # switch to VSM
[vsm|T] search> planet sun # cosine similarity between vectors
[vsm|T] search> +mode dense # switch to Dense (loads model on first use)
[dense] search> planets in our solar system # semantic search, no keyword matching
[dense] search> +threshold 0.3 # set minimum cosine threshold
[dense] search> +topk 5 # limit to top 5 results
[dense] search> +d1 # show full text of result #1
[dense] search> +d1 planet # show result #1 with "planet" highlighted
[dense] search> +info # show stats + configuration
[dense] search> +bench # benchmark all modes (100 queries each)
[dense] search> +plot # 2D TF-IDF vector proximity chart
[dense] search> +plot3d # 3D vector proximity chart (rotatable)
[dense] search> +plotext # 2D chart in terminal
[dense] search> +index planet # inspect inverted index for "planet"
[dense] search> +rebuild # rebuild inverted index from scratch
[dense] search> +rebuild dense # regenerate dense embeddings
[dense] search> +quit # exit
The collection includes 5 documents designed to demonstrate polysemy:
- 11_banks_and_rivers.txt -- "bank" as river bank (EN)
- 12_python_the_snake.txt -- "python" as a snake (EN)
- 13_mars_mythology.txt -- "mars" as Roman god + Mars candy bar (IT)
- 14_cells_biology_prison.txt -- "cell" in biology vs prison (EN)
- 15_mercury_element.txt -- "mercury" as chemical element + god (IT)
Try searching for "python", "mars", "mercury" to see how the engine handles terms with multiple meanings.
The engine provides three visualization commands that plot all 15 documents and the query as points in a reduced vector space (via PCA on TF-IDF vectors):
- +plot: Opens a 2D matplotlib chart in an external window. Found documents are shown in green, unfound documents in gray, and the query as a red star. Dashed lines connect the query to found documents.
- +plot3d: Same as above but in 3D (rotatable).
- +plotext: Renders a 2D chart directly in the terminal using plotext.
These visualizations help understand how documents cluster in vector space and why certain documents are closer to the query than others.
The +bench command runs 100 queries (90 deterministic + 10 random) across all
available modes and displays a table sorted by average retrieval time.
It measures both the pure retrieval time (comparison + ranking) and the full
pipeline time (including parsing, preprocessing, and formatting).
If dense embeddings are not available, the dense mode is excluded from the benchmark
with a notification to run +rebuild dense first.
The project uses Python's standard logging module, configured centrally
in logging_config.py.
The setup_logging() function configures the root logger with two handlers:
| Handler | Level | Destination | Purpose |
|---|---|---|---|
| FileHandler | DEBUG | engine.log | Captures everything for post-mortem analysis |
| StreamHandler | WARNING | console (stderr) | Only warnings and errors (Rich handles normal UI) |
Log format:
- File: 2024-01-15 14:30:22 [INFO ] search.pipeline: Ricerca completata: 3 risultati in 1.23 ms
- Console: [WARNING] message
Every module that needs logging follows the standard pattern:
import logging
logger = logging.getLogger(__name__)
logger.info("Index loaded with %d terms", num_terms)
logger.debug("Query preprocessing: %s -> %s", raw, processed)
logger.warning("Embeddings not available")

Using __name__ ensures that log messages include the fully qualified module name
(e.g., search.pipeline, storage.indexing, search.modes.dense), making it
easy to trace where each message originated.
The engine.log file is created in the project root directory and contains
all messages at DEBUG level and above. It is useful for:
- Debugging preprocessing issues (which stems were generated)
- Tracking search performance (timing data)
- Diagnosing model loading issues (sentence-transformers)
- Understanding the storage pipeline flow
The file handler uses UTF-8 encoding to correctly handle Italian text in log messages.
setup_logging() checks for existing handlers before adding new ones, ensuring
it can be called multiple times without creating duplicate handlers.
The test suite is located in test.py and contains 32 tests covering all
components of the project.
python test.py

The test runner produces compact output with one line per test:
MOTORE DI RICERCA - TEST (32 test)
[01/32] PASS test_tokenize
[02/32] PASS test_normalize
...
[23/32] SKIP test_dense_build_embeddings (sentence-transformers non installato)
[24/32] SKIP test_dense_ranking (sentence-transformers non installato)
...
[32/32] PASS test_storage_pipeline_completa
RISULTATO: 29/32 passati, 0 falliti, 3 saltati
Log scritto in test.log
Failed tests show the full traceback inline. A test.log file is generated
with the same format as the console output.
| Tests | Category | Description |
|---|---|---|
| 01-09 | Preprocessing | tokenize, normalize, get_stopwords, remove_stopwords, detect_language, stem, preprocess, preprocess_with_details, thesaurus_expand |
| 10-13 | Storage | collect_documents, build_inverted_index, persistence (index), persistence (embeddings) |
| 14-18 | Search Core | parse_query, query_preprocessor, query_representation, compare |
| 19-22 | Ranking Modes | boolean_ranking, tfidf_ranking, bm25_ranking, vsm_ranking |
| 23-24 | Dense Embeddings | dense_build_embeddings, dense_ranking |
| 25-28 | Dispatcher/Registry | ranking_dispatcher_tutti, modes_registry_completo, format_results_vsm, format_results_dense |
| 29-31 | Integration | search_pipeline_vsm, search_pipeline_dense, search_pipeline_keyword |
| 31-32 | End-to-end | plot_data_builder, storage_pipeline_completa |
Tests that require sentence-transformers (tests 23-24 and 30) are automatically
marked as SKIP if the library is not installed, rather than failing. This allows
the test suite to pass on environments without GPU or without the heavy
sentence-transformers dependency.
The test suite verifies:
- Preprocessing correctness: tokenization handles accented characters, normalization removes single-character tokens, stop words are removed for both IT and EN, stemming produces expected stems, language detection works for both languages.
- Thesaurus expansion: "snake" expands to include "serpente" (cross-language bridge), disabled thesaurus produces no expansion.
- Storage integrity: 15 documents are collected in alphabetical order, inverted index is non-empty, doc_lengths are positive, persistence round-trips correctly (save then load).
- Query parsing: all syntax variants (AND/OR implicit, explicit, prefix, keyword) are correctly parsed.
- Ranking correctness: results are sorted by score descending, IDF values follow the expected formula, BM25 exhibits TF saturation (TF=10 vs TF=1 ratio is less than 10), VSM cosine scores are in [0, 1].
- Dense embeddings: embeddings are 384-dimensional, semantic search ranks "planets in space" with the solar system document first, threshold and top-k filtering work correctly.
- Mode registry: all 5 modes (boolean, tfidf, bm25, vsm, dense) are registered, unknown modes raise ValueError.
- Integration: full pipeline produces results for all modes, thesaurus cross-language search works end-to-end, empty queries return found=False.
- Plot data: PCA reduction produces correct shapes (16 points = 15 docs + 1 query), query is labeled "QUERY", found documents are correctly identified.
The test runner exits with code 0 if all tests pass (or are skipped), and code 1 if any test fails, making it suitable for CI/CD integration.