Home
- Theoretical Introduction
- Architecture: The Two Pipelines
- Project Structure
- Package preprocessing/ - Text Transformation
- Package storage/ - Storage Pipeline
- Package search/ - Search Pipeline
- Ranking Modes
- Entry Point: main.py
- Complete Execution Flow
- Usage Instructions
- Logging
- Testing
Information Retrieval (IR) is the discipline that studies how to retrieve relevant information from a collection of documents in response to a user's information need, expressed as a query.
A search engine is an IR system composed of two main flows:
- Storage Pipeline: prepares documents for search (collects them, cleans them, indexes them, and saves the index to disk).
- Search Pipeline: receives a query from the user, processes it, and compares it against the index to return the most relevant documents.
The central data structure of any search engine is the inverted index.
While a forward index maps document -> list of terms, the inverted index does the
opposite: it maps term -> list of documents that contain it.
Forward Index:
doc1 -> [cat, black, house]
doc2 -> [house, white]
doc3 -> [cat, white]

Inverted Index:
cat -> {doc1: 2, doc3: 1}
black -> {doc1: 1}
house -> {doc1: 1, doc2: 1}
white -> {doc2: 1, doc3: 1}
Each entry of the inverted index is called a posting list: for a given term, it contains the list of documents in which it appears along with the Term Frequency (TF), i.e., how many times the term occurs in that document. This allows answering a query in time proportional to the number of relevant documents, not the total number of documents in the collection.
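As a minimal illustration of why this matters, a toy AND query can be answered by intersecting the posting lists of its terms (the index below is the example above, not the project's real data):

```python
# Toy inverted index: term -> {doc_id: term frequency}
inverted_index = {
    "cat":   {"doc1": 2, "doc3": 1},
    "black": {"doc1": 1},
    "house": {"doc1": 1, "doc2": 1},
    "white": {"doc2": 1, "doc3": 1},
}

def and_query(terms, index):
    """Return the doc_ids that contain every term (boolean AND)."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

print(and_query(["cat", "white"], inverted_index))  # {'doc3'}
```

Only the posting lists of the queried terms are touched, never the documents themselves.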
Documents and queries cannot be compared as-is. A preprocessing phase is needed to transform raw text into a normalized and comparable form. The typical operations are:
- Tokenization: splitting the text into minimal units (tokens/words).
- Normalization: standardizing token form (e.g., lowercase conversion).
- Stop word removal: eliminating words that are too common to carry useful information (e.g., "the", "is", "il", "di").
- Stemming: reducing each word to its morphological root, so that inflected variants like "running" and "runs" collapse to the same stem "run".
In keyword search, matching is boolean: a document either matches or does not. With multiple terms, logical operators can be used:
- AND: the document must contain all the terms.
- OR: the document must contain at least one of the terms.
Even in a boolean model, results can be ranked by relevance.
Term Frequency (TF) -- The simplest method: a document's score is the sum of occurrences of all searched terms. A document where the term appears 7 times is considered more relevant than one where it appears once.
Score_TF(doc, query) = SUM( TF(term, doc) )
TF-IDF (Term Frequency - Inverse Document Frequency) -- An evolution of TF that accounts for the rarity of a term across the collection. The intuition is: a term that appears in few documents is more discriminating than one that appears everywhere.
IDF(term) = log( N / DF(term) )
Score_TFIDF(doc, query) = SUM( TF(term, doc) * IDF(term) )
Where:
- N = total number of documents in the collection
- DF(t) = number of documents containing term t (Document Frequency)
- IDF(t) = log(N/DF) -- high for rare terms, low for common terms
Example: in a collection of 15 documents, the term "planet" appears in 4 documents (DF=4, IDF=log(15/4)=1.32), while "mercury" appears in only 1 document (DF=1, IDF=log(15/1)=2.71). In TF-IDF ranking, a match on "mercury" is worth roughly twice a match on "planet" at equal TF.
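The arithmetic of this example can be reproduced in a few lines of Python (a sketch using the natural logarithm, as in the formula above; the TF values are invented for illustration):

```python
import math

N = 15                               # documents in the collection
df = {"planet": 4, "mercury": 1}     # document frequencies from the example
tf = {"planet": 3, "mercury": 3}     # hypothetical TFs in one document

def idf(term):
    return math.log(N / df[term])

print(round(idf("planet"), 2))    # 1.32
print(round(idf("mercury"), 2))   # 2.71
print(round(sum(tf[t] * idf(t) for t in tf), 2))  # TF-IDF score ≈ 12.09
```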
BM25 (Okapi BM25) -- The most widely used probabilistic ranking function in modern search engines. Compared to TF-IDF, BM25 introduces two fundamental improvements:
- TF Saturation: in TF-IDF, a term's contribution grows linearly with frequency (TF=10 is worth twice TF=5). BM25 introduces a saturation curve: beyond a certain point, additional occurrences of the term yield diminishing returns. This is controlled by the parameter k1 (default 1.5).
- Document Length Normalization: longer documents naturally tend to have higher TF. BM25 penalizes long documents and rewards short ones, relative to the average length of the collection. This is controlled by the parameter b (default 0.75).
IDF_BM25(t) = log( (N - DF(t) + 0.5) / (DF(t) + 0.5) + 1 )
Score_BM25(doc, query) = SUM(
IDF_BM25(t) * TF(t,d) * (k1 + 1) / (TF(t,d) + k1 * (1 - b + b * |d| / avgdl))
)
Where:
- |d| = length of document d (number of tokens after preprocessing)
- avgdl = average length of all documents in the collection
- k1 = 1.5 -- controls TF saturation. With k1=0, TF is ignored entirely (pure boolean); with very high k1, it behaves like TF-IDF.
- b = 0.75 -- controls length normalization. With b=0, length has no effect; with b=1, normalization is maximal.
The BM25 IDF formula is slightly different from the standard one: it adds +1 to avoid negative values when a term appears in more than half the documents.
These parameter values (k1=1.5, b=0.75) originate from Robertson et al. (1994, TREC-3) and are the de facto standard in IR literature, performing well across diverse collections without per-collection tuning.
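Putting the two improvements together, here is a sketch of the per-term BM25 contribution with the default parameters (the function name and sample values are illustrative, not the project's bm25.py):

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avgdl, k1=1.5, b=0.75):
    """BM25 contribution of one term to one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = 1 - b + b * doc_len / avgdl            # document length normalization
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Saturation: going from TF=1 to TF=10 gains far less than 10x.
for tf in (1, 5, 10):
    print(tf, round(bm25_term_score(tf, df=4, n_docs=15, doc_len=150, avgdl=161.6), 3))
```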
The Vector Space Model (VSM) represents both documents and queries as vectors in a high-dimensional term space. Each dimension corresponds to a term in the vocabulary, and the weight along each dimension is the term's TF-IDF value.
The relevance of a document to a query is measured by the cosine similarity between their vectors:
Score_VSM(doc, query) = cos(theta) = (q . d) / (|q| * |d|)
Where q is the TF-IDF vector of the query (TF=1 for each present term) and d
is the TF-IDF vector of the document.
Unlike the plain TF-IDF mode (which sums TF*IDF), the VSM normalizes by vector magnitude, rewarding direction over magnitude. A short but focused document can score higher than a long one that mentions the terms many times but also contains many unrelated terms.
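A minimal sketch of cosine similarity over sparse TF-IDF vectors stored as dictionaries (the weights are illustrative; the project's vsm.py may organize this differently):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query_vec = {"planet": 1.32, "sun": 0.90}                # TF=1 * IDF weights
doc_vec   = {"planet": 9.25, "sun": 1.80, "moon": 2.64}  # TF*IDF weights
print(round(cosine(query_vec, doc_vec), 3))
```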
While keyword-based methods (boolean, TF-IDF, BM25, VSM) rely on exact term matching, dense embeddings take an entirely different approach: both documents and queries are encoded into fixed-size real-valued vectors (embeddings) by a pre-trained neural language model. Similarity is then computed via cosine similarity between the query vector and each document vector.
This approach captures semantic similarity learned from large text corpora, without requiring exact term overlap. For example, a query for "planets in space" can match a document about "celestial bodies in the solar system" even if none of the query words appear literally in the document.
This engine uses the paraphrase-multilingual-MiniLM-L12-v2 model from sentence-transformers, producing 384-dimensional embeddings with multilingual support (50+ languages, including Italian and English).
Dense mode is fundamentally different from keyword modes:
- It does not use the inverted index, stop word removal, stemming, or thesaurus.
- It operates on the raw document text and query string.
- Embeddings are pre-computed for all documents and cached on disk.
- Results are filtered by a configurable cosine similarity threshold (default 0.0) and limited by top-k (default 10).
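The essence of dense retrieval fits in a few lines with the sentence-transformers library (a sketch, not the project's dense.py; the model is downloaded on first use):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # 384-dim

docs = ["The solar system contains eight planets.",
        "Pythons are large non-venomous snakes."]
doc_emb = model.encode(docs)                  # one 384-dim vector per document

query_emb = model.encode("celestial bodies in space")
scores = util.cos_sim(query_emb, doc_emb)[0]  # cosine similarity against each doc
print(scores)                                 # the astronomy document scores higher
```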
Polysemy is the phenomenon where a single word has multiple meanings. "Bank" can mean a financial institution or a river bank; "python" can be a programming language or a snake; "mercury" can be the planet, the Roman god, or the chemical element. Polysemy is a fundamental challenge in IR because the engine cannot distinguish between meanings.
Query expansion is a technique to improve recall (the fraction of relevant documents that are found): synonyms and related terms are added to the query via a thesaurus. When searching for "snake", the thesaurus adds "serpent" and "reptile", thereby finding documents that use these alternative words.
Note: query expansion increases recall but can lower precision (the fraction of found documents that are actually relevant). For this reason, the thesaurus is toggleable by the user in the engine.
==================== STORAGE PIPELINE ====================

.txt files (documents/)
  --> Data Collection (reading)
  --> Preprocessing (tokenize, normalize, stopwords, stemming)
  --> Indexing (inverted index with TF)
  --> Storage (save index to disk as JSON)

==================== SEARCH PIPELINE ====================

User Query ("planet mars")
  --> Parsing (extract terms + AND/OR)
  --> [Thesaurus] (synonym expansion, optional)
  --> Preprocessing (same pipeline as documents)
  --> Representation (structured dataclass)
  --> Comparison (match vs inverted index)
  --> Ranking (TF, TF-IDF, BM25, VSM, or Dense)
  --> Output (rich.Table formatted)
The fundamental principle is that the query must undergo the same preprocessing as the documents. If during indexing "planets" becomes the stem "planet", then the query "planets" must also become "planet" in order to match.
For dense mode, the pipeline is different: the raw query string is encoded directly by the sentence-transformers model and compared against pre-computed document embeddings via cosine similarity, bypassing the keyword pipeline entirely.
didactic_search_engine/
|
|-- documents/ 15 .txt files (5 IT, 5 EN, 5 polysemic)
|-- index/ Persistent index (inverted_index.json,
| embeddings.json)
|
|-- preprocessing/ Package: text transformation
| |-- __init__.py
| |-- tokenizer.py Stage 2: tokenization
| |-- normalizer.py Stage 3: normalization
| |-- stopword_remover.py Stage 4: stop word removal
| |-- stemmer.py Stage 5: stemming + language detection
| |-- pipeline.py Orchestrator for stages 2-5
| +-- thesaurus.py Bilingual query expansion with synonyms
|
|-- storage/ Package: storage pipeline
| |-- __init__.py
| |-- data_collection.py Stage 1: document collection from filesystem
| |-- indexing.py Stage 6: inverted index construction
| |-- persistence.py Stage 7: save/load JSON (index + embeddings)
| +-- pipeline.py Orchestrator (with rich spinner)
|
|-- search/ Package: search pipeline
| |-- __init__.py
| |-- query_parser.py Stage 1: query parsing
| |-- query_preprocessor.py Stage 2: preprocessing + thesaurus
| |-- query_representation.py Stage 3: internal representation
| |-- comparison.py Stage 4: query vs index comparison
| |-- ranking.py Stage 5: dispatcher to ranking modes
| |-- result_formatter.py Stage 6: formatting with rich.Table
| |-- pipeline.py Orchestrator (mode + thesaurus aware)
| +-- modes/ Sub-package: ranking algorithms
| |-- __init__.py Registry of available modes
| |-- boolean.py Boolean mode: ranking by pure TF
| |-- tfidf.py TF-IDF mode: ranking by TF-IDF
| |-- bm25.py BM25 mode: ranking by Okapi BM25
| |-- vsm.py VSM mode: Vector Space Model + cosine
| +-- dense.py Dense mode: sentence-transformers embeddings
|
|-- main.py Entry point: interactive CLI with rich
|-- logging_config.py Centralized logging configuration
|-- test.py Test suite (32 tests)
|-- requirements.txt Dependencies
|-- DOCUMENTAZIONE.md Documentation (Italian)
+-- documentation.md Documentation (English, this file)
This package implements stages 2-5 of the storage pipeline. It is shared: it is used both to preprocess documents (during indexing) and to preprocess queries (during search).
Theory: Tokenization is the first step of any NLP pipeline. It consists of splitting a continuous stream of characters into discrete units called tokens. In a search engine, each token typically corresponds to a word.
Implementation: Uses a regex [a-zA-ZÀ-ÿ0-9]+ that extracts sequences of alphanumeric characters, correctly handling the accented characters (è, à, ù) typical of Italian. The apostrophe is treated as a separator, so "l'intelligenza" becomes ["l", "intelligenza"].
Function: tokenize(text) -> list[str]
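A sketch of such a regex tokenizer, assuming the character class described above (details may differ from the project's tokenizer.py):

```python
import re

TOKEN_RE = re.compile(r"[a-zA-ZÀ-ÿ0-9]+")

def tokenize(text: str) -> list[str]:
    """Split text into alphanumeric tokens; apostrophes act as separators."""
    return TOKEN_RE.findall(text)

print(tokenize("L'intelligenza artificiale è qui"))
# ['L', 'intelligenza', 'artificiale', 'è', 'qui']
```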
Theory: Normalization reduces superficial variants of tokens to a canonical form. The most common case is lowercase conversion: without normalization, "Rome", "rome", and "ROME" would be three different tokens and a query for "rome" would not find a document containing "Rome".
Implementation: Converts to lowercase and discards single-character tokens (truncated articles like "l", "d", isolated letters), which carry no informational value in search.
Function: normalize(tokens) -> list[str]
Theory: Stop words are extremely frequent words in a language that carry no informational content: articles ("the", "il"), prepositions ("of", "di"), conjunctions ("and", "e"), auxiliary verbs ("is", "e'"). Removing them reduces the index size and improves precision: without this step, almost every document would contain "the" and "il", making these terms useless for distinguishing documents from each other.
Implementation: Loads stop word lists from NLTK for Italian and English (merging them into a single set) and filters the tokens. The combined IT+EN set correctly handles documents in both languages.
Functions: get_stopwords(languages) -> set, remove_stopwords(tokens) -> list[str]
Theory: Stemming reduces each word to its morphological root (stem). The goal is that inflected forms of the same word ("running", "runs") collapse to the same stem ("run"), increasing recall (the ability to find relevant documents even when they use different forms of the same word).
Several stemming algorithms exist. This project uses Snowball Stemmer (also known as Porter2), a rule-based suffix-stripping algorithm available in NLTK for many languages including Italian and English.
Stemming is an approximation: it does not produce real words (e.g., "intelligence" becomes "intellig"), but it is computationally lightweight and sufficient for matching. The more sophisticated alternative would be lemmatization, which produces real words ("intelligence" -> "intelligence") but requires full grammatical analysis.
Language detection: A heuristic function counts how many Italian vs English stop words appear in the text and chooses the language with more matches. This determines which stemmer to apply, since suffix-stripping rules are language-specific.
Functions: detect_language(text) -> str, stem(tokens, language) -> list[str]
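A minimal sketch of the stop-word-counting heuristic and Snowball stemming with NLTK (the names mirror the functions above, but the bodies are illustrative):

```python
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def detect_language(text: str) -> str:
    """Heuristic: the language whose stop words appear more often wins."""
    tokens = set(text.lower().split())
    it_hits = len(tokens & set(stopwords.words("italian")))
    en_hits = len(tokens & set(stopwords.words("english")))
    return "italian" if it_hits >= en_hits else "english"

def stem(tokens: list[str], language: str) -> list[str]:
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(t) for t in tokens]

print(stem(["running", "runs"], "english"))  # ['run', 'run']
```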
Theory: The thesaurus is a synonym dictionary used for query expansion. When the user searches for "car", the thesaurus adds "automobile" and "vehicle" to the query, increasing the probability of finding relevant documents that use different words for the same concept. Expansion happens BEFORE preprocessing, so the added synonyms pass through the same pipeline (tokenization, stemming, etc.).
Implementation: Bilingual dictionary (IT + EN) with approximately 40 entries organized by thematic area: transport, computing, space, science, nature, etc. Each term maps to a list of synonyms. "Polysemy bridges" connect ambiguous terms to their alternative meanings (e.g., "python" -> "serpent", "mercury" -> "quicksilver").
Function: expand_terms(terms, enabled) -> (expanded_terms, expansion_log)
The expansion_log records which expansions were applied, useful for showing
the user how the query was modified.
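A sketch of dictionary-based expansion; the entries shown are the examples mentioned above, not the full ~40-entry thesaurus:

```python
THESAURUS = {
    "snake":   ["serpent", "reptile"],
    "python":  ["serpent"],
    "mercury": ["quicksilver"],
}

def expand_terms(terms, enabled=True):
    """Return (expanded_terms, expansion_log); the original terms always come first."""
    if not enabled:
        return list(terms), []
    expanded, log = list(terms), []
    for term in terms:
        for synonym in THESAURUS.get(term.lower(), []):
            if synonym not in expanded:
                expanded.append(synonym)
                log.append((term, synonym))
    return expanded, log

print(expand_terms(["snake"]))
# (['snake', 'serpent', 'reptile'], [('snake', 'serpent'), ('snake', 'reptile')])
```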
Theory: In a real IR system, the preprocessing stages form a pipeline: the output of each stage becomes the input of the next. The order matters: normalization must precede stop word removal (because stop word lists are in lowercase), and stemming must be applied last (because it operates on already-cleaned words).
Implementation: Calls in sequence tokenize -> normalize -> remove_stopwords -> stem.
Also includes preprocess_with_details() for debugging, which shows the intermediate
output of each stage.
Functions: preprocess(text) -> list[str], preprocess_with_details(text) -> dict
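Conceptually, the whole chain can be sketched as follows (a simplified, self-contained version; the real pipeline.py delegates to the stage modules described above):

```python
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = set(stopwords.words("italian")) | set(stopwords.words("english"))

def preprocess(text: str, language: str = "english") -> list[str]:
    """Tokenize -> normalize -> remove stop words -> stem, in that order."""
    tokens = re.findall(r"[a-zA-ZÀ-ÿ0-9]+", text)        # stage 2: tokenization
    tokens = [t.lower() for t in tokens if len(t) > 1]    # stage 3: normalization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stage 4: stop word removal
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(t) for t in tokens]              # stage 5: stemming

print(preprocess("The planets orbit the Sun"))  # e.g. ['planet', 'orbit', 'sun']
```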
Theory: The first stage of any search engine is document collection (crawling/collection). In web search engines this happens via crawlers that navigate the web; in this educational project, it amounts to reading files from a local directory.
Implementation: Scans the documents/ directory, reads each .txt file
in alphabetical order, and returns a list of dictionaries, each with:
- doc_id: unique identifier derived from the filename (e.g., "01_solar_system")
- filename: original filename
- content: full text content
The 15 documents include 5 Italian texts, 5 English texts, and 5 polysemic texts that exploit word ambiguity (bank, python, mars, cell, mercury).
Function: collect_documents(directory_path) -> list[dict]
Theory: Indexing is the core of the storage pipeline. For each document, the text is preprocessed and then each term is inserted into the inverted index along with its Term Frequency (TF), i.e., how many times it appears in that document.
The resulting data structure is:
{
"stemmed_term": {
"doc_id_1": occurrence_count,
"doc_id_2": occurrence_count,
...
},
...
}

The function also returns doc_lengths, a dictionary {doc_id: num_tokens} with the length of each document (needed for BM25 computation).
Functions: build_inverted_index(documents) -> (dict, dict), get_index_stats(index) -> dict
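A sketch of the index construction step, using a trivial split in place of the real preprocess() (the actual indexing.py may differ):

```python
from collections import Counter

def build_inverted_index(documents):
    """Return (inverted_index, doc_lengths) from a list of {doc_id, content} dicts."""
    inverted_index, doc_lengths = {}, {}
    for doc in documents:
        tokens = doc["content"].lower().split()   # stand-in for the real preprocess()
        doc_lengths[doc["doc_id"]] = len(tokens)
        for term, tf in Counter(tokens).items():
            inverted_index.setdefault(term, {})[doc["doc_id"]] = tf
    return inverted_index, doc_lengths

docs = [{"doc_id": "doc1", "content": "cat black cat house"},
        {"doc_id": "doc2", "content": "house white"}]
index, lengths = build_inverted_index(docs)
print(index["cat"])   # {'doc1': 2}
print(lengths)        # {'doc1': 4, 'doc2': 2}
```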
Theory: In a real system, the index is not rebuilt on every startup: it is
serialized to disk and reloaded. JSON format was chosen for readability
(you can inspect the file index/inverted_index.json with any text editor).
In a production system, more efficient binary formats would be used.
The JSON index file contains three sections: metadata, inverted_index, and
doc_lengths (document lengths for BM25).
Embeddings for dense mode are stored separately in index/embeddings.json.
Functions: save_index(), load_index(), index_exists(), build_metadata(),
save_embeddings(), load_embeddings()
Combines all stages into a single executable pipeline with visual feedback
via rich.console.status() (animated spinner). Handles caching logic:
if the index already exists on disk and a forced rebuild has not been requested,
it loads it directly.
Also provides build_and_save_embeddings() for generating dense embeddings
on demand (called the first time dense mode is used, or via +rebuild dense).
Functions: run_storage_pipeline(docs_dir, index_path, force_rebuild) -> dict,
build_and_save_embeddings(documents) -> dict
Supported syntax:
- "cat black" -- AND implicit (default)
- "cat && black" -- AND explicit (symbol)
- "cat || black" -- OR explicit (symbol)
- "cat OR black" -- OR inline (keyword)
- "AND:cat black" -- AND with explicit prefix
- "OR:cat black" -- OR with explicit prefix
Function: parse_query(raw_input) -> dict
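A sketch of how such a parser could map raw input to terms plus an operator (illustrative only, not the project's query_parser.py):

```python
def parse_query(raw: str) -> dict:
    """Extract search terms and the boolean operator from the raw query string."""
    text, operator = raw.strip(), "AND"                    # implicit AND by default
    if text.upper().startswith(("AND:", "OR:")):
        prefix, text = text.split(":", 1)
        operator = prefix.upper()
    elif "||" in text or " OR " in text.upper():
        operator = "OR"
    terms = [t for t in text.replace("&&", " ").replace("||", " ").split()
             if t.upper() not in ("AND", "OR")]
    return {"terms": terms, "operator": operator}

print(parse_query("cat || black"))   # {'terms': ['cat', 'black'], 'operator': 'OR'}
print(parse_query("OR:cat black"))   # {'terms': ['cat', 'black'], 'operator': 'OR'}
```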
Theory: The query must undergo the same preprocessing as the documents. With the addition of the thesaurus, the preprocessor now has three phases:
- Query Expansion (optional): if the thesaurus is active, expands terms with synonyms BEFORE standard preprocessing.
- Standard Preprocessing: tokenization, normalization, stop word removal.
- Bilingual Stemming: for each term, produces stems in both languages (IT + EN) and groups the variants to handle the bilingual collection.
Function: preprocess_query(parsed_query, thesaurus_enabled) -> dict
Frozen dataclass QueryRepresentation with:
- original_query: the user's original string
- processed_terms: tuple of all unique stems (for OR)
- term_variants: tuple of frozensets, variants per term (for AND)
- operator: "AND" or "OR"
- detected_language: detected language
Function: build_query(preprocessed_query) -> QueryRepresentation
In the boolean model:
- AND with variants: for each original term, the system has a set of stem variants. A document matches a term if it contains at least one variant. Then the intersection across groups is computed.
- OR: union of all posting lists across all stems.
Function: compare(query, inverted_index) -> dict
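A sketch of AND matching with stem variants as described above (toy index; the real comparison.py works on the full inverted index):

```python
def and_match(term_variants, inverted_index):
    """A doc matches if, for every original term, it contains at least one stem variant."""
    per_term_docs = []
    for variants in term_variants:                       # one frozenset per original term
        docs = set()
        for stem in variants:
            docs |= set(inverted_index.get(stem, {}))    # union over the term's variants
        per_term_docs.append(docs)
    return set.intersection(*per_term_docs) if per_term_docs else set()

index = {"pianet": {"doc1": 3}, "planet": {"doc1": 2, "doc2": 1}, "mart": {"doc2": 4}}
print(and_match([frozenset({"planet", "pianet"}), frozenset({"mars", "mart"})], index))
# {'doc2'}
```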
Ranking has been separated into a dispatcher and independent modules.
The dispatcher ranking.py receives the mode parameter and routes the call
to the appropriate module in search/modes/.
For keyword modes (boolean, tfidf, bm25, vsm), it uses the standard interface:
rank_results(matches, inverted_index, num_documents, **kwargs) -> list[dict]
For dense mode, it uses a separate interface:
rank_results(query_string, doc_embeddings, documents, threshold, top_k) -> list[dict]
Function: rank_results(...) -> list[dict], get_available_modes() -> list[str]
Formats results into a colored rich.Table. Adapts to the mode:
- boolean: shows only TF per term
- tfidf: shows TF, DF, IDF, and TF*IDF contribution per term
- bm25: shows TF, DF, IDF, |d|, avgdl, and BM25 contribution per term
- vsm: shows TF, DF, IDF, TF-IDF doc/query values, and cosine score
- dense: shows cosine similarity, embedding dimensions, and model name
Two functions are provided:
- format_results() for keyword modes (receives a QueryRepresentation)
- format_results_dense() for dense mode (receives the raw query string)
Both print directly to the console.
Composes all stages and manages configurable parameters:
- mode: "boolean", "tfidf", "bm25", "vsm", or "dense" (ranking type)
- thesaurus_enabled: True/False (query expansion, keyword modes only)
- doc_lengths: {doc_id: num_tokens} (needed for bm25)
- doc_embeddings: {doc_id: [float, ...]} (needed for dense)
- threshold: minimum cosine similarity for dense (default 0.0)
- top_k: maximum number of results for dense (default 10, 0 = all)
- silent: if True, suppresses console output (used by benchmark)
For dense mode, the pipeline bypasses keyword stages (parsing, preprocessing, comparison) and passes the raw query string directly to the dense ranker.
Returns a dict with expansion_log, found, ranked, timing_ms, and warning.
Function: run_search_pipeline(...) -> dict
All ranking modes are located in the search/modes/ sub-package. Each keyword mode
implements the same interface:
rank_results(matches, inverted_index, num_documents, **kwargs) -> list[dict]
The mode registry in search/modes/__init__.py maintains the AVAILABLE_MODES
dictionary with the name and description of each mode.
Score(doc, query) = SUM( TF(term, doc) )
The simplest ranking. A document's score is the sum of term frequencies for all matched terms. Does not distinguish between rare and common terms.
IDF(t) = log( N / DF(t) )
Score(doc, query) = SUM( TF(t, doc) * IDF(t) )
Penalizes very common terms (low IDF) and rewards rare terms (high IDF).
Each result includes a details field with the per-term breakdown: TF, DF, IDF,
and TF*IDF contribution.
IDF_BM25(t) = log( (N - DF + 0.5) / (DF + 0.5) + 1 )
Score(doc, query) = SUM( IDF(t) * TF * (k1+1) / (TF + k1 * (1 - b + b * |d| / avgdl)) )
Parameters: K1 = 1.5, B = 0.75 (Robertson et al., 1994 - TREC-3).
The most advanced keyword ranking: saturates the TF contribution (diminishing returns)
and normalizes for document length. Requires doc_lengths as an additional parameter.
Each result includes a details field with: TF, DF, IDF, |d| (document length),
avgdl (average length), and BM25 contribution per term.
K1 controls TF saturation:
- k1 = 0 results in pure boolean behavior (TF is ignored)
- k1 = 1.5 is the standard value, good balance for most collections
- very high k1 approaches linear TF-IDF behavior (no saturation)
B controls document length normalization:
- b = 0 means length has no effect
- b = 0.75 is the standard value with moderate normalization
- b = 1 means full normalization relative to average document length
Score(doc, query) = cos(theta) = (q . d) / (|q| * |d|)
Where q is the query's TF-IDF vector (TF=1 for each present term, weighted by IDF)
and d is the document's TF-IDF vector.
Unlike the plain TF-IDF mode (which sums TF*IDF), VSM normalizes by vector magnitude, rewarding direction over magnitude. A short but focused document can score higher than a long one that mentions the terms many times but contains many other terms.
Results include query_vector, doc_vector, and terms_order fields for inspection,
as well as per-term details with TF, DF, IDF, and TF-IDF values for both document
and query.
Features an IDF cache keyed on (id(inverted_index), num_documents) to avoid
recomputing IDF values across successive searches when the index has not changed.
Cosine similarity scores are always in the range [0, 1] for non-negative TF-IDF vectors.
Dense mode uses sentence-transformers to encode documents and queries into 384-dimensional dense vectors and computes cosine similarity.
Model: paraphrase-multilingual-MiniLM-L12-v2
- 384 dimensions
- Multilingual: supports 50+ languages (including Italian and English)
- Pre-trained on paraphrase data for semantic similarity
Key characteristics:
- The model is lazy-loaded and thread-safe: it is loaded only on first use, with a double-checked locking pattern to avoid concurrent loading.
- Document embeddings are pre-computed via build_embeddings() and cached on disk in index/embeddings.json.
- Results are filtered by a threshold (minimum cosine similarity, default 0.0) and limited by top-k (maximum number of results, default 10, 0 = all).
- This mode does not use the inverted index, stop words, stemming, or thesaurus. It operates on the raw document text and query string.
Functions: build_embeddings(documents) -> dict, rank_results(query_string, doc_embeddings, documents, threshold, top_k) -> list[dict]
The main.py file is the program's entry point. It uses the rich module
for a colorful CLI with banner, tables, and spinners. It handles:
- Logging setup: calls setup_logging() from logging_config.py.
- Initial banner: rich.Panel with title and available modes.
- NLTK setup: automatically downloads required resources if not present.
- Storage Pipeline execution: loads the index from disk (with spinner) or builds it from scratch.
- Interactive CLI loop with mode-aware prompt:
  - [boolean|T] search> -- boolean mode with thesaurus active (default)
  - [tfidf] search> -- tfidf mode without thesaurus
  - [bm25|T] search> -- bm25 mode with thesaurus active
  - [vsm] search> -- vsm mode
  - [dense] search> -- dense mode
- Available commands (prefix +):
| Command | Description |
|---|---|
| +help | Show all available commands |
| +mode boolean\|tfidf\|bm25\|vsm\|dense | Change ranking algorithm |
| +thesaurus | Toggle query expansion on/off |
| +thesaurus on\|off | Explicitly enable/disable query expansion |
| +d<N> | Show the full content of document N from the last search |
| +d<N> <term> | Show document N with occurrences of <term> highlighted |
| +index | Show the complete inverted index |
| +index <term> | Filter the inverted index by term (e.g., +index planet) |
| +threshold <value> | Set minimum cosine threshold for dense (default 0.0, range -1..1) |
| +topk <N> | Set maximum results for dense (default 10, 0 = all) |
| +bench | Benchmark: 100 queries x all modes, table sorted by speed |
| +plot | 2D vector proximity chart (matplotlib, external window) |
| +plot3d | 3D vector proximity chart (matplotlib, rotatable) |
| +plotext | 2D vector proximity chart (in terminal, via plotext) |
| +info | Index statistics + current configuration |
| +rebuild | Rebuild the index from scratch |
| +rebuild dense | Rebuild/generate dense embeddings |
| +clear | Clear the screen |
| +exit / +quit | Exit the program |
- Search operators:

| Syntax | Description |
|---|---|
| cat black | AND implicit (default) |
| cat && black | AND explicit (symbol) |
| cat \|\| black | OR explicit (symbol) |
| cat OR black | OR inline (keyword) |
| OR:cat black sun | OR with prefix (all terms) |
| AND:cat black | AND with prefix |
[boolean] search> planet
Score = SUM(TF)
doc1: TF(planet)=7 -> score=7
doc2: TF(planet)=3 -> score=3
doc3: TF(planet)=1 -> score=1
[tfidf] search> planet
IDF(planet) = log(15/4) = 1.322
Score = TF * IDF
doc1: 7 * 1.322 = 9.254
doc2: 3 * 1.322 = 3.966
doc3: 1 * 1.322 = 1.322
[bm25] search> planet
IDF_BM25(planet) = log((15-4+0.5)/(4+0.5)+1) = 1.138
|d1|=141, |d2|=158, avgdl=161.6
doc1: IDF * 7*2.5 / (7 + 1.5*(1-0.75+0.75*141/161.6)) = 3.887
doc2: IDF * 1*2.5 / (1 + 1.5*(1-0.75+0.75*158/161.6)) = 1.875
[vsm] search> planet
Both query and document are represented as TF-IDF vectors.
Score = cosine similarity between vectors.
doc1: cos=0.65 (high TF for "planet", focused content)
doc2: cos=0.32 (lower TF, more diverse vocabulary)
[dense] search> planet
Query encoded by paraphrase-multilingual-MiniLM-L12-v2 (384-dim).
Score = cosine similarity between query embedding and document embedding.
Captures semantic meaning: finds documents about celestial bodies
even without exact keyword match.
Note how BM25 "compresses" score differences compared to TF-IDF: doc1 with TF=7 does not have a 7x score over doc2 with TF=1 (saturation).
Note how VSM normalizes by vector magnitude: a short, focused document can outrank a longer one with higher raw TF.
[boolean|T] search> snake
Thesaurus: snake -> serpent, reptile
The query becomes: ["snake", "serpent", "reptile"]
Finds documents that use "serpent" even if they do not contain "snake"
[tfidf] search> python
Finds both the document about Python programming and the one about
the python snake. TF-IDF ranks by relevance based on term frequency
and rarity in the collection.
[dense] search> python programming
Dense mode captures semantic meaning: the programming document scores
significantly higher because the embedding understands the semantic
association between "python" and "programming".
[bm25|T] search> serpente
Thesaurus: serpente -> snake, serpent, reptile
Cross-language expansion finds the English document about the python snake
even though the query is in Italian.
pip install -r requirements.txt

Dependencies:

- nltk>=3.8 -- natural language processing (tokenization, stemming, stop words)
- rich>=13.0 -- colored CLI output (tables, panels, spinners)
- matplotlib>=3.7 -- 2D/3D vector proximity plots
- scikit-learn>=1.3 -- PCA dimensionality reduction for plots
- sentence-transformers>=2.2 -- dense embeddings (for dense mode)
- plotext>=5.2 -- in-terminal 2D plots
python main.py

On first run, the system downloads the required NLTK resources and builds the index. Subsequent runs load the index from disk.
[boolean|T] search> planet # search "planet" (thesaurus ON by default)
[boolean|T] search> artificial intelligence # AND: both terms required
[boolean|T] search> planet || mars # OR: at least one term
[boolean|T] search> snake && python # AND explicit
[boolean|T] search> +thesaurus # toggle -> OFF
[boolean] search> python OR learning # OR alternative without expansion
[boolean] search> +mode tfidf # switch to TF-IDF
[tfidf] search> planet # same query, different ranking
[tfidf] search> +mode bm25 # switch to BM25
[bm25] search> planet # BM25: saturates TF, normalizes length
[bm25] search> +thesaurus # toggle -> ON
[bm25|T] search> serpente # cross-language: finds python_the_snake!
[bm25|T] search> +mode vsm # switch to VSM
[vsm|T] search> planet sun # cosine similarity between vectors
[vsm|T] search> +mode dense # switch to Dense (loads model on first use)
[dense] search> planets in our solar system # semantic search, no keyword matching
[dense] search> +threshold 0.3 # set minimum cosine threshold
[dense] search> +topk 5 # limit to top 5 results
[dense] search> +d1 # show full text of result #1
[dense] search> +d1 planet # show result #1 with "planet" highlighted
[dense] search> +info # show stats + configuration
[dense] search> +bench # benchmark all modes (100 queries each)
[dense] search> +plot # 2D TF-IDF vector proximity chart
[dense] search> +plot3d # 3D vector proximity chart (rotatable)
[dense] search> +plotext # 2D chart in terminal
[dense] search> +index planet # inspect inverted index for "planet"
[dense] search> +rebuild # rebuild inverted index from scratch
[dense] search> +rebuild dense # regenerate dense embeddings
[dense] search> +quit # exit
The collection includes 5 documents designed to demonstrate polysemy:
- 11_banks_and_rivers.txt -- "bank" as river bank (EN)
- 12_python_the_snake.txt -- "python" as a snake (EN)
- 13_mars_mythology.txt -- "mars" as Roman god + Mars candy bar (IT)
- 14_cells_biology_prison.txt -- "cell" in biology vs prison (EN)
- 15_mercury_element.txt -- "mercury" as chemical element + god (IT)
Try searching for "python", "mars", "mercury" to see how the engine handles terms with multiple meanings.
The engine provides three visualization commands that plot all 15 documents and the query as points in a reduced vector space (via PCA on TF-IDF vectors):
- +plot: Opens a 2D matplotlib chart in an external window. Found documents are shown in green, unfound documents in gray, and the query as a red star. Dashed lines connect the query to found documents.
- +plot3d: Same as above but in 3D (rotatable).
- +plotext: Renders a 2D chart directly in the terminal using plotext.
These visualizations help understand how documents cluster in vector space and why certain documents are closer to the query than others.
The +bench command runs 100 queries (90 deterministic + 10 random) across all
available modes and displays a table sorted by average retrieval time.
It measures both the pure retrieval time (comparison + ranking) and the full
pipeline time (including parsing, preprocessing, and formatting).
If dense embeddings are not available, the dense mode is excluded from the benchmark
with a notification to run +rebuild dense first.
The project uses Python's standard logging module, configured centrally
in logging_config.py.
The setup_logging() function configures the root logger with two handlers:
| Handler | Level | Destination | Purpose |
|---|---|---|---|
| FileHandler | DEBUG | engine.log | Captures everything for post-mortem analysis |
| StreamHandler | WARNING | console (stderr) | Only warnings and errors (Rich handles normal UI) |
Log format:
- File: 2024-01-15 14:30:22 [INFO ] search.pipeline: Ricerca completata: 3 risultati in 1.23 ms
- Console: [WARNING] message
Every module that needs logging follows the standard pattern:
import logging
logger = logging.getLogger(__name__)
logger.info("Index loaded with %d terms", num_terms)
logger.debug("Query preprocessing: %s -> %s", raw, processed)
logger.warning("Embeddings not available")

Using __name__ ensures that log messages include the fully qualified module name
(e.g., search.pipeline, storage.indexing, search.modes.dense), making it
easy to trace where each message originated.
The engine.log file is created in the project root directory and contains
all messages at DEBUG level and above. It is useful for:
- Debugging preprocessing issues (which stems were generated)
- Tracking search performance (timing data)
- Diagnosing model loading issues (sentence-transformers)
- Understanding the storage pipeline flow
The file handler uses UTF-8 encoding to correctly handle Italian text in log messages.
setup_logging() checks for existing handlers before adding new ones, ensuring
it can be called multiple times without creating duplicate handlers.
The test suite is located in test.py and contains 32 tests covering all
components of the project.
python test.py

The test runner produces compact output with one line per test:
MOTORE DI RICERCA - TEST (32 test)
[01/32] PASS test_tokenize
[02/32] PASS test_normalize
...
[23/32] SKIP test_dense_build_embeddings (sentence-transformers non installato)
[24/32] SKIP test_dense_ranking (sentence-transformers non installato)
...
[32/32] PASS test_storage_pipeline_completa
RISULTATO: 29/32 passati, 0 falliti, 3 saltati
Log scritto in test.log
Failed tests show the full traceback inline. A test.log file is generated
with the same format as the console output.
| Tests | Category | Description |
|---|---|---|
| 01-09 | Preprocessing | tokenize, normalize, get_stopwords, remove_stopwords, detect_language, stem, preprocess, preprocess_with_details, thesaurus_expand |
| 10-13 | Storage | collect_documents, build_inverted_index, persistence (index), persistence (embeddings) |
| 14-18 | Search Core | parse_query, query_preprocessor, query_representation, compare |
| 19-22 | Ranking Modes | boolean_ranking, tfidf_ranking, bm25_ranking, vsm_ranking |
| 23-24 | Dense Embeddings | dense_build_embeddings, dense_ranking |
| 25-28 | Dispatcher/Registry | ranking_dispatcher_tutti, modes_registry_completo, format_results_vsm, format_results_dense |
| 29-31 | Integration | search_pipeline_vsm, search_pipeline_dense, search_pipeline_keyword |
| 31-32 | End-to-end | plot_data_builder, storage_pipeline_completa |
Tests that require sentence-transformers (tests 23-24 and 30) are automatically
marked as SKIP if the library is not installed, rather than failing. This allows
the test suite to pass on environments without GPU or without the heavy
sentence-transformers dependency.
The test suite verifies:
- Preprocessing correctness: tokenization handles accented characters, normalization removes single-character tokens, stop words are removed for both IT and EN, stemming produces expected stems, language detection works for both languages.
- Thesaurus expansion: "snake" expands to include "serpente" (cross-language bridge), disabled thesaurus produces no expansion.
- Storage integrity: 15 documents are collected in alphabetical order, inverted index is non-empty, doc_lengths are positive, persistence round-trips correctly (save then load).
- Query parsing: all syntax variants (AND/OR implicit, explicit, prefix, keyword) are correctly parsed.
- Ranking correctness: results are sorted by score descending, IDF values follow the expected formula, BM25 exhibits TF saturation (TF=10 vs TF=1 ratio is less than 10), VSM cosine scores are in [0, 1].
- Dense embeddings: embeddings are 384-dimensional, semantic search ranks "planets in space" with the solar system document first, threshold and top-k filtering work correctly.
- Mode registry: all 5 modes (boolean, tfidf, bm25, vsm, dense) are registered, unknown modes raise ValueError.
- Integration: full pipeline produces results for all modes, thesaurus cross-language search works end-to-end, empty queries return found=False.
- Plot data: PCA reduction produces correct shapes (16 points = 15 docs + 1 query), query is labeled "QUERY", found documents are correctly identified.
The test runner exits with code 0 if all tests pass (or are skipped), and code 1 if any test fails, making it suitable for CI/CD integration.