This project implements a lightweight Information Retrieval (IR) pipeline using Node.js, Express, and MongoDB. It was originally created as an academic / research project to explore concepts like document preprocessing, inverted index construction, TF-IDF computation, query ranking, and search with cosine-based scoring.
The system loads raw text documents, preprocesses them, builds an inverted index, and provides a search endpoint that returns relevant documents ranked by similarity score.
/routes
  documents.js – Save documents, build inverted index, search
/models
  Document.js – Mongoose schema for processed docs
  FichierInverse.js – Schema for inverted index entries
/utils
  saveDocumentsFromFolder.js
  query.js – TF-IDF search + scoring
  creationfichierinverse.js
/Collection_TIME – Folder containing raw text files
- Reads .txt files from /Collection_TIME
- Preprocesses text
- Computes term frequencies
- Saves documents into MongoDB
- Creates TF-IDF-ready inverted index
- Stores posting lists and document term frequencies
- Tokenizes user query
- Computes TF-IDF score
- Ranks documents by similarity
- Returns most relevant content
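The steps listed above (preprocessing, term frequencies, inverted index, TF-IDF ranking) can be sketched in memory as follows. This is a minimal illustration, not the code in `query.js` or `creationfichierinverse.js`: the function names `tokenize`, `termFrequencies`, `buildInvertedIndex`, and `search` are hypothetical, MongoDB is replaced by plain `Map` objects, and the weighting (raw term frequency times `log(N / nb_doc)`, compared by cosine similarity) is one common choice that the project may or may not use.

```javascript
// Illustrative in-memory sketch; all function names here are hypothetical.

// Lowercase the text and split it into alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Count how often each term occurs in a token list.
function termFrequencies(tokens) {
  const tf = new Map();
  for (const token of tokens) tf.set(token, (tf.get(token) || 0) + 1);
  return tf;
}

// Build a map: terme -> { nb_doc, posting: [{ fileName, frequency }] }.
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const { fileName, content } of docs) {
    for (const [terme, frequency] of termFrequencies(tokenize(content))) {
      if (!index.has(terme)) index.set(terme, { nb_doc: 0, posting: [] });
      const entry = index.get(terme);
      entry.nb_doc += 1;
      entry.posting.push({ fileName, frequency });
    }
  }
  return index;
}

// Rank documents by cosine similarity between TF-IDF vectors.
// Query terms that are not in the index are ignored.
function search(query, docs, index) {
  const idf = (terme) => Math.log(docs.length / index.get(terme).nb_doc);

  // Squared norm of every document vector, accumulated over all indexed terms.
  const squaredNorms = new Map();
  for (const [terme, { posting }] of index) {
    for (const { fileName, frequency } of posting) {
      const weight = frequency * idf(terme);
      squaredNorms.set(fileName, (squaredNorms.get(fileName) || 0) + weight * weight);
    }
  }

  // Dot product between the query vector and each matching document vector.
  const dots = new Map();
  let squaredQueryNorm = 0;
  for (const [terme, queryFrequency] of termFrequencies(tokenize(query))) {
    if (!index.has(terme)) continue;
    const queryWeight = queryFrequency * idf(terme);
    squaredQueryNorm += queryWeight * queryWeight;
    for (const { fileName, frequency } of index.get(terme).posting) {
      const docWeight = frequency * idf(terme);
      dots.set(fileName, (dots.get(fileName) || 0) + queryWeight * docWeight);
    }
  }

  return [...dots]
    .map(([fileName, dot]) => ({
      fileName,
      score: dot / Math.sqrt(squaredNorms.get(fileName) * squaredQueryNorm),
    }))
    .filter((result) => result.score > 0) // also drops NaN from zero-length vectors
    .sort((a, b) => b.score - a.score);
}

// Usage with three tiny made-up documents.
const docs = [
  { fileName: "a.txt", content: "The cat sat on the mat." },
  { fileName: "b.txt", content: "The dog chased the cat." },
  { fileName: "c.txt", content: "Stock markets fell sharply." },
];
const index = buildInvertedIndex(docs);
console.log(search("cat dog", docs, index));
```

In this example `b.txt` ranks ahead of `a.txt` because it matches both query terms, and `c.txt` is omitted because it shares no term with the query. A term that occurs in every document gets an IDF of zero under this weighting, so it contributes nothing to the score.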
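POST /save – Loads and stores all documents.
POST /build-inverted-index – Creates the TF-IDF inverted index.
POST /search – Returns the most relevant documents, ranked by similarity score.
Request: { "request": "your keywords" }
Response: { "results": [ { "fileName": "...", "content": "...", "score": 0.87 } ] }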
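A client might call the search endpoint along these lines. This is a sketch under assumptions: the helper name `searchDocuments`, the `baseUrl` parameter, and the injectable `fetchImpl` argument are illustrative and not part of the project, and the route is assumed to be mounted at `/search` with no prefix. Only the request and response shapes come from the description above.

```javascript
// Hypothetical client helper; baseUrl and fetchImpl are illustrative parameters.
async function searchDocuments(baseUrl, keywords, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(`${baseUrl}/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The endpoint expects the query under the "request" key.
    body: JSON.stringify({ request: keywords }),
  });
  if (!response.ok) {
    throw new Error(`Search failed with HTTP status ${response.status}`);
  }
  // Each result is expected to carry fileName, content, and score.
  const { results } = await response.json();
  return results;
}
```

Passing the fetch function as an argument keeps the helper testable without a running server. With Node.js 18 or later the built-in `fetch` is used by default.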
Document:
- fileName
- content
- indexdoc.index
- indexdoc.frequency
FichierInverse:
- terme
- nb_doc
- posting[fileName, frequency]
- Node.js / Express
- MongoDB / Mongoose
- TF-IDF scoring
- Basic IR engine design
npm install
npm start
Workflow:
- Add documents to /Collection_TIME
- POST /save
- POST /build-inverted-index
- POST /search
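The order of the workflow calls matters, since the index is built from the saved documents and search reads the index. A small script could run the three requests in sequence, as sketched below. The function name `runWorkflow`, the `baseUrl` parameter, the empty JSON bodies sent to `/save` and `/build-inverted-index`, and the assumption that the routes are mounted without a prefix are all illustrative guesses, not confirmed details of the project.

```javascript
// Hypothetical workflow runner; names, base URL, and empty bodies are assumptions.
async function runWorkflow(baseUrl, keywords, fetchImpl = globalThis.fetch) {
  // Send one JSON POST request and fail fast on a non-2xx status.
  const post = async (path, payload) => {
    const response = await fetchImpl(`${baseUrl}${path}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
    if (!response.ok) {
      throw new Error(`POST ${path} failed with HTTP status ${response.status}`);
    }
    return response;
  };

  await post("/save", {});                 // 1. load and store the documents
  await post("/build-inverted-index", {}); // 2. build the index from stored documents
  const response = await post("/search", { request: keywords }); // 3. query the index
  return response.json();
}
```

Each step is awaited before the next one starts, so a failure in `/save` or `/build-inverted-index` stops the script before any search is attempted.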
- Add stemming & stop words
- Improve cosine similarity
- Add pagination & highlighting
- Add frontend viewer