Information Retrieval & Web Analytics (IRWA) Lab Activities

This repository focuses on the technical backbone of modern search engines and text processing systems. It documents the journey from basic text manipulation to building complex data structures like Positional Indexes and implementing Boolean logic for document retrieval.

🎯 Core Objectives

Understanding how search engines index vast amounts of unstructured text.
Implementing NLP preprocessing pipelines (Normalization, Tokenization, Stop-word removal).
Developing algorithms for efficient data retrieval and similarity measurement.

📂 Project Roadmap

🔍 1. Search Engine Indexing (The Core)

Inverted Indexing: Implementation of the standard "term-to-document" mapping. Includes logic to handle folder-based corpora where documents are parsed and mapped to a searchable dictionary.
Positional Indexing: Advanced indexing that stores the exact position of words. This enables Phrase Searching (e.g., finding the exact sequence "Machine Learning") rather than just individual keywords.
Boolean Retrieval: Custom functions to perform AND, OR, and NOT operations on indexed terms to filter document results.

✂️ 2. NLP & Text Preprocessing

Preprocessing Pipelines: Systematic removal of punctuation using str.maketrans, lowercasing, and whitespace handling.
Tokenization & N-grams: breaking down sentences into unigrams, bigrams, and trigrams to analyze word frequency and context.
Feature Extraction: Using CountVectorizer to transform text into numerical vectors for machine learning compatibility.

📐 3. Mathematical Foundations

Cosine Similarity: Visualizing and calculating the cosine function—the mathematical foundation for measuring the similarity between two text documents in a vector space.
Statistical Analysis: Methods to count character types (Upper, Lower, Digits, Special) and analyze series summation.

🛠 Tech Stack

Language: Python 3.x
Libraries: - NLTK: For tokenization and N-grams.
- Scikit-Learn: For vectorization and document-term matrices.
- NumPy & Pandas: For data structure management.
- Matplotlib: For mathematical function visualization.

🧪 Detailed Lab Breakdown

IRWA_Lab001: Python fundamentals, list manipulation, and basic string statistics.
IRWA_Lab02 & Part A: Deep dive into text normalization, punctuation removal efficiency, and NLTK tokenization.
IRWA_Lab03 & Lab04: Architecture for building inverted indexes from scratch and handling file-system corpora.
IRWA_Lab05: Implementation of the Positional Index and testing Boolean retrieval logic (AND/NOT queries).

🚀 Usage

To explore the indexing logic:

Clone the repository.
Navigate to IRWA_Lab05.ipynb to see the most advanced implementation of the Positional Index.
Use the provided Teaching Guides for a step-by-step walkthrough of the indexing logic.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
Examples_for_lab_03_teaching_guide.ipynb		Examples_for_lab_03_teaching_guide.ipynb
IRWA_Lab001.ipynb		IRWA_Lab001.ipynb
IRWA_Lab02.ipynb		IRWA_Lab02.ipynb
IRWA_Lab02_PartA.ipynb		IRWA_Lab02_PartA.ipynb
IRWA_Lab03 (1).ipynb		IRWA_Lab03 (1).ipynb
IRWA_Lab04.ipynb		IRWA_Lab04.ipynb
IRWA_Lab05.ipynb		IRWA_Lab05.ipynb
IRWA_Lab_01.py		IRWA_Lab_01.py
LICENSE		LICENSE
Lab_03_teaching_guide.ipynb		Lab_03_teaching_guide.ipynb
README.md		README.md
cosine_function.ipynb		cosine_function.ipynb
cosine_function.py		cosine_function.py
irwa_lab001.py		irwa_lab001.py
irwa_lab03.py		irwa_lab03.py
irwa_lab04.py		irwa_lab04.py
irwa_lab05.py		irwa_lab05.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Information Retrieval & Web Analytics (IRWA) Lab Activities

🎯 Core Objectives

📂 Project Roadmap

🔍 1. Search Engine Indexing (The Core)

✂️ 2. NLP & Text Preprocessing

📐 3. Mathematical Foundations

🛠 Tech Stack

🧪 Detailed Lab Breakdown

🚀 Usage

About

Uh oh!

Releases

Packages

Languages

License

dyneth02/IRWA-Labs

Folders and files

Latest commit

History

Repository files navigation

Information Retrieval & Web Analytics (IRWA) Lab Activities

🎯 Core Objectives

📂 Project Roadmap

🔍 1. Search Engine Indexing (The Core)

✂️ 2. NLP & Text Preprocessing

📐 3. Mathematical Foundations

🛠 Tech Stack

🧪 Detailed Lab Breakdown

🚀 Usage

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages