CP423 Assignment 2 - Information Retrieval System

What I Built

Basic search engine that finds phrases in documents and ranks documents using different TF-IDF weighting schemes. Works with a corpus of 249 text files.

Interactive mode

Try out phrase search and ranked search by running:

python3 a2.py

What It Does

Part 1: Finding Phrases

  • Searches for exact phrases like "holmes watson" or whatever you want to look for
  • Finds where words appear next to each other in the right order
  • Handles phrases of up to 5 words
  • Builds a positional index that records where every word appears in every document (a minimal sketch follows this list)
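
Here is a minimal sketch of how a positional index and phrase search can work. The function names and data layout are illustrative assumptions, not necessarily the ones used in a2.py:

from collections import defaultdict

def build_positional_index(docs):
    # docs: {doc_id: list of tokens}
    # index: {term: {doc_id: [positions]}}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    # phrase: list of query tokens, e.g. ["holmes", "watson"]
    # A document matches if each later term occurs exactly
    # i positions after some occurrence of the first term.
    results = []
    for doc_id, positions in index.get(phrase[0], {}).items():
        for start in positions:
            if all(start + i in index.get(term, {}).get(doc_id, [])
                   for i, term in enumerate(phrase[1:], start=1)):
                results.append(doc_id)
                break
    return results

For example, phrase_search(build_positional_index({"d1": ["holmes", "watson", "investigate"]}), ["holmes", "watson"]) returns ["d1"].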

Part 2: Ranking Documents

  • Uses TF-IDF (term frequency × inverse document frequency) to score documents
  • Tested 5 different ways to count term frequency (sketched in code after this list):
    1. Binary - 1 if the word appears, 0 if not
    2. Raw Count - how many times the word appears
    3. Log - log(1 + count) to damp high frequencies
    4. Normalized - count divided by document length
    5. Double Normalized - 0.5 + 0.5*(count/max_count)
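
The five schemes as straightforward Python. The function names are my illustrative choices, and the exact IDF variant in a2.py may differ (e.g., include smoothing):

import math

# count:     raw occurrences of the term in one document
# doc_len:   total tokens in that document
# max_count: highest raw count of any term in that document

def tf_binary(count):
    return 1 if count > 0 else 0

def tf_raw(count):
    return count

def tf_log(count):
    return math.log(1 + count)

def tf_normalized(count, doc_len):
    return count / doc_len

def tf_double_normalized(count, max_count):
    return 0.5 + 0.5 * (count / max_count)

# One common IDF; each TF above is multiplied by this to get TF-IDF.
def idf(num_docs, doc_freq):
    return math.log(num_docs / doc_freq)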

My Findings

Phrase Search Results

  • "the king" → found in 53 documents (pretty common)
  • "holmes watson" → only 3 documents (detective stories)
  • "once upon time" → 84 documents (fairy tales)
  • "the difference between a blow" → just 1 document (complex phrase test)

TF Scheme Comparison

Tested with queries like "murder mystery" and "love story":

Binary TF: Simple but works okay. Treats all word occurrences the same.

Raw Count TF: Can blow up on repetitive documents. poem-1.txt scored 0.2683 for "love story", presumably because it repeats "love" and "story" over and over.

Log Normalized: Much better. Brought poem-1.txt down to 0.0951, which is far more reasonable. The damping keeps documents from getting extreme scores just because they repeat a few words a lot.

Normalized TF: Similar to raw count but accounts for document length. Still lets repetitive docs dominate.

Double Normalized: Most balanced results. No single document can completely take over the rankings.

Which One's Best?

Log normalization worked best for me. It surfaces relevant documents without letting repetitive ones skew the results. Binary is fine if you only care whether words appear at all. Raw count is the weakest choice, since it mostly rewards repetition rather than relevance.

Technical Stuff

  • Preprocesses text (lowercasing, stopword removal, etc.)
  • Uses a positional index for phrase searches
  • Uses cosine similarity for ranking (sketched below)
  • Handles 249 documents with 38,669 unique terms
  • Reinforced a lot of the course content and gave me plenty of Python practice
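
A rough sketch of the cosine-similarity step over sparse TF-IDF vectors. The dict-based vector representation here is an assumption for illustration, not necessarily how a2.py stores them:

import math

def cosine_similarity(query_vec, doc_vec):
    # Both vectors are sparse {term: tf-idf weight} dicts.
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)

Documents are then ranked by this score against the query vector, highest first.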

Files

  • a2.py - main code
  • data/ - all the text files
  • requirements.txt - what to install

Setup

pip install -r requirements.txt

Conclusion

Built a working search engine. The choice of TF scheme definitely matters: some give noisier results, others are more balanced. Log normalization seems like the sweet spot for usefulness. Phrase searching works great for finding exact sequences. Overall, it was a fun project to build.
