CP423 Assignment 2 - Information Retrieval System

What I Built

Basic search engine that finds phrases in documents and ranks documents using different TF-IDF weighting schemes. Works with a corpus of 249 text files.

Interactive mode

Try out phrase search and ranked search by running:

python3 a2.py

What It Does

Part 1: Finding Phrases

  • Searches for exact phrases like "holmes watson" or whatever you want to look for
  • Finds where words appear next to each other in the right order
  • Handles phrases of up to 5 words
  • Builds a positional index that records where every word appears in every document (a minimal sketch follows this list)
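
Here is a minimal sketch of how a positional index and phrase search can work. The function names and data layout are illustrative assumptions, not necessarily the ones used in a2.py:

from collections import defaultdict

def build_positional_index(docs):
    # docs: {doc_id: list of tokens}
    # index: {term: {doc_id: [positions]}}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    # phrase: list of query tokens, e.g. ["holmes", "watson"]
    # A document matches if each later term occurs exactly
    # i positions after some occurrence of the first term.
    results = []
    for doc_id, positions in index.get(phrase[0], {}).items():
        for start in positions:
            if all(start + i in index.get(term, {}).get(doc_id, [])
                   for i, term in enumerate(phrase[1:], start=1)):
                results.append(doc_id)
                break
    return results

For example, phrase_search(build_positional_index({"d1": ["holmes", "watson", "investigate"]}), ["holmes", "watson"]) returns ["d1"].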

Part 2: Ranking Documents

  • Uses TF-IDF (term frequency × inverse document frequency) to score documents
  • Tested 5 different ways to count term frequency (sketched in code after this list):
    1. Binary - 1 if the word appears, 0 if not
    2. Raw Count - how many times the word appears
    3. Log - log(1 + count) to damp high frequencies
    4. Normalized - count divided by document length
    5. Double Normalized - 0.5 + 0.5*(count/max_count)
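
The five schemes as straightforward Python. The function names are my illustrative choices, and the exact IDF variant in a2.py may differ (e.g., include smoothing):

import math

# count:     raw occurrences of the term in one document
# doc_len:   total tokens in that document
# max_count: highest raw count of any term in that document

def tf_binary(count):
    return 1 if count > 0 else 0

def tf_raw(count):
    return count

def tf_log(count):
    return math.log(1 + count)

def tf_normalized(count, doc_len):
    return count / doc_len

def tf_double_normalized(count, max_count):
    return 0.5 + 0.5 * (count / max_count)

# One common IDF; each TF above is multiplied by this to get TF-IDF.
def idf(num_docs, doc_freq):
    return math.log(num_docs / doc_freq)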

My Findings

Phrase Search Results

  • "the king" → found in 53 documents (pretty common)
  • "holmes watson" → only 3 documents (detective stories)
  • "once upon time" → 84 documents (fairy tales)
  • "the difference between a blow" → just 1 document (complex phrase test)

TF Scheme Comparison

Tested with queries like "murder mystery" and "love story":

Binary TF: Simple but works okay. Treats all word occurrences the same.

Raw Count TF: Can blow up on repetitive documents. poem-1.txt scored 0.2683 for "love story", presumably because it repeats "love" and "story" over and over.

Log Normalized: Much better. Brought poem-1.txt down to 0.0951, which is far more reasonable. The damping keeps documents from getting extreme scores just because they repeat a few words a lot.

Normalized TF: Similar to raw count but accounts for document length. Still lets repetitive docs dominate.

Double Normalized: Most balanced results. No single document can completely take over the rankings.

Which One's Best?

Log normalization worked best for me. It surfaces relevant documents without letting repetitive ones skew the results. Binary is fine if you only care whether words appear at all. Raw count is the weakest choice, since it mostly rewards repetition rather than relevance.

Technical Stuff

  • Preprocesses text (lowercasing, stopword removal, etc.)
  • Uses a positional index for phrase searches
  • Uses cosine similarity for ranking (sketched below)
  • Handles 249 documents with 38,669 unique terms
  • Reinforced a lot of the course content and gave me plenty of Python practice
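
A rough sketch of the cosine-similarity step over sparse TF-IDF vectors. The dict-based vector representation here is an assumption for illustration, not necessarily how a2.py stores them:

import math

def cosine_similarity(query_vec, doc_vec):
    # Both vectors are sparse {term: tf-idf weight} dicts.
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)

Documents are then ranked by this score against the query vector, highest first.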

Files

  • a2.py - main code
  • data/ - all the text files
  • requirements.txt - what to install

Setup

pip install -r requirements.txt

Conclusion

Built a working search engine. The choice of TF scheme definitely matters: some give noisier results, others are more balanced. Log normalization seems like the sweet spot for usefulness. Phrase searching works great for finding exact sequences. Overall, it was a fun project to build.
