A basic search engine that finds exact phrases in documents and ranks the results using different term-weighting formulas. Works with a corpus of 249 text files.
Try out phrase search and ranked search by running `python3 a2.py`.
- Searches for exact phrases like "holmes watson" or whatever you want to look for
- Finds documents where the query words appear next to each other, in order
- Can handle up to 5 words in a phrase
- Builds a positional index that records where every word occurs in every document (see the phrase-lookup sketch after the example results below)
- Uses TF-IDF (term frequency - inverse document frequency) to score documents
- Tested 5 different ways to compute term frequency (see the sketch after this list):
  - Binary: 1 or 0 (word present or not)
  - Raw Count: how many times the word appears
  - Log: log(1 + count) to tone down high frequencies
  - Normalized: frequency divided by document length
  - Double Normalized: 0.5 + 0.5 * (count / max_count)
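
Here is a minimal sketch of the five TF variants combined with a standard IDF. The function names and signatures are illustrative, not the exact ones in a2.py; `count` is the term's frequency in a document, `max_count` the highest term frequency in that document, `doc_len` the document length in tokens, `N` the number of documents, and `df` the term's document frequency.

```python
import math

def idf(N, df):
    # Standard inverse document frequency; the +1 avoids division by zero.
    return math.log(N / (1 + df))

def tf_binary(count, max_count, doc_len):
    # 1 if the word appears at all, 0 otherwise.
    return 1.0 if count > 0 else 0.0

def tf_raw(count, max_count, doc_len):
    # Plain occurrence count.
    return float(count)

def tf_log(count, max_count, doc_len):
    # log(1 + count) dampens very high frequencies.
    return math.log(1 + count)

def tf_normalized(count, max_count, doc_len):
    # Frequency divided by document length.
    return count / doc_len if doc_len else 0.0

def tf_double_normalized(count, max_count, doc_len, k=0.5):
    # 0.5 + 0.5 * (count / max_count): the most frequent term maps to 1.0.
    return k + (1 - k) * (count / max_count) if max_count else 0.0

# A document's score for a term is then tf(...) * idf(N, df).
```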
- "the king" → found in 53 documents (pretty common)
- "holmes watson" → only 3 documents (detective stories)
- "once upon time" → 84 documents (fairy tales)
- "the difference between a blow" → just 1 document (complex phrase test)
Tested with queries like "murder mystery" and "love story":
Binary TF: Simple but works reasonably well. Treats a word that appears once the same as one that appears a hundred times.
Raw Count TF: Over-rewards repetitive documents. For example, poem-1.txt scored 0.2683 for "love story", most likely because it repeats "love" and "story" over and over.
Log Normalized: Much better. It brought poem-1.txt down to 0.0951, which is far more reasonable. The log damping keeps documents from getting huge scores just because they repeat a few words a lot.
Normalized TF: Similar to raw count but considers document length. Still lets repetitive docs dominate.
Double Normalized: Most balanced results. Doesn't let any single document completely take over the rankings.
Log normalization worked best for me: it finds relevant documents without letting repetitive ones skew the results. Binary is fine if you only care whether the words appear at all. Raw count is the weakest option here, since it mostly rewards repetition.
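
A quick illustration of the damping effect (the counts here are made up, not taken from the actual corpus):

```python
import math

# A term repeated 50 times vs. 5 times: raw count gives a 10x advantage,
# while log(1 + count) shrinks that to roughly 2.2x.
for count in (5, 50):
    print(count, math.log(1 + count))
# 5  1.79...
# 50 3.93...
```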
- Reinforced a lot of the course content
- Preprocesses text (lowercase, remove stopwords, etc.); a sketch of the preprocessing and ranking steps follows this list
- Uses positional indexing for phrase searches
- Cosine similarity for ranking
- Handles 249 documents with 38,669 unique terms
- Got a lot of Python practice
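
A minimal sketch of the preprocessing and cosine-similarity ranking steps mentioned above, assuming a small hard-coded stopword list; the real code likely uses a fuller list, and the vectors here are plain counts rather than the TF-IDF weights the actual ranking would use.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # assumed subset

def preprocess(text):
    # Lowercase, keep alphabetic tokens, drop stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def cosine_similarity(query_vec, doc_vec):
    # Dot product over shared terms, divided by the vector norms.
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

# Toy example: in the real system the weights would be TF-IDF values
# computed with one of the schemes sketched earlier.
doc = Counter(preprocess("Holmes and Watson solve the mystery"))
query = Counter(preprocess("murder mystery"))
print(cosine_similarity(query, doc))
```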
- `a2.py` - main code
- `data/` - all the text files
- `requirements.txt` - what to install
Install dependencies with `pip install -r requirements.txt`.

Built a working search engine. Different TF schemes definitely matter: some give less useful results, others are more balanced, and log normalization seems like the sweet spot. Phrase searching works great for finding exact word sequences. Overall it was a fun project to build.