
itsDZhang/Search-Engine


Search Engine

1st DEMO

2nd DEMO

This is a miniature search engine that retrieves the top 10 most relevant documents from the Los Angeles Times collection. The collection has 136k+ documents, but the engine returns the relevant documents in several milliseconds.

This search engine consists of:

  • IndexEngine
  • Lexicon
  • Query Interpreter
  • Snippet Engine
  • Ranking Engine

Note: Does not have a web crawler.

Ranking Function: BM25

BM25
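A minimal sketch of BM25 ranking over a postings structure like the one described in the Indexing section. This is an illustration, not the repository's code: the function name `bm25_rank`, the parameter values `k1=1.2` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are all assumptions, and the formula used in this project may differ.

```python
import math
from collections import defaultdict


def bm25_rank(query_terms, postings, doc_lengths, k1=1.2, b=0.75, top_k=10):
    """Rank documents for a query with BM25 (hypothetical helper).

    postings:    term -> list of (doc id, term frequency in that doc)
    doc_lengths: list where doc_lengths[doc id] is the doc's token count
    Returns up to top_k (doc id, score) pairs, best first.
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths) / n_docs
    scores = defaultdict(float)
    for term in query_terms:
        plist = postings.get(term, [])
        if not plist:
            continue  # term does not occur in the collection
        df = len(plist)  # number of documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for doc_id, tf in plist:
            # Length normalization: long documents are penalized.
            norm = k1 * ((1 - b) + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy data (made up for illustration).
toy_postings = {"cat": [(0, 1), (1, 1)], "mat": [(0, 1)], "dog": [(1, 1)]}
toy_lengths = [6, 5]
ranked = bm25_rank(["cat", "mat"], toy_postings, toy_lengths)
```

Only documents that appear in the postings list of at least one query term are scored, so the cost of a query depends on the postings it touches rather than on the size of the collection.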

Future Implementation:

Cosine Similarity (Vector Space)

Cosine

Dk
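A rough sketch of what cosine-similarity scoring could look like. It assumes raw term counts as vector weights and a hypothetical function name `cosine_similarity`. The planned implementation may use different weights (for example tf-idf) and a different document-length normalization.

```python
import math
from collections import Counter


def cosine_similarity(query_tokens, doc_tokens):
    """Cosine of the angle between two term-count vectors (hypothetical helper)."""
    q = Counter(query_tokens)
    d = Counter(doc_tokens)
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(q[term] * d[term] for term in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # an empty vector has no direction
    return dot / (q_norm * d_norm)
```

The score is 0 when the query and document share no terms and approaches 1 as their term distributions become proportional.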

Indexing

The Index Engine builds an inverted index: as it indexes each document, it tokenizes the text, assigns each word an id, and maps that id to a postings list. Each entry in a postings list consists of a document id and the number of times the word appears in that document. Because most words do not occur in most documents, an inverted index saves a significant amount of space compared to a full term-document matrix.
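The indexing pass described above can be sketched as follows. This is a simplified illustration under assumptions, not the repository's IndexEngine: the names `tokenize` and `build_index`, the lowercase alphanumeric tokenizer, and the use of list positions as document ids are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a lexicon and an inverted index from a list of document strings.

    Returns (lexicon, postings, doc_lengths) where
      lexicon:     term -> integer term id
      postings:    term id -> list of (doc id, count in that doc)
      doc_lengths: list of token counts, indexed by doc id
    """
    lexicon = {}
    postings = defaultdict(list)
    doc_lengths = []
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        doc_lengths.append(len(tokens))
        counts = Counter()
        for token in tokens:
            # Assign the next free id the first time a term is seen.
            term_id = lexicon.setdefault(token, len(lexicon))
            counts[term_id] += 1
        for term_id, count in counts.items():
            postings[term_id].append((doc_id, count))
    return lexicon, postings, doc_lengths


docs = ["the cat sat on the mat", "the dog chased the cat"]
lexicon, postings, doc_lengths = build_index(docs)
# postings[lexicon["the"]] is [(0, 2), (1, 2)]
```

Documents are processed in id order, so each postings list comes out sorted by document id without an extra sort.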

Snippets

Query Biased Summary

The snippet summary shown underneath each document is biased towards the given query and is therefore dynamic. In other words, the snippet for the same document changes depending on the query that the user inputs.
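One simple way to produce a query-biased snippet is to score each sentence by how many query terms it contains and keep the best ones. The sketch below assumes that approach. The function name `query_biased_snippet`, the punctuation-based sentence splitter, and the scoring order are hypothetical, and the Snippet Engine in this project may score sentences differently.

```python
import re


def query_biased_snippet(text, query, max_sentences=2):
    """Pick the sentences that best match the query (hypothetical helper)."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        distinct = len(query_terms.intersection(tokens))  # distinct query terms hit
        hits = sum(1 for token in tokens if token in query_terms)  # total hits
        # Negative position makes earlier sentences win ties.
        scored.append((distinct, hits, -position, sentence))
    scored.sort(reverse=True)
    best = scored[:max_sentences]
    best.sort(key=lambda item: -item[2])  # restore document order
    return " ".join(item[3] for item in best)


article = (
    "The city council met on Tuesday. "
    "Earthquake damage closed two freeways. "
    "Officials expect repairs to the freeways within weeks."
)
snippet = query_biased_snippet(article, "earthquake freeways", max_sentences=1)
# snippet is "Earthquake damage closed two freeways."
```

Calling the same function on the same article with the query "council" returns the first sentence instead, which is the dynamic behavior described above.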

About

(NLP) A fully functional search engine, developed from scratch.
