Wikipedia-Search-Engine

This project builds an inverted index over a Wikipedia dump of around 60 GB and uses it to retrieve documents relevant to a search query, with an average retrieval time of around 1 second per query. The dataset can be found here: ftp://10.4.17.131/Datasets/IRE_Monsoon_2017/WikiSearch/
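
To illustrate the idea of an inverted index, here is a minimal in-memory sketch. It is not the code from this repository: the `tokenize`, `build_inverted_index` and `search` helpers, the sample pages, and the ranking by summed term frequency are all assumptions made for the example. The real index is built on disk because the dump is far too large to hold in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to {page_id: term_frequency}.

    `pages` is an iterable of (page_id, text) pairs (hypothetical input shape).
    """
    index = defaultdict(dict)
    for page_id, text in pages:
        for term in tokenize(text):
            index[term][page_id] = index[term].get(page_id, 0) + 1
    return index


def search(index, query):
    """Return page ids ranked by summed term frequency over the query terms."""
    scores = defaultdict(int)
    for term in tokenize(query):
        # .get() avoids creating empty entries for unknown terms.
        for page_id, tf in index.get(term, {}).items():
            scores[page_id] += tf
    # Highest score first; ties are broken by page id.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Illustrative data only.
pages = [
    (1, "Sachin Tendulkar is a cricket player"),
    (2, "Cricket is a sport; cricket is popular"),
    (3, "Virat Kohli plays cricket"),
]
index = build_inverted_index(pages)
```

Calling `search(index, "cricket")` looks up a single posting list instead of scanning every page, which is what keeps query time low.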

CODE FILES:

parsing.py - Contains all functions related to XML parsing.
processing.py - Contains the function that takes the title, id and content of a wiki page as input and preprocesses it.
indexing.py - Performs the actual indexing and creates the secondary-level index files and offset files.
search.py - Main file containing all the code for query processing and retrieval of results.
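
As a rough sketch of why offset files help, the example below writes one posting line per term_id and records the byte offset of each line, so that a lookup can seek straight to the right line instead of reading the whole index. The file layout, the posting format (`page_id:count`) and the function names are assumptions for illustration and may differ from what indexing.py actually produces.

```python
import os
import tempfile


def write_index_with_offsets(postings, index_path, offset_path):
    """Write one 'term_id postings' line per term and record each line's byte offset."""
    with open(index_path, "wb") as index_file, open(offset_path, "w") as offset_file:
        for term_id in sorted(postings):
            # tell() gives the byte position where this term's line starts.
            offset_file.write(f"{term_id} {index_file.tell()}\n")
            line = f"{term_id} {postings[term_id]}\n"
            index_file.write(line.encode("utf-8"))


def load_offsets(offset_path):
    """Load the term_id -> byte offset map from the offset file."""
    with open(offset_path) as f:
        return {int(t): int(o) for t, o in (line.split() for line in f)}


def lookup(term_id, offsets, index_path):
    """Seek directly to the term's line and return its posting string, or None."""
    if term_id not in offsets:
        return None
    with open(index_path, "rb") as f:
        f.seek(offsets[term_id])
        line = f.readline().decode("utf-8").rstrip("\n")
    return line.split(" ", 1)[1]


# Illustrative data only: term_id -> "page_id:count ..." posting string.
postings = {0: "12:3 40:1", 1: "7:2", 2: "12:1 99:5"}
workdir = tempfile.mkdtemp()
index_path = os.path.join(workdir, "index.txt")
offset_path = os.path.join(workdir, "offsets.txt")
write_index_with_offsets(postings, index_path, offset_path)
offsets = load_offsets(offset_path)
```

The index file is opened in binary mode so that the recorded offsets are exact byte positions that `seek` can use directly.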

Execution of Code

Prerequisites -

Required Directories

IndexFiles - Initial index gets created here
Secondary_index_files - The secondary-level index files are stored here, organized by term_id
Offset_files - Offset files for these secondary_index files are made here
TermDict - Pickle file containing the term-term_id map is made here
Title_id_map - File containing page_id-title map is made here
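
The directories listed above can be created in one step. This is a small helper sketch, assuming all five names are directories located next to the code files; `ensure_directories` is a hypothetical name and is not part of the repository.

```python
import os

# Directory names taken from the list above.
REQUIRED_DIRS = [
    "IndexFiles",
    "Secondary_index_files",
    "Offset_files",
    "TermDict",
    "Title_id_map",
]


def ensure_directories(base_dir="."):
    """Create any missing required directory and return the names that were created."""
    created = []
    for name in REQUIRED_DIRS:
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(name)
    return created
```

Running `ensure_directories()` from the project root before indexing should be enough; calling it again creates nothing and returns an empty list.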

Required Files

full_wiki.xml - The XML file containing the full Wikipedia data

Execution -

Run search.py - It runs an infinite loop that accepts queries until the exit command is entered.
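
The query loop can be pictured roughly as follows. This is a sketch only: it assumes the exit command is the literal word `exit`, and `run_query_loop` and `answer_query` are hypothetical names standing in for whatever search.py actually does.

```python
def run_query_loop(answer_query, read_line=input, write_line=print):
    """Read queries until 'exit' (or end of input) and print each result.

    `answer_query` is a hypothetical callable that takes a query string and
    returns an iterable of results. Returns the number of queries handled.
    """
    handled = 0
    while True:
        try:
            query = read_line("query> ").strip()
        except EOFError:
            break
        if query.lower() == "exit":
            break
        if not query:
            # Ignore empty lines and prompt again.
            continue
        for result in answer_query(query):
            write_line(result)
        handled += 1
    return handled
```

Passing the input and output functions as parameters keeps the loop easy to exercise without a terminal; with the defaults it reads from standard input and prints to standard output.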

Types of Queries -

Field query - Fields are lowercase letters (b, i, c, t, r, e), each followed by a colon, and the field terms are space separated. Example: “b:sachin i:2003 c:sports”
Boolean query - Boolean operators are given in capitals (AND, OR, NOT) and the remaining words are space separated. Example: “Sachin AND Dhoni NOT Sehwag”
Normal query - Any sequence of words that doesn’t satisfy the above conditions is treated as a normal query. Example: “Virat Kohli”
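
The three query types above could be told apart along these lines. This is a hedged sketch rather than the logic in search.py: it assumes a query counts as a field query only when every token has the `letter:term` form, and `classify_query` and `parse_field_query` are hypothetical names.

```python
import re

# A field token is one of the listed field letters, a colon, then a term.
FIELD_TOKEN = re.compile(r"^[bicrte]:\S+$")
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def classify_query(query):
    """Return 'field', 'boolean' or 'normal' for a raw query string."""
    tokens = query.split()
    if tokens and all(FIELD_TOKEN.match(tok) for tok in tokens):
        return "field"
    # Operators are matched case-sensitively, so lowercase 'and' is a plain word.
    if any(tok in BOOLEAN_OPERATORS for tok in tokens):
        return "boolean"
    return "normal"


def parse_field_query(query):
    """Split a field query into {field_letter: [terms]} with lowercased terms."""
    fields = {}
    for tok in query.split():
        field, _, term = tok.partition(":")
        fields.setdefault(field, []).append(term.lower())
    return fields
```

For example, `classify_query("b:sachin i:2003 c:sports")` returns `"field"`, while `classify_query("Virat Kohli")` falls through to `"normal"`.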

Performance -

For queries of -

fewer than 3 words, time to fetch results is < 1s

between 3 and 7 words, time to fetch results is around 2-3s

Repository: rishabhmurarka7/Wikipedia-Search-Engine