WikiSearch is a tool that aims to emulate a search engine as accurately as possible.
The user enters a query through a command-line interface, and the most relevant Wikipedia pages
are retrieved from the Simple English Wikipedia dump of April 2007.
The 10 most relevant pages are saved in [result.html].
The underlying algorithm is based on Taher H. Haveliwala's paper "Topic-Sensitive PageRank".
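As a rough illustration of the idea in that paper, the sketch below runs PageRank by power iteration with a teleport step that only jumps to pages belonging to one topic, so the resulting rank is biased toward that topic. It is not the code of this tool: the function name `topic_pagerank`, the graph layout (a dict mapping each page to a list of outgoing links), the damping factor of 0.85, and the handling of pages without outgoing links are all assumptions made for the example.

```python
def topic_pagerank(graph, topic_pages, damping=0.85, iterations=50):
    """Topic-biased PageRank by power iteration (illustrative sketch).

    graph:       dict mapping a page to the list of pages it links to (assumed layout)
    topic_pages: set of pages that belong to the topic; teleports land only here
    """
    nodes = list(graph)
    topic = [n for n in nodes if n in topic_pages]
    if not topic:
        raise ValueError("topic_pages must contain at least one page of the graph")

    # Teleport vector: uniform over the topic pages, zero everywhere else.
    teleport = {n: 0.0 for n in nodes}
    for n in topic:
        teleport[n] = 1.0 / len(topic)

    # Ignore links that point outside the graph.
    out_links = {n: [t for t in graph[n] if t in graph] for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by pages without outgoing links is redistributed
        # along the teleport vector (an assumption of this sketch).
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) + damping * dangling
        new_rank = {n: base * teleport[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for target in out_links[n]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Tiny made-up graph: the same links, ranked for two different topics.
example_graph = {"A": ["B"], "B": ["A", "C"], "C": []}
rank_for_a = topic_pagerank(example_graph, {"A"})
rank_for_c = topic_pagerank(example_graph, {"C"})
```

Running it with different topic sets on the same graph gives different rankings, which is the property the tool relies on at query time.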
- run python3 wikiSearch.py.
- insert your query.
- insert ' quit' if you want to shut down the program.

dataset.py retrieves all the .html files and generates data.json, which stores a graph representing the structure of Wikipedia, and meta.json, which stores the keywords for each Wikipedia page.

corpus.py extracts the corpus of a Wikipedia page and preprocesses it.

pageRank.py computes PageRank.

topicPageRank.py computes a topic-specific PageRank. For each topic, two .json files are created: topic_rank.json, which contains the rank biased toward that topic, and topic_tf_dict.json, which contains a term-frequency dictionary for that topic. The topics used for this tool were selected from the most common keywords in the data and are:

- Actor
- Animal
- Book
- Computer
- Film
- Food
- Football
- Government
- Mathematics
- Music
- Plant

wikiSearch.py is the interface of the tool. It takes the query as input and, after computing the rank according to the paper cited above, returns the 10 most relevant pages.
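The query-time step described in the paper can be sketched as follows: estimate how likely each topic is given the query terms, then score each page as the weighted sum of its topic-specific ranks. This is a minimal sketch and not the code in wikiSearch.py. The function names, the in-memory dict layouts, the uniform topic prior, and the add-one smoothing are assumptions made for the example, and any text-matching step is left out.

```python
def topic_weights(query_terms, topic_tf):
    """Estimate P(topic | query) from per-topic term-frequency dicts.

    Assumes a uniform prior over topics and multiplies smoothed term
    probabilities, then normalizes so the weights sum to 1.
    """
    # Shared vocabulary, so every topic is smoothed over the same term set.
    vocabulary = {term for tf in topic_tf.values() for term in tf}
    vocabulary |= set(query_terms)

    scores = {}
    for topic, tf in topic_tf.items():
        total = sum(tf.values())
        probability = 1.0
        for term in query_terms:
            # Add-one smoothing (an assumption) keeps unseen terms above zero.
            probability *= (tf.get(term, 0) + 1) / (total + len(vocabulary))
        scores[topic] = probability

    norm = sum(scores.values())
    return {topic: score / norm for topic, score in scores.items()}


def blended_scores(query_terms, topic_tf, topic_ranks):
    """Score each page as sum over topics of P(topic | query) * topic rank."""
    weights = topic_weights(query_terms, topic_tf)
    scores = {}
    for topic, weight in weights.items():
        for page, rank in topic_ranks[topic].items():
            scores[page] = scores.get(page, 0.0) + weight * rank
    return scores


def top_pages(scores, k=10):
    """Return the k pages with the highest blended score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up data for two topics and two pages.
example_tf = {
    "Music": {"guitar": 5, "song": 5},
    "Food": {"bread": 5, "cheese": 5},
}
example_ranks = {
    "Music": {"Guitar": 0.7, "Bread": 0.3},
    "Food": {"Guitar": 0.2, "Bread": 0.8},
}
best = top_pages(blended_scores(["guitar"], example_tf, example_ranks), k=1)
```

Because the topic ranks are computed ahead of time, only the topic weights and the weighted sum need to be evaluated per query.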
