Search engine for tilde-based websites
Responsible for:
- Discovering + temp storing users
- Discovering + temp storing public websites
Not creepy at all. Responsible for:
- Downloading and creating per-word document-frequency dictionary for tf-idf
- Storing which websites have been tagged, along with a timestamp and hash of their content
- Pulling keywords and tagging websites into general tag dictionary
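
The three tasks above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function names (`update_doc_freq`, `top_keywords`, `tag_record`), the record layout, the SHA-256 hash and the smoothed idf formula are all assumptions made for the example.

```python
import hashlib
import math
import time
from collections import Counter


def update_doc_freq(doc_freq, tokens):
    """Add one document to the document-frequency dictionary.

    Each distinct token is counted once per document, however often it repeats.
    """
    for token in set(tokens):
        doc_freq[token] = doc_freq.get(token, 0) + 1
    return doc_freq


def top_keywords(tokens, doc_freq, n_docs, k=3):
    """Rank a page's tokens by tf-idf and return the k highest-scoring ones."""
    counts = Counter(tokens)
    scores = {}
    for token, count in counts.items():
        tf = count / len(tokens)
        # Add-one smoothing (an assumption) keeps idf finite for unseen tokens.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(token, 0))) + 1
        scores[token] = tf * idf
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def tag_record(url, text, tokens, doc_freq, n_docs):
    """Build a hypothetical entry for a tagged website."""
    return {
        "url": url,
        "timestamp": int(time.time()),
        # The content hash lets a later crawl detect whether the page changed.
        "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "tags": top_keywords(tokens, doc_freq, n_docs),
    }
```

Comparing a stored hash against a freshly computed one is one way to decide whether a site needs re-tagging; whether the project does exactly this is not stated here.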
Content explanation
- tokenize_corpus and Porter files - are responsible for cleaning corpus data into stemmed tokens. Needs stopwords.txt file in same dir
- data file - interfaces with numerous text and json files for easy data management
- parse_url file - handles html, including requests and parsing text and metadata
- init_freq_dir file - creates and/or updates document frequency dictionary
- crawl file - goes thru urls and gathers tags + metadata for dictionaries
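
As a rough idea of the cleaning step handled by the tokenize_corpus and Porter files, here is a small sketch. It assumes stopwords.txt holds one word per line, and it takes the stemmer as a plain callable (`stem`, identity by default) rather than reimplementing the Porter algorithm; the names `load_stopwords` and `tokenize` are made up for this example.

```python
import re
from pathlib import Path


def load_stopwords(path="stopwords.txt"):
    """Read stop words, assuming one per line; a missing file gives an empty set."""
    file = Path(path)
    if not file.exists():
        return set()
    lines = file.read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def tokenize(text, stopwords, stem=lambda word: word):
    """Lowercase the text, keep alphabetic runs, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(word) for word in words if word not in stopwords]
```

A real stemmer would be passed in as `stem`, for example `tokenize(page_text, load_stopwords(), stem=my_porter_stem)`, where `my_porter_stem` stands in for whatever the Porter file exposes.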
This document last updated: Jul 20 2020