Daniel Zheng, daniel.zheng@pitt.edu
Originally, I wanted to create a way to automatically clean up my bookmarks bar. I have hundreds, maybe thousands, of bookmarks floating around that are not sorted into their proper folders.
I then thought it would be cool to make a Chrome extension that takes a list of bookmarks (URLs), automatically clusters them into folders, and rearranges your bookmarks bar!
This would entail:
- Learning Chrome's bookmarks API and making the extension
- Hierarchical clustering
- Cluster labeling (folder names)
For this project, I decided to focus on clustering and labeling, and investigated both K-means and hierarchical methods.
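To make the clustering and labeling idea concrete, here is a minimal sketch using only the Python standard library. It is not the code from the notebooks in this repo: the function names (`tfidf_vectors`, `agglomerative`, `label`), the whitespace tokenizer, the smoothed IDF formula, and the choice of average linkage with cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document (naive whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vectors, k):
    """Merge the two most similar clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters


def label(cluster, vectors, top=2):
    """Label a cluster with its highest-weighted terms (summed TF-IDF)."""
    totals = Counter()
    for i in cluster:
        totals.update(vectors[i])
    return [term for term, _ in totals.most_common(top)]


docs = [
    "python pandas dataframe tutorial",
    "python numpy array tutorial",
    "chocolate cake recipe baking",
    "banana bread recipe baking",
]
vectors = tfidf_vectors(docs)
for cluster in agglomerative(vectors, k=2):
    print(sorted(cluster), label(cluster, vectors))
```

The pairwise search is quadratic per merge, so this only suits tiny samples; a real run over many articles would want a library implementation.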
I used a 12GB text dump of Wikipedia found here. Since the clustering is unsupervised, it isn't necessary to run the algorithms on the entire dataset, though I eventually will do so. The data folder contains a small sample of JSON files of Wikipedia articles.
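As a rough sketch of how the sample could be read in, the snippet below loads every `.json` file in a folder, assuming one JSON object per file. The function name `load_articles` and the `"text"` and `"title"` keys are assumptions; check them against the actual files in the data folder.

```python
import json
from pathlib import Path


def load_articles(folder):
    """Load every *.json file in folder, assuming one JSON object per file."""
    articles = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            articles.append(json.load(handle))
    return articles


def article_texts(articles, key="text"):
    """Pull the body text out of each article; the key name is an assumption."""
    return [article.get(key, "") for article in articles]
```

For example, `article_texts(load_articles("data"))` would return a list of strings ready to be vectorized and clustered.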
- Practicing topic extraction
- Visitor log
- Final Report
- K-means Clustering and Visualization
- Hierarchical Clustering and Visualization
- Class Presentation Slides
- Data, JSON files of Wikipedia articles
- Images, folder of all images used in repository
- Cluster Pickle Files, saved .pkl files of clustering outputs
- Preprocessing
- Wikiextractor, a submodule of the repo used to convert the Wikipedia XML dump to JSON
- Reformatting script, a short custom script for further preprocessing that converts the JSON Lines output into one dict per file.
- License
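The reformatting step listed under Preprocessing (JSON Lines in, one dict per file out) could look roughly like the sketch below. This is not the repo's actual script: the function name `split_json_lines` and the file naming scheme (the record's `"id"` field if present, otherwise its line number) are hypothetical.

```python
import json
from pathlib import Path


def split_json_lines(src, out_dir):
    """Write each record of a JSON Lines file to its own .json file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src, encoding="utf-8") as lines:
        for number, line in enumerate(lines):
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            # Hypothetical naming: use "id" when available, else the line number.
            name = str(record.get("id", number))
            target = out_dir / f"{name}.json"
            target.write_text(
                json.dumps(record, ensure_ascii=False), encoding="utf-8"
            )
            written.append(target)
    return written
```

Each output file then holds a single dict, which is convenient for loading a small random sample without parsing the whole dump.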
NOTE: Before running any of this code, I recommend installing all dependencies with pip install -r requirements.txt.