Data-Science-for-Linguists/Document_Clustering

Automatic Bookmark Organizing with Document Clustering.

Daniel Zheng, daniel.zheng@pitt.edu

Originally, I wanted a way to automatically clean up my bookmarks bar. I have hundreds, maybe thousands, of bookmarks floating around that were never sorted into their proper folders.

I then thought it would be cool to make a Chrome extension that takes a list of bookmarks (URLs), automatically clusters them into folders, and rearranges your bookmarks bar!

This would entail:

  • Learning Chrome's bookmarks API and making the extension
  • Hierarchical clustering
  • Cluster labeling (folder names)

For this project, I decided to focus on clustering and labeling. I investigated both k-means and hierarchical methods.
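To make the clustering and labeling steps concrete, here is a minimal standard-library sketch of the hierarchical variant. It is not the code in this repository: the function names (`tfidf`, `agglomerative`, `label`), the average-linkage choice, the cosine similarity measure, and the sample bookmark titles are all assumptions made for illustration. Each cluster is labeled with its highest-weighted TF-IDF terms, which could serve as a folder name.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return one sparse, L2-normalised TF-IDF vector (a dict) per document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency, so no term gets weight zero.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of doc indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the highest mean pairwise similarity.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def label(cluster, vectors, top_n=2):
    """Label a cluster with its top_n terms by summed TF-IDF weight."""
    totals = Counter()
    for idx in cluster:
        totals.update(vectors[idx])
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


# Hypothetical bookmark titles, used only to exercise the functions.
titles = [
    "python pandas dataframe tutorial",
    "python numpy array tutorial",
    "sourdough bread recipe",
    "banana bread recipe",
]
vectors = tfidf(titles)
clusters = agglomerative(vectors, n_clusters=2)
for cluster in clusters:
    print(label(cluster, vectors), [titles[i] for i in cluster])
```

The pairwise search recomputes similarities on every merge, which is fine for a handful of documents but too slow for a large corpus; a library implementation would be the practical choice there.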

Dataset

For this project, I used a 12GB text dump of Wikipedia found here. Since the clustering is unsupervised, it isn't necessary to run the algorithms on the entire dataset, though I eventually will do so. The data folder contains a small sample of Wikipedia articles as JSON files.
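Because only a subset of the articles is needed, one way to draw a reproducible sample from a folder of JSON files is sketched below. The file layout is an assumption: the sketch supposes each file holds a single JSON object with hypothetical `title` and `text` fields, which may not match the actual files in the data folder. The function name `load_sample` is likewise made up for this example.

```python
import json
import random
from pathlib import Path


def load_sample(folder, k, seed=0):
    """Return up to k (title, text) pairs from the *.json files in folder.

    Assumes one JSON object per file with hypothetical "title" and "text"
    keys; adjust the key names to match the real files.
    """
    paths = sorted(Path(folder).glob("*.json"))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = rng.sample(paths, min(k, len(paths)))
    articles = []
    for path in chosen:
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        # Fall back to the file name if the assumed "title" key is missing.
        articles.append((record.get("title", path.stem), record.get("text", "")))
    return articles
```

Calling `load_sample("data", 100)` would then return at most 100 articles, and the text of each pair can be passed to whichever clustering method is being tested.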

Repository Contents

NOTE: Before running any of this code, install all dependencies with pip install -r requirements.txt.

About

Dan's term project for LING1340.
