Automatic Bookmark Organizing with Document Clustering.

Originally, I wanted to create a way to automatically clean up my bookmarks bar. I have hundreds, maybe thousands, of bookmarks floating around that were never sorted into proper folders.

I then thought it would be cool to make a Chrome extension that takes a list of bookmarks (URLs), automatically clusters them into folders, and rearranges your bookmarks bar!

This would entail:

Learning Chrome's bookmarks API and making the extension
Hierarchical clustering
Cluster labeling (folder names)

For this project, I decided to focus on clustering and labeling. I investigated both K-means and hierarchical methods.
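As a rough illustration of those two steps, here is a minimal, dependency-free sketch: it builds TF-IDF vectors, groups documents with average-linkage agglomerative clustering, and labels each cluster by its highest-weighted terms. The sample titles, the stop-word list, and all function names are assumptions made up for this example. It is not the code in the notebooks, which may use different libraries and parameters.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"a", "and", "for", "in", "no", "of", "the", "to", "with"}


def tokenize(text):
    # Lowercase, treat every non-letter as a separator, drop stop words.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    vectors = []
    for c in counts:
        # Term frequency times a smoothed inverse document frequency.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def agglomerative(vectors, k):
    """Average-linkage clustering; returns k clusters as lists of doc indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


def label(cluster, vectors, top_n=2):
    """Name a cluster after the terms with the largest summed TF-IDF weight."""
    total = Counter()
    for idx in cluster:
        for term, weight in vectors[idx].items():
            total[term] += weight
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


# Hypothetical bookmark titles standing in for fetched page text.
titles = [
    "Python tutorial: lists and dictionaries in Python",
    "Python packaging guide for Python libraries",
    "Sourdough bread recipe with rye flour",
    "Easy bread recipe: no-knead bread",
]
vectors = tfidf_vectors(titles)
for members in agglomerative(vectors, k=2):
    print(label(members, vectors), [titles[i] for i in members])
```

The nested loops scale poorly with the number of documents, so on a real corpus a library such as scikit-learn (`KMeans`, `AgglomerativeClustering`) would be a more practical choice.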

Dataset

For this project, I used a 12GB text dump of Wikipedia found here. Since the clustering is unsupervised, it isn't necessary to run the algorithms on the entire dataset, though I eventually will do so. The data folder contains a small sample of JSON files of Wikipedia articles.

Repository Contents

Practicing topic extraction
Visitor log
Final Report
K-means Clustering and Visualization
- Jupyter Notebook
- Markdown
Hierarchical Clustering and Visualization
- Jupyter Notebook
- Markdown
Class Presentation Slides
Data, JSON files of Wikipedia articles
Images, folder of all images used in the repository
Cluster Pickle Files, saved .pkl files of clustering outputs
Preprocessing
- Wikiextractor, a submodule of the repo used to convert the Wikipedia XML dump to JSON
- Reformatting script, a short custom script for further preprocessing that converts the JSON Lines output into one dict per file
License
- GPL License
- License Explanation
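For readers unfamiliar with the reformatting step listed under Preprocessing, the sketch below shows one way to split a JSON Lines file (one JSON object per line) into one JSON file per article. The function name, the output file naming scheme, and the paths in the usage comment are assumptions for illustration; reformat.py itself may work differently.

```python
import json
from pathlib import Path


def split_json_lines(src, out_dir):
    """Write each non-empty line of a JSON Lines file to its own .json file.

    Returns the list of files written. Output names are derived from the
    source file name plus the line index (a naming scheme assumed here).
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            article = json.loads(line)  # one dict per line
            target = out_dir / f"{src.stem}_{index:05d}.json"
            target.write_text(json.dumps(article, ensure_ascii=False),
                              encoding="utf-8")
            written.append(target)
    return written


# Hypothetical usage; both paths are placeholders:
# split_json_lines("extracted/wiki_00", "data")
```

Keeping one dict per file makes it easy to load a small sample of articles without reading the whole dump into memory.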

NOTE: Before running any of this code, I would recommend installing all dependencies with pip install -r requirements.txt.
