Jinho D. Choi edited this page Mar 14, 2017 · 13 revisions

Document Clustering

  • Download docs.trn.tsv.

  • Each line in the file represents one document, in the following format:

    line ::= <label><tab><document>
    document ::= <token>(<space><token>)*
    
  • Create a vector for each document using bag-of-words and TF-IDF. Sample Python code for creating the vectors can be found here: hw1.py.

  • Implement the k-means clustering algorithm and run it on all documents with k = 7, using both the bag-of-words and the TF-IDF vectors.

  • Experiment with different sets of randomly selected centroids. Measure the purity score of each trial.

  • Implement and experiment with the k-means++ clustering algorithm, and compare its results with those of the k-means clustering algorithm.

  • Write a report describing your approach, results, and analysis. Use the ACL LaTeX template.
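
As a rough illustration of the input format and the two vector types, the sketch below parses lines of the form <label><tab><document> and builds sparse bag-of-words and TF-IDF vectors. It is not hw1.py and may differ from it: the function names are hypothetical, the two sample lines are made up rather than taken from docs.trn.tsv, and the TF-IDF weighting (raw count times log(N / document frequency)) is one common variant assumed here.

```python
import math
from collections import Counter


def parse_line(line):
    """Split one '<label><tab><document>' line into (label, tokens)."""
    label, document = line.rstrip("\n").split("\t", 1)
    return label, document.split(" ")


def bag_of_words(tokens):
    """Map each token to its raw count in the document."""
    return Counter(tokens)


def tf_idf(bows):
    """Reweight raw counts by log(N / document frequency).

    Assumption: this particular weighting is only one of several
    TF-IDF variants; the course sample code may use another.
    """
    n_docs = len(bows)
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bow in bows for term in bow)
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in bow.items()}
        for bow in bows
    ]


# Two made-up lines in the documented format (not real data).
lines = ["sports\tball game ball\n", "tech\tchip game\n"]
labels, docs = zip(*(parse_line(line) for line in lines))
bows = [bag_of_words(tokens) for tokens in docs]
vectors = tf_idf(bows)
```

Terms that occur in every document (here "game") receive a TF-IDF weight of zero under this variant, which is one reason the two representations can cluster differently.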

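The seeding step that distinguishes k-means++ from k-means with random centroids, and the purity score used to compare trials, can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the function names are hypothetical, vectors are dense lists of floats, the toy points and labels are invented, and purity is assumed to follow the usual definition (the fraction of documents that carry the majority gold label of their cluster).

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Choose k initial centroids in the k-means++ manner.

    The first centroid is drawn uniformly at random; each later one is
    drawn with probability proportional to its squared distance from the
    nearest centroid chosen so far. Assumes at least k distinct points,
    otherwise all weights become zero and rng.choices raises ValueError.
    """
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(squared_distance(p, c) for c in centroids)
                   for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def purity(assignments, labels):
    """Fraction of items whose label is the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(assignments, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


# Toy data (invented): two well-separated groups with gold labels.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
gold = ["a", "a", "b", "b"]

seeds = kmeans_pp_init(points, 2, random.Random(0))
# Assign every point to its nearest seed (one assignment step only).
assignments = [
    min(range(len(seeds)), key=lambda j: squared_distance(p, seeds[j]))
    for p in points
]
score = purity(assignments, gold)
```

Passing a seeded random.Random instance makes each trial reproducible, so purity scores from different seeds can be reported side by side for the two initialization schemes.
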
Submission

CS571: Natural Language Processing

Instructor


Emory University
