Homework 1

Document Clustering

Download docs.trn.tsv.
Each line in the file represents a document, where the format is as follows:
```
line ::= <label><tab><document>
document ::= <token>(<space><token>)*
```
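The grammar above could be read with a few lines of Python. This is a minimal sketch, not the course's reference code: the function names `parse_line` and `read_documents` are hypothetical, and the file is assumed to be UTF-8 encoded.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one '<label><tab><document>' line into a label and its tokens."""
    # Split on the first tab only, so the document part stays intact.
    label, document = line.rstrip("\n").split("\t", 1)
    # Tokens are separated by single spaces, per the grammar.
    return label, document.split(" ")


def read_documents(path: str) -> List[Tuple[str, List[str]]]:
    """Read every non-empty line of a file in the format above."""
    with open(path, encoding="utf-8") as fin:  # encoding is an assumption
        return [parse_line(line) for line in fin if line.strip()]
```

For example, `read_documents("docs.trn.tsv")` would return a list of `(label, tokens)` pairs, one per document.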
Create a vector for each document using bag-of-words and TF-IDF. Sample Python code for the vector creation can be found in hw1.py.
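The two representations could be sketched as sparse dictionaries. This sketch does not reproduce hw1.py, whose contents are not shown here. It assumes raw counts for term frequency and `log(N / df)` for inverse document frequency, and other TF-IDF variants exist. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(tokens: List[str]) -> Dict[str, int]:
    """Map each token to its raw count in the document."""
    return dict(Counter(tokens))


def document_frequencies(documents: List[List[str]]) -> Dict[str, int]:
    """Count how many documents contain each token at least once."""
    df: Counter = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return dict(df)


def tf_idf(tokens: List[str], df: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Weight each raw count by log(N / df); this weighting is an assumption."""
    return {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items()}
```

With this weighting, a token that appears in every document gets a weight of zero, which is the usual motivation for TF-IDF over raw counts.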
Implement and run the k-means clustering algorithm on all documents using both bag-of-words and TF-IDF, where k = 7.
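As a rough illustration of the algorithm, here is a minimal k-means sketch over dense vectors. It makes three assumptions that the assignment does not specify: squared Euclidean distance, initial centroids sampled from the data points, and a stop when assignments no longer change. The names `k_means`, `seed`, and `max_iter` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors: List[Vector], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    rng = random.Random(seed)
    # Initialise with k distinct data points chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no vector changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return assignments, centroids
```

Calling `k_means(vectors, 7, seed=s)` with different values of `s` gives one trial per set of initial centroids. For the full vocabulary, a sparse representation would likely be needed for speed.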
Experiment with different sets of randomly selected centroids. Measure the purity score of each trial.
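Purity is commonly defined as the fraction of documents that carry the majority gold label of their cluster: sum the size of the largest label group in each cluster and divide by the total number of documents. A minimal sketch of that definition follows. The function name `purity` and its argument names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List


def purity(assignments: List[int], labels: List[Hashable]) -> float:
    """Fraction of items whose label is the majority label of their cluster."""
    clusters: Dict[int, Counter] = defaultdict(Counter)
    for cluster, label in zip(assignments, labels):
        clusters[cluster][label] += 1
    # For each cluster, count only the items with its most frequent label.
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels)
```

Purity ranges up to 1.0 for clusters that each contain a single label. It also rises as the number of clusters grows, so scores are comparable only across trials with the same k.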
Implement and experiment with the k-means++ clustering algorithm, and compare its results with those of the k-means clustering algorithm.
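k-means++ differs from plain k-means only in how the initial centroids are chosen. The first is drawn uniformly at random, and each later one is drawn with probability proportional to its squared distance from the nearest centroid already chosen. A minimal sketch of that seeding step follows. It assumes squared Euclidean distance and at least k distinct points, and the name `k_means_pp_init` is hypothetical.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means_pp_init(vectors: List[Vector], k: int, seed: int = 0) -> List[List[float]]:
    """Choose k initial centroids using D^2-weighted sampling."""
    rng = random.Random(seed)
    # First centroid: uniform over the data points.
    centroids = [list(rng.choice(vectors))]
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        weights = [min(squared_distance(v, c) for c in centroids) for v in vectors]
        # Already-chosen points have weight 0 and cannot be drawn again.
        centroids.append(list(rng.choices(vectors, weights=weights, k=1)[0]))
    return centroids
```

The returned centroids would replace the randomly selected ones at the start of the usual k-means iterations. The assignment and update steps stay the same.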
Write a report describing your approach, results, and analysis. Use the ACL LaTeX template.

Submission

Compress your code and report into hw1.zip and submit it to: https://canvas.emory.edu/courses/29596/assignments/30886

CS571: Natural Language Processing

Instructor

Jinho D. Choi

Emory University
