This is a little experiment I ran to cluster economics articles by running the Latent Dirichlet Allocation algorithm on their titles and abstracts.
The code is in Python 3 and some of it is in the form of an IPython notebook. The NLTK, LDA, Scikit-Learn and Matplotlib libraries are used.
The data consists of bibliographic records for selected journals downloaded from RePEc. The tree under ftp://ftp.repec.org/opt/amf/RePEc/ contains metadata in the AMF format. The file repec.tar.xz in the repository contains a mirror of selected parts of this tree.
The clustering code expects the data to live in an SQLite database named repec.sqlite. To create this database, run:
$ ./readrepec.py repec.tar.xz repec.sqlite
$ ./add_jnames.py repec.sqlite
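To check that the database was built, it can help to list its tables and row counts. The snippet below is a minimal sketch using only Python's standard sqlite3 module. The helper name list_tables is hypothetical and not part of this repository, and the sketch deliberately assumes nothing about the schema that readrepec.py and add_jnames.py create.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in an SQLite file.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    # sqlite3.connect() silently creates a missing file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    con = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Quote each table name, since it is interpolated into the query.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()
```

Calling list_tables("repec.sqlite") after the two commands above should return a non-empty dictionary if the import worked. The table names depend on the scripts and are not assumed here.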
The LDA model itself is fitted by the IPython notebook classify.ipynb. When run, it creates the HTML files index.html and cluster[nn].html, which contain links to the clustered articles.
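For readers unfamiliar with LDA-based clustering, the general idea can be sketched as follows. This is not the code in classify.ipynb: it assumes scikit-learn's LatentDirichletAllocation (the notebook may use the lda package instead), the function name cluster_texts is hypothetical, and the toy texts are made up for illustration. Each text is turned into a bag-of-words count vector, an LDA model is fitted, and each text is assigned to its most probable topic.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for "title + abstract" strings; not real RePEc data.
TOY_TEXTS = [
    "monetary policy inflation interest rates central bank",
    "inflation targeting and central bank credibility",
    "labor market unemployment wages job search",
    "minimum wages and unemployment among young workers",
]


def cluster_texts(texts, n_topics=2, seed=0):
    """Fit LDA on word counts and label each text with its top topic.

    Hypothetical helper for illustration only.
    """
    # Bag-of-words counts, dropping common English stop words.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(texts)

    model = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    # Rows of doc_topic are per-document topic distributions (each sums to 1).
    doc_topic = model.fit_transform(counts)

    # Hard cluster label = index of the most probable topic.
    labels = doc_topic.argmax(axis=1)
    return labels, doc_topic


labels, doc_topic = cluster_texts(TOY_TEXTS, n_topics=2)
for text, label in zip(TOY_TEXTS, labels):
    print(label, text)
```

Taking the most probable topic is the simplest way to turn LDA's soft topic mixture into hard clusters. With so few toy documents the assignments are not meaningful; the sketch only shows the shape of the pipeline.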