GitHub - jmoy/repec-lda: Classifying economics articles using Latent Dirichlet Allocation

This is a little experiment I ran to cluster economics articles by running the Latent Dirichlet Allocation algorithm on their titles and abstracts.

Dependencies

The code is written in Python 3, and part of it takes the form of an IPython notebook. It uses the NLTK, LDA, Scikit-Learn and Matplotlib libraries.

Data

The data consists of bibliographic data for selected journals downloaded from RePEc. The tree under ftp://ftp.repec.org/opt/amf/RePEc/ contains metadata in the AMF format. The file repec.tar.xz in the repository contains a mirror of selected parts of this tree.
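To see what the bundled archive holds before building the database, something like the following sketch may help. It uses only Python's standard tarfile module. The function name list_archive_members and its limit parameter are illustrative assumptions and are not part of this repository.

```python
import tarfile


def list_archive_members(archive_path, limit=10):
    """Return the names of up to `limit` regular files in an xz-compressed tar archive.

    Hypothetical helper for inspection only; it does not parse the metadata.
    """
    names = []
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive:
            # Skip directories, links and other non-file entries.
            if member.isfile():
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names
```

For example, list_archive_members("repec.tar.xz") would return the first few file paths stored in the archive.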

Running

The clustering code expects the data to be in an SQLite database named repec.sqlite. To create this database, run:

$ ./readrepec.py repec.tar.xz repec.sqlite
$ ./add_jnames.py repec.sqlite
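After the two commands above finish, a quick sanity check of the resulting database could look like the sketch below. It makes no assumption about the schema the scripts create: it only asks SQLite which tables exist and counts their rows. The function name table_row_counts is hypothetical and not part of this repository.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Map every table name in the SQLite file at `db_path` to its row count."""
    with closing(sqlite3.connect(db_path)) as conn:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for table in tables:
            # Table names come from the catalogue itself, so quoting is sufficient here.
            counts[table] = conn.execute(
                f'SELECT COUNT(*) FROM "{table}"'
            ).fetchone()[0]
        return counts
```

Calling table_row_counts("repec.sqlite") returns a dictionary; an empty dictionary would suggest the database was not populated.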

The actual fitting of the LDA model is done by the IPython notebook classify.ipynb. On being run, it creates HTML files index.html and cluster[nn].html containing links to the clustered articles.
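For readers unfamiliar with the technique, here is a dependency-free sketch of LDA fitted by collapsed Gibbs sampling. It is not the notebook's code, which relies on the libraries listed under Dependencies. The function name fit_lda, the hyperparameter defaults and the toy documents are all assumptions made for illustration.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenised documents with a collapsed Gibbs sampler.

    Returns (vocab, doc_topic, topic_word), where doc_topic[d][k] counts the
    tokens of document d assigned to topic k, and topic_word[k][v] counts the
    assignments of vocabulary word v to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word


if __name__ == "__main__":
    # Made-up toy "abstracts", already tokenised.
    toy_docs = [
        ["tax", "income", "tax"],
        ["trade", "tariff", "trade"],
        ["income", "tax"],
        ["tariff", "trade"],
    ]
    vocab, doc_topic, topic_word = fit_lda(toy_docs, n_topics=2)
    for d, counts in enumerate(doc_topic):
        # Assign each document to its most frequent topic.
        print(d, counts.index(max(counts)))
```

Assigning each document to its highest-count topic, as in the usage example, is one simple way to turn the fitted model into clusters. Real abstracts would also need tokenisation and stop-word removal first.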

