Classifying economics articles using Latent Dirichlet Allocation

jmoy/repec-lda

This is a little experiment I ran to cluster economics articles by running the Latent Dirichlet Allocation algorithm on their titles and abstracts.

Dependencies

The code is written in Python 3, and part of it is an IPython notebook. It uses the NLTK, LDA, Scikit-Learn and Matplotlib libraries.

Data

The data consists of bibliographic records for selected journals, downloaded from RePEc. The tree under ftp://ftp.repec.org/opt/amf/RePEc/ contains metadata in the AMF format. The file repec.tar.xz in the repository contains a mirror of selected parts of this tree.
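
To get a feel for what the mirror contains before building the database, the archive can be listed from Python's standard library. This is only a sketch: the helper name `list_members` and the `limit` parameter are made up for illustration, and nothing here is taken from the repository's own scripts.

```python
import tarfile


def list_members(archive_path, limit=10):
    """Return the names of up to `limit` regular files in an xz-compressed tar.

    Hypothetical helper for illustration; not part of this repository.
    """
    names = []
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive:
            if member.isfile():  # skip directories and links
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names
```

Calling `list_members("repec.tar.xz")` from the repository root should return the first few file paths in the mirror.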

Running

The clustering code expects the data to live in a SQLite database, repec.sqlite. To create this database, run:

$ ./readrepec.py repec.tar.xz repec.sqlite
$ ./add_jnames.py repec.sqlite
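
Once both scripts have finished, a quick sanity check is to see which tables exist and how many rows each holds. The sketch below deliberately assumes nothing about the schema: it reads the table names from SQLite's own `sqlite_master` catalogue. The function name `table_row_counts` is hypothetical.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in the database.

    Hypothetical helper for illustration; it does not rely on any
    particular table or column names.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names come from the catalogue itself, so quoting them is enough.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

For example, `table_row_counts("repec.sqlite")` gives a dictionary that makes an empty or missing table easy to spot.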

The actual fitting of the LDA model is done by the IPython notebook classify.ipynb. When run, it creates the HTML files index.html and cluster[nn].html, which contain links to the clustered articles.
