Gibbs sampling algorithms for several topic models, such as LDA, AT, and coAT.
Topic models are a family of generative probabilistic models for discovering the main themes in a collection of documents. For a more detailed survey, we refer readers to [1]. Examples of topic models include Latent Dirichlet Allocation (LDA) [2][3][4], the Author-Topic (AT) model [5][6][7], the co-Author-Topic (coAT) model [8], and many others.
Exact inference for topic models is usually intractable. A variety of approximate inference algorithms have appeared in recent years, such as stochastic variational inference, mean-field variational methods, expectation propagation, and Markov chain Monte Carlo (MCMC) sampling. This toolbox uses Gibbs sampling, a special case of MCMC, since it provides a simple method for obtaining parameter estimates under Dirichlet priors and allows estimates from several local maxima of the posterior distribution to be combined.
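As a rough illustration of the idea, the sketch below runs a collapsed Gibbs sampler for LDA in the style described in [3]: each token's topic is resampled from its full conditional given all other assignments. It is not the code shipped in this toolbox. The class name LdaGibbsSketch, its methods and fields, the symmetric priors alpha and beta, and the toy corpus are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative collapsed Gibbs sampler for LDA; NOT the toolbox's implementation.
// All names here (LdaGibbsSketch, sweep, phi, theta) are hypothetical.
final class LdaGibbsSketch {
    final int numTopics, vocabSize;
    final double alpha, beta;      // symmetric Dirichlet hyperparameters (assumed)
    final int[][] docs;            // docs[d][i]: word id of token i in document d
    final int[][] z;               // z[d][i]: topic currently assigned to that token
    final int[][] nDocTopic;       // nDocTopic[d][k]: tokens in d assigned to topic k
    final int[][] nTopicWord;      // nTopicWord[k][w]: times word w is assigned to k
    final int[] nTopic;            // nTopic[k]: total tokens assigned to topic k
    final Random rng;

    LdaGibbsSketch(int[][] docs, int vocabSize, int numTopics,
                   double alpha, double beta, long seed) {
        this.docs = docs;
        this.vocabSize = vocabSize;
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = beta;
        this.rng = new Random(seed);
        z = new int[docs.length][];
        nDocTopic = new int[docs.length][numTopics];
        nTopicWord = new int[numTopics][vocabSize];
        nTopic = new int[numTopics];
        // Start from a uniformly random topic assignment.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = rng.nextInt(numTopics);
                addCount(d, docs[d][i], z[d][i], +1);
            }
        }
    }

    private void addCount(int d, int w, int k, int delta) {
        nDocTopic[d][k] += delta;
        nTopicWord[k][w] += delta;
        nTopic[k] += delta;
    }

    // One sweep: resample every token's topic from its full conditional,
    // p(z = k | rest) proportional to
    // (nDocTopic[d][k] + alpha) * (nTopicWord[k][w] + beta) / (nTopic[k] + V * beta).
    void sweep() {
        double[] p = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int w = docs[d][i];
                addCount(d, w, z[d][i], -1);   // exclude the current token
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    p[k] = (nDocTopic[d][k] + alpha) * (nTopicWord[k][w] + beta)
                            / (nTopic[k] + vocabSize * beta);
                    total += p[k];
                }
                // Draw a topic from the unnormalized distribution p.
                double u = rng.nextDouble() * total;
                int k = 0;
                double acc = p[0];
                while (u >= acc && k < numTopics - 1) {
                    k++;
                    acc += p[k];
                }
                z[d][i] = k;
                addCount(d, w, k, +1);
            }
        }
    }

    // Topic-word estimates from the current counts: P(word w | topic k).
    double[][] phi() {
        double[][] result = new double[numTopics][vocabSize];
        for (int k = 0; k < numTopics; k++)
            for (int w = 0; w < vocabSize; w++)
                result[k][w] = (nTopicWord[k][w] + beta) / (nTopic[k] + vocabSize * beta);
        return result;
    }

    // Document-topic estimates from the current counts: P(topic k | document d).
    double[][] theta() {
        double[][] result = new double[docs.length][numTopics];
        for (int d = 0; d < docs.length; d++)
            for (int k = 0; k < numTopics; k++)
                result[d][k] = (nDocTopic[d][k] + alpha) / (docs[d].length + numTopics * alpha);
        return result;
    }

    public static void main(String[] args) {
        // Toy corpus: 3 documents over a 4-word vocabulary (made up for this sketch).
        int[][] docs = { {0, 0, 1, 1, 0}, {2, 3, 3, 2, 2}, {0, 1, 2, 3} };
        LdaGibbsSketch lda = new LdaGibbsSketch(docs, 4, 2, 0.5, 0.1, 42L);
        for (int it = 0; it < 200; it++) lda.sweep();
        for (double[] row : lda.theta()) System.out.println(Arrays.toString(row));
    }
}
```

The estimates here are read off a single sample after a fixed number of sweeps. In practice one would typically discard burn-in sweeps and may average over several samples.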
Suggestions, comments, and bug reports are highly appreciated.
Code (c) 2011 Jacob Eisenstein. Licensed under the Apache License, version 2.0.
Please refer to the sample in data/nips.
* LDA: Please refer to LDA.java in the package cn/edu/bjut/ui, and to LDA.properties for the parameter settings.
* AT: Please refer to AT.java in the package cn/edu/bjut/ui, and to AT.properties for the parameter settings.
* coAT: Please refer to coAT.java in the package cn/edu/bjut/ui, and to coAT.properties for the parameter settings.
This toolbox was written by XU, Shuo of Beijing University of Technology. If you find this toolbox useful, please cite GibbsTopicModels as follows:
Xin An, Shuo Xu, Yali Wen, and Mingxing Hu, 2014. A Shared Interest Discovery Model for Coauthor Relationship in SNS. International Journal of Distributed Sensor Networks, Vol. 2014, No. 820715, pp. 1-9.
For any questions, please contact XU, Shuo at xushuo@bjut.edu.cn or pzczxs@gmail.com.
[1] David M. Blei, 2012. Introduction to Probabilistic Topic Models. Communications of the ACM, Vol. 55, No. 4, pp. 77-84.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, Vol. 3, No. Jan, pp. 993-1022.
[3] Thomas L. Griffiths and Mark Steyvers, 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences of the United States of America, Vol. 101, No. Suppl, pp. 5228-5235.
[4] Gregor Heinrich, 2009. Parameter Estimation for Text Analysis. Technical Report Version 2.9. vsonix GmbH and University of Leipzig.
[5] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth, 2004. The Author-Topic Model for Authors and Documents. Proceedings of the 20th International Conference on Uncertainty in Artificial Intelligence, pp. 487-494.
[6] Mark Steyvers, Padhraic Smyth, and Thomas Griffiths, 2004. Probabilistic Author-Topic Models for Information Discovery. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 306-315.
[7] Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, and Mark Steyvers, 2010. Learning Author-Topic Models from Text Corpora. ACM Transactions on Information Systems, Vol. 28, No. 1, pp. 1-38.
[8] Xin An, Shuo Xu, Yali Wen, and Mingxing Hu, 2014. A Shared Interest Discovery Model for Coauthor Relationship in SNS. International Journal of Distributed Sensor Networks, Vol. 2014, No. 820715, pp. 1-9.