A common phenomenon in LDA training is that the first several iterations are very costly. This is largely due to the uniformly random initialization: the word-topic counts, and hence the doc-topic counts, start out quite dense, so the sampler cannot exploit any sparsity early on.
There are two approaches to mitigate this:
- Sparse initialization: constrain each word to a small random subset of all topics (e.g., 1%), and for each token of that word, sample its initial topic from that restricted subset rather than from all topics (see the first sketch after this list).
- Warm start: first train several iterations on a small part of the corpus (e.g., 1%) to initialize the word-topic distribution, which should be much sparser than a uniformly random initialization (see the second sketch after this list).
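A minimal sketch of the first approach, assuming a collapsed Gibbs sampler that keeps word-topic and doc-topic count matrices; all names and sizes (`docs`, `allowed`, `vocab_size`, `n_topics`) are illustrative toy values, not a real API:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, n_topics = 1_000, 200
docs = [[3, 17, 42, 17], [8, 3, 99]]  # toy corpus: lists of word ids

# Restrict each word to a random ~1% of all topics.
topics_per_word = max(1, n_topics // 100)
allowed = {w: rng.choice(n_topics, size=topics_per_word, replace=False)
           for w in {w for doc in docs for w in doc}}

word_topic = np.zeros((vocab_size, n_topics), dtype=np.int32)
doc_topic = np.zeros((len(docs), n_topics), dtype=np.int32)
assignments = []

for d, doc in enumerate(docs):
    z_doc = []
    for w in doc:
        # Sample the initial topic only from the word's restricted
        # subset, so each row of word_topic starts out sparse.
        z = int(rng.choice(allowed[w]))
        z_doc.append(z)
        word_topic[w, z] += 1
        doc_topic[d, z] += 1
    assignments.append(z_doc)
```

After this initialization each row of `word_topic` has at most `topics_per_word` nonzero entries, so a sparsity-aware sampler benefits from the very first iteration; ordinary Gibbs sampling is still free to move tokens to any topic later.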
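And a sketch of the second approach under the same assumptions: run a few plain collapsed-Gibbs sweeps over a small random slice of the corpus, then reuse the resulting word-topic counts to seed the full run. `gibbs_sweep` here is a deliberately naive reference implementation, not a production sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, K, ALPHA, BETA = 1_000, 200, 0.1, 0.01  # toy sizes and priors

def init_counts(docs):
    """Uniformly random topic assignments plus count matrices."""
    word_topic = np.zeros((VOCAB, K), dtype=np.int32)
    doc_topic = np.zeros((len(docs), K), dtype=np.int32)
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            word_topic[w, z[d][i]] += 1
            doc_topic[d, z[d][i]] += 1
    return word_topic, doc_topic, z

def gibbs_sweep(docs, word_topic, doc_topic, z):
    """One collapsed-Gibbs pass over every token."""
    totals = word_topic.sum(axis=0)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]  # remove the token's current assignment
            word_topic[w, k] -= 1; doc_topic[d, k] -= 1; totals[k] -= 1
            p = (doc_topic[d] + ALPHA) * (word_topic[w] + BETA) / (totals + BETA * VOCAB)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k  # record the new assignment
            word_topic[w, k] += 1; doc_topic[d, k] += 1; totals[k] += 1

docs = [[3, 17, 42, 17], [8, 3, 99], [42, 8, 8]]  # toy corpus
# Warm start: several sweeps over a small slice (~1% in spirit;
# with three toy docs we just take one).
subset = [docs[i] for i in
          rng.choice(len(docs), size=max(1, len(docs) // 100), replace=False)]
wt, dt, z = init_counts(subset)
for _ in range(5):
    gibbs_sweep(subset, wt, dt, z)
# wt is now much sparser than a uniform random init; use it (with
# smoothing) as the starting word-topic table for the full corpus.
```

Note that only the word-topic table transfers to the full run; doc-topic counts are document-specific and are re-initialized for the full corpus.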