Hello,
I’m using BiTerm to analyse real-time chat messages. One question that has come up a couple of times is whether I should remove duplicate words before running the analysis in BiTerm.
For example, I have a corpus of text containing 158 words. If I remove the duplicate words, 111 words remain. When I perform topic analysis in BiTerm on this deduplicated text, the model returns n(W): 111 terms.
If I use the same corpus and perform the analysis without removing the duplicate words, BiTerm again outputs a model with n(W): 111.
What I would like to understand is whether there is any penalty in removing duplicate words prior to analysis, given that BiTerm appears not to include the duplicate terms in its output. Or should I keep the duplicate words in the text for the BiTerm analysis?
Does BiTerm use the duplicate terms as part of the modelling process but return only the distinct/unique terms as part of the model output?
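To make the question concrete, here is a minimal sketch in plain Python. It does not call BiTerm or reproduce its internals; it only assumes that a biterm is an unordered pair of words co-occurring in one short message, and that "removing duplicates" means keeping the first occurrence of each word within a message. The example message and the helper names (`vocabulary`, `biterms`, `dedupe`) are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def vocabulary(tokens):
    """Distinct terms in a message (what a count like n(W) would report)."""
    return set(tokens)


def biterms(tokens):
    """Count unordered word pairs over all pairs of positions in one message.

    Assumption: the whole message is treated as a single co-occurrence window.
    """
    return Counter(tuple(sorted(pair)) for pair in combinations(tokens, 2))


def dedupe(tokens):
    """Drop repeated words, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(tokens))


# Hypothetical chat message with a repeated word.
message = "server down again server restart fixed server".split()
unique = dedupe(message)

vocab_full = vocabulary(message)
vocab_unique = vocabulary(unique)
biterms_full = biterms(message)
biterms_unique = biterms(unique)

print("distinct terms:", len(vocab_full), "vs", len(vocab_unique))
print("total biterms:", sum(biterms_full.values()), "vs", sum(biterms_unique.values()))
print("(down, server):", biterms_full[("down", "server")], "vs", biterms_unique[("down", "server")])
```

In this sketch the number of distinct terms is the same with or without duplicates, while the pair counts differ. Whether BiTerm itself weights repeated words this way is exactly what I am asking.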
Thanks
Jonathan