Do duplicate words have any effect on the terms output? #16

Hello,

I’m using BiTerm to analyse real-time chat messages. One question that has come up a couple of times is whether I should remove duplicate words before running the analysis in BiTerm.

For example, I have a corpus of text that has 158 words. If I remove the duplicate words, I have 111 words remaining. When I run topic analysis in BiTerm on that text, the model returns n(W): 111 terms.

If I use the same corpus of text and run the analysis without removing the duplicate words, BiTerm again shows an output model with n(W): 111.
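
To make the comparison concrete, here is a minimal Python sketch of the two quantities involved. It is not BiTerm's own code: the helper names (`vocab_size`, `biterm_counts`, `dedupe_within`) and the toy messages are hypothetical, and it assumes that a biterm is an unordered pair of distinct words co-occurring in one message and that duplicates are removed within each message. Under those assumptions the count of distinct terms is the same either way, while the word-pair counts differ.

```python
from collections import Counter
from itertools import combinations


def vocab_size(messages):
    """Number of distinct terms across all messages (analogous to n(W))."""
    return len({tok for msg in messages for tok in msg})


def biterm_counts(messages):
    """Count unordered pairs of distinct words co-occurring in a message.

    Assumption: every pair of token positions in a message contributes one
    biterm, so a repeated word contributes extra pairs.
    """
    counts = Counter()
    for msg in messages:
        for a, b in combinations(msg, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
    return counts


def dedupe_within(msg):
    """Drop repeated words inside one message, keeping first-seen order."""
    return list(dict.fromkeys(msg))


# Hypothetical toy chat messages, already tokenised.
messages = [
    ["server", "down", "server", "restart"],
    ["restart", "fixed", "server"],
]
deduped = [dedupe_within(m) for m in messages]

# The vocabulary is identical with and without duplicates.
print(vocab_size(messages), vocab_size(deduped))  # 4 4

# The pair counts are not: the repeated "server" adds extra pairs.
print(biterm_counts(messages)[("down", "server")])  # 2
print(biterm_counts(deduped)[("down", "server")])   # 1
```

If BiTerm builds its word pairs in a similar way, an unchanged n(W) would not by itself show that the duplicates are ignored during modelling.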

What I would like to understand is whether there is any penalty in removing duplicate words prior to analysis, given that BiTerm appears not to include the duplicate terms in its output. Or should I keep the duplicate words in the text I pass to BiTerm?

Does BiTerm use the duplicate terms as part of the modelling process but only return distinct/unique terms as part of the model output?

Thanks
Jonathan

