Do duplicate words have any effect on the terms output? #16

@flop71

Hello,

I’m using BiTerm to analyse real-time chat messages. One question that has come up a couple of times is whether I should remove duplicate words before running the analysis in BiTerm.

For example, I have a corpus of text containing 158 words. If I remove the duplicate words, 111 words remain. When I run topic analysis on this de-duplicated text, BiTerm returns a model with n(W): 111 terms.

If I run the analysis on the same corpus without removing the duplicate words (all 158 words), BiTerm also returns a model with n(W): 111.

What I would like to understand is whether there is any penalty in removing duplicate words prior to analysis, given that BiTerm does not appear to reflect the duplicate terms in its output. Or should I keep the duplicate words in the text I pass to BiTerm?

Does BiTerm use the duplicate terms as part of the modelling process, but only return the distinct/unique terms as part of the model output?
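
To make the question concrete, here is a minimal Python sketch of what I suspect might be happening. It is not BiTerm's actual code: it assumes a model that counts unordered word pairs co-occurring within each message, and the helper names (`vocabulary`, `pair_counts`, `drop_duplicates`) and the toy messages are made up purely for illustration.

```python
from collections import Counter
from itertools import combinations


def vocabulary(messages):
    """Distinct terms across all messages (what I assume n(W) reports)."""
    return {word for msg in messages for word in msg}


def pair_counts(messages):
    """Count unordered word pairs that co-occur within the same message."""
    counts = Counter()
    for msg in messages:
        for a, b in combinations(msg, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def drop_duplicates(messages):
    """Keep only the first occurrence of each word across the whole corpus."""
    seen = set()
    result = []
    for msg in messages:
        kept = []
        for word in msg:
            if word not in seen:
                seen.add(word)
                kept.append(word)
        result.append(kept)
    return result


# Hypothetical toy corpus: two short chat messages, already tokenised.
messages = [
    ["server", "down", "again"],
    ["server", "restart", "fixed", "server"],
]
deduped = drop_duplicates(messages)

print(len(vocabulary(messages)), len(vocabulary(deduped)))
print(sum(pair_counts(messages).values()), sum(pair_counts(deduped).values()))
```

In this toy case the vocabulary has 5 terms either way, but the number of co-occurring pairs drops from 9 to 4 once the duplicates are removed. Whether BiTerm's fitted topics are affected in a similar way is what I am asking.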

Thanks
Jonathan
