Do not double-count >=2-grams  #3

@gregdan3

Description

Currently, in a string like "a a a", the bi-gram "a a" is counted twice, even though it has only one complete, non-overlapping occurrence.
This could be corrected by keeping track of:

  • The last match
  • Whether we are currently overlapping with the last match

Then, whenever the current match overlaps the last match, check whether the two are equal, and skip the current match if they are.
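
The bookkeeping described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual counting code: the function name `count_ngrams`, its parameters, and the assumption that the input is already split into a list of tokens are all hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count n-grams in `tokens`, skipping a match that both overlaps
    and equals the last counted match (hypothetical helper)."""
    counts = Counter()
    last_match = None  # the last n-gram that was counted
    last_start = None  # token index where that match began

    for start in range(len(tokens) - n + 1):
        match = tuple(tokens[start:start + n])

        # The current window shares at least one token with the last
        # counted match if it starts before that match ends.
        overlapping = last_start is not None and start < last_start + n

        if overlapping and match == last_match:
            continue  # same n-gram reusing tokens already counted

        counts[match] += 1
        last_match, last_start = match, start

    return counts


print(count_ngrams("a a a".split(), 2))
print(count_ngrams("a a a a".split(), 2))
```

Under these assumptions, "a a a" yields a single "a a", while "a a a a" yields two, since the second pair shares no tokens with the first. Overlapping matches of different n-grams, such as "a b" and "b c" in "a b c", are still both counted. Because only the last match is remembered, two equal matches that overlap each other but are separated by a different match (for example the 3-gram "a b a" in "a b a b a") would still both be counted.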

It is arguable whether this is a bug in the first place. For example, in the string "toki pona ala", the bi-grams "toki pona" and "pona ala" technically "double count" the occurrence of "pona". I will need to do some Reading The Literature™ to find out.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
