Tokenization option min_word_length counts length in bytes #5

@isaackd

The min_word_length option should count length in actual "characters", not bytes:
https://github.com/isaackd/wcloud-dev/blob/e368d53dd4d6fb7fcef084ed98225dc54a054a29/src/tokenizer.rs#L46-L48
From https://doc.rust-lang.org/std/primitive.str.html#method.len:

This length is in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.
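
A minimal sketch of the difference, assuming the check at the linked lines compares `word.len()` against `min_word_length`. The helper names `meets_min_length` and `filter_short_words` are hypothetical and not taken from the wcloud source. Counting with `chars().count()` gives the number of Unicode scalar values, which is closer to what a user expects than the byte length:

```rust
/// Hypothetical helper (not from the wcloud source): true when `word` has at
/// least `min_word_length` Unicode scalar values, rather than bytes.
fn meets_min_length(word: &str, min_word_length: usize) -> bool {
    word.chars().count() >= min_word_length
}

/// Hypothetical filter that keeps only the words passing the check above.
fn filter_short_words<'a>(words: &[&'a str], min_word_length: usize) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|w| meets_min_length(w, min_word_length))
        .collect()
}

fn main() {
    // Two CJK characters, each encoded as three bytes in UTF-8.
    let word = "\u{65e5}\u{672c}";
    println!("bytes: {}, chars: {}", word.len(), word.chars().count());

    // A byte-based check with min_word_length = 3 would keep this
    // two-character word; the char-based check drops it.
    println!("byte-based keeps it: {}", word.len() >= 3);
    println!("char-based keeps it: {}", meets_min_length(word, 3));

    let kept = filter_short_words(&[word, "word", "\u{e9}"], 3);
    println!("{:?}", kept);
}
```

Note that `chars()` still counts scalar values, not graphemes, so a letter followed by a combining mark counts as two. Counting graphemes would need a segmentation library, which is an extra dependency and may be more than this option needs.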

Labels: bug (Something isn't working)
