
'narrow no-break space' ("\u202f") is not recognized as a word boundary #78

@LBeaudoux

Description


Unlike the 'no-break space' ("\u00A0"), the 'narrow no-break space' ("\u202f") is not recognized as a word boundary.

tokenize("La vois-tu souvent ?", "fr")
returns ['la', 'vois', 'tu', 'souvent\u202f'] instead of ['la', 'vois', 'tu', 'souvent']
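
For reference, the same comparison as a runnable snippet (assuming the library's top-level tokenize; the wordfreq import path here is an assumption):

```python
from wordfreq import tokenize  # assumed import path

# U+00A0 (no-break space) acts as a word boundary, as expected:
print(tokenize("La vois-tu souvent\u00a0?", "fr"))
# -> ['la', 'vois', 'tu', 'souvent']

# U+202F (narrow no-break space) does not, so it stays attached
# to the preceding token:
print(tokenize("La vois-tu souvent\u202f?", "fr"))
# -> ['la', 'vois', 'tu', 'souvent\u202f']
```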

This is a problem because in French typography, punctuation marks such as ; : ! ? must be preceded by a non-breaking space (ideally a narrow one).

I suppose one solution would be to modify "TOKEN_RE" in the "tokens" module to take this case into account, unless of course that would create undesirable side effects in other languages. Another solution could be to replace "\u202f" with "\u00A0" when preprocessing French texts, as sketched below.
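
A minimal sketch of that preprocessing workaround (tokenize_fr is a hypothetical helper name; the wordfreq import path is again an assumption):

```python
from wordfreq import tokenize  # assumed import path

def tokenize_fr(text):
    # Hypothetical wrapper: map the narrow no-break space (U+202F)
    # to the ordinary no-break space (U+00A0), which the tokenizer
    # already treats as a word boundary, then tokenize as French.
    return tokenize(text.replace("\u202f", "\u00a0"), "fr")

print(tokenize_fr("La vois-tu souvent\u202f?"))
# -> ['la', 'vois', 'tu', 'souvent']
```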

In any case, thank you for sharing this library, which is essential to me for identifying the rarest words in a text.
