
'narrow no-break space' ("\u202f") is not recognized as a word boundary #78

@LBeaudoux

Description


Unlike the 'no-break space' ("\u00A0"), the 'narrow no-break space' ("\u202f") is not recognized as a word boundary.

tokenize("La vois-tu souvent ?", "fr")
returns ['la', 'vois', 'tu', 'souvent\u202f'] instead of ['la', 'vois', 'tu', 'souvent']
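
For reference, the same comparison as a runnable snippet (assuming the library's top-level tokenize; the wordfreq import path here is an assumption):

```python
from wordfreq import tokenize  # assumed import path

# U+00A0 (no-break space) acts as a word boundary, as expected:
print(tokenize("La vois-tu souvent\u00a0?", "fr"))
# -> ['la', 'vois', 'tu', 'souvent']

# U+202F (narrow no-break space) does not, so it stays attached
# to the preceding token:
print(tokenize("La vois-tu souvent\u202f?", "fr"))
# -> ['la', 'vois', 'tu', 'souvent\u202f']
```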

This is a problem because in French typography, punctuation marks such as ; : ! ? must be preceded by a non-breaking space (ideally a narrow one).

I suppose one solution would be to modify "TOKEN_RE" in the "tokens" module to take this case into account, unless of course that would create undesirable side effects in other languages. Another solution could be to replace "\u202f" with "\u00A0" when preprocessing French texts, as sketched below.
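
A minimal sketch of that preprocessing workaround (tokenize_fr is a hypothetical helper name; the wordfreq import path is again an assumption):

```python
from wordfreq import tokenize  # assumed import path

def tokenize_fr(text):
    # Hypothetical wrapper: map the narrow no-break space (U+202F)
    # to the ordinary no-break space (U+00A0), which the tokenizer
    # already treats as a word boundary, then tokenize as French.
    return tokenize(text.replace("\u202f", "\u00a0"), "fr")

print(tokenize_fr("La vois-tu souvent\u202f?"))
# -> ['la', 'vois', 'tu', 'souvent']
```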

In any case, thank you for sharing this library, which is essential to me for identifying the rarest words in a text.
