Fix Unicode processing + `nbsp` support

@Lol4t0 Lol4t0 released this 12 Jan 09:22
· 52 commits to develop since this release
  • Because STOP_WORDS are stored as Unicode strings, candidate words must also be kept as Unicode so that they can be compared against the dictionary correctly.
  • In some languages (Russian is one example), short stop words are joined to the next word in the sentence with a non-breaking space (nbsp). To recognize those stop words, the tokenizer must treat nbsp as a word separator. This fixes grangier#223.
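
The two changes above can be illustrated with a minimal Python sketch. It is not the library's actual code: the stop-word set and the function names `to_text`, `candidate_words`, and `count_stop_words` are hypothetical, chosen only to show decoding to Unicode before comparison and treating U+00A0 as a separator.

```python
import re

# Hypothetical stop-word set for illustration only; not the library's real list.
STOP_WORDS = {"в", "на", "и"}

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def to_text(value, encoding="utf-8"):
    """Return a Unicode string, decoding bytes if necessary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def candidate_words(value):
    """Split text into lowercase Unicode word candidates.

    The nbsp is replaced explicitly so the behaviour does not depend on
    how the regex engine classifies U+00A0.
    """
    text = to_text(value).replace(NBSP, " ")
    return [word.lower() for word in re.split(r"\s+", text) if word]


def count_stop_words(value):
    """Count candidates that appear in the Unicode stop-word set."""
    return sum(1 for word in candidate_words(value) if word in STOP_WORDS)
```

As a usage example, `count_stop_words("Он живёт в\u00a0Москве и работает на\u00a0заводе")` counts the stop words "в", "и", and "на", even though two of them are attached to the following word by a non-breaking space. The same input passed as UTF-8 bytes gives the same count, because it is decoded before comparison.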