Warning: Careful using a custom tokenizer... #122

I tried the spaCy tokenizer, nltk `word_tokenize`, `sacremoses` `MosesTokenizer`, nltk `TreebankWordTokenizer`, and nltk `TweetTokenizer`.

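For the example `"inch BBL, unquote, cost $29.95."`, they all output `['inch', 'BBL', ',', 'unquote', ',', 'cost', '$', '29.95', '.']`. This tokenization is incompatible with `normalise`: given these tokens, it predicts `"inch B B L, unquote, cost $twenty nine point nine five."`, apparently because `$` and `29.95` arrive as separate tokens, so the amount is read as a plain decimal instead of a price.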
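To make the problem concrete, here is a minimal standard-library sketch. `split_tokenize` is a hypothetical stand-in that only mimics the split shown above (it is not any of the tokenizers I listed), and `rejoin_currency` is a hypothetical post-processing helper that glues a currency symbol back onto the number after it. I have not verified that `normalise` then reads the merged token as a price, so treat it as a possible workaround, not a fix.

```python
import re

# Hypothetical stand-in for the tokenizers above: splits off punctuation
# and currency symbols, but keeps decimals such as 29.95 in one piece.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def split_tokenize(text):
    """Return tokens in the same shape as the output quoted above."""
    return TOKEN_RE.findall(text)


def rejoin_currency(tokens, symbols=("$", "£", "€")):
    """Merge a currency symbol with the numeric token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in symbols and nxt[:1].isdigit():
            merged.append(tok + nxt)  # e.g. "$" + "29.95" -> "$29.95"
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged


text = "inch BBL, unquote, cost $29.95."
tokens = split_tokenize(text)     # "$" and "29.95" are separate tokens
fixed = rejoin_currency(tokens)   # "$29.95" is a single token again
print(tokens)
print(fixed)
```

The same helper could be applied to the output of any of the tokenizers above before the tokens are handed to `normalise`.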


