Sophie HW files & tokenization #4

Open
sophiewax wants to merge 1 commit into TuftsIntroDH:main from sophiewax:patch-1

Conversation

@sophiewax

Most of the errors I encountered involve punctuation. The tokenizer treats punctuation marks as separate tokens, so "Achilles!" is split into "Achilles" and "!". Hyphenated words and words containing apostrophes may be split in a similar way. Words that differ only in capitalization may also be treated as separate tokens. Line breaks, footnotes, and blank lines between stanzas may interfere with tokenization as well. To refine the tokenization process, all text should be converted to lowercase to avoid case differences, and all punctuation should be removed.
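
The cleanup described above could be sketched roughly as follows. This is only an illustration using Python's standard library, not the tokenizer used in the homework files; the function name `normalize_and_tokenize` and the sample string are hypothetical.

```python
import string


def normalize_and_tokenize(text):
    """Lowercase the text, delete ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    # The third argument to str.maketrans lists characters to delete.
    # string.punctuation covers ASCII punctuation only (assumption: the
    # text has no curly quotes or other non-ASCII punctuation).
    without_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments collapses runs of spaces, line breaks,
    # and blank lines between stanzas into single separators.
    return without_punct.split()


# Hypothetical sample: mixed case, punctuation, and a blank line.
sample = "Achilles! achilles,\n\nACHILLES"
print(normalize_and_tokenize(sample))
```

One trade-off to keep in mind: deleting every punctuation mark also joins hyphenated words and drops apostrophes (for example "swift-footed" becomes "swiftfooted"), so it may be preferable to replace hyphens with spaces first, depending on how those words should be counted.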

