Sophie HW files & tokenization by sophiewax · Pull Request #4 · TuftsIntroDH/Files-and-Tokenization-Lab

sophiewax · 2025-03-11T02:13:45Z

Most of the errors I encountered involve punctuation. The tokenizer treats punctuation marks as tokens of their own, so it splits them off from words: for example, "Achilles!" becomes "Achilles" and "!". Hyphenated words and words containing apostrophes may be split in the same way. Words that differ only in capitalization may also be counted as separate tokens. Line breaks, footnotes, and blank lines between stanzas may interfere with tokenization as well. To refine the process, all text should be converted to lowercase to remove case differences, and all punctuation should be removed.
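
The refinement described above could be sketched roughly as follows. This is only an illustration using the Python standard library, not the lab's actual tokenizer; the function name `tokenize` and the sample string are hypothetical.

```python
import string

def tokenize(text):
    """Lowercase the text, delete ASCII punctuation, and split on whitespace."""
    # Lowercase so that "Achilles" and "achilles" count as the same token.
    lowered = text.lower()
    # Delete every character listed in string.punctuation (ASCII only).
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments treats spaces, line breaks, and blank
    # lines between stanzas as a single separator.
    return no_punct.split()

# Hypothetical sample text, not taken from the lab files.
sample = "Achilles! achilles,\n\nswift-footed"
print(tokenize(sample))  # ['achilles', 'achilles', 'swiftfooted']
```

Deleting punctuation outright joins hyphenated words ("swift-footed" becomes "swiftfooted") and drops apostrophes; replacing hyphens with spaces instead would be a reasonable alternative. Curly quotes and other non-ASCII punctuation are not in `string.punctuation` and would need separate handling.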

sophiewax wants to merge 1 commit into TuftsIntroDH:main from sophiewax:patch-1