From a8792396058cb8d8897b65fd82f2bd21fd141e63 Mon Sep 17 00:00:00 2001
From: sophiewax
Date: Mon, 10 Mar 2025 22:13:35 -0400
Subject: [PATCH] Sophie HW files & tokenization

Most of the errors I encountered involve punctuation. The tokenizer
treats punctuation marks as separate tokens, so "Achilles!" is split
into "Achilles" and "!". Hyphenated words and words containing
apostrophes may also be split in unexpected ways. Words that differ
only in capitalization are treated as distinct tokens. Line breaks,
footnotes, and blank lines between stanzas may also interfere with
tokenization.

To refine the tokenization process, all text should be converted to
lowercase to avoid case differences, and all punctuation should be
removed.
---
 Sophie HW | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
 create mode 100644 Sophie HW

diff --git a/Sophie HW b/Sophie HW
new file mode 100644
index 0000000..e1be084
--- /dev/null
+++ b/Sophie HW
@@ -0,0 +1,16 @@
+import nltk
+from nltk.tokenize import word_tokenize
+
+# download resources
+nltk.download('punkt')
+
+# Read file
+with open("book_9.txt", "r", encoding="utf-8") as file:
+    text = file.read()
+
+# Tokenize the text
+tokens = word_tokenize(text)
+
+# Print tokenized text
+print(tokens)
+
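
The lowercasing and punctuation removal proposed in the commit message could be sketched roughly as follows. This is a minimal illustration, not part of the patch: the helper name `normalize_tokens` and the sample token list are hypothetical, and the function only assumes it receives a list of strings shaped like the output of `word_tokenize`. It uses `string.punctuation`, which covers ASCII punctuation only, so curly quotes or long dashes in the source text would need separate handling.

```python
import string


def normalize_tokens(tokens):
    """Lowercase each token and drop tokens made up only of ASCII punctuation."""
    punctuation = set(string.punctuation)
    normalized = []
    for token in tokens:
        lowered = token.lower()
        # Skip tokens such as "!" or "," whose characters are all ASCII punctuation.
        if all(ch in punctuation for ch in lowered):
            continue
        normalized.append(lowered)
    return normalized


# Hypothetical sample, shaped like the list returned by word_tokenize(text).
sample = ["Achilles", "!", "The", "wrath", "of", "achilles", ","]
print(normalize_tokens(sample))
```

With this step applied after tokenization, "Achilles" and "achilles" would count as the same token, and standalone marks such as "!" would no longer appear in the list.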