The CH3-01-Creating Embedded Chunks notebook has an extract_doc_text function that currently fails with the following error:
Error Message
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6c3e5f5d-bd8b-4fa7-a80a-a7736e55c8df/lib/python3.10/site-packages/nltk/data.py:579, in find(resource_name, paths)
577 sep = "*" * 70
578 resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 579 raise LookupError(resource_not_found)
The following is the cell that is giving the error.
with open(f"{documents_folder}2303.10130.pdf", mode="rb") as pdf:
    doc = extract_doc_text(pdf.read())
    print(doc)
Below is the fix required:
%pip install nltk
# Or install nltk as a library on the cluster itself.
import nltk
# Download the required NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')
with open(f"{documents_folder}2303.10130.pdf", mode="rb") as pdf:
    doc = extract_doc_text(pdf.read())
    print(doc)
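
If re-downloading on every run is a concern, the fix could be wrapped in a small check that downloads a resource only when the lookup fails. This is only a sketch: `ensure_nltk_resources` and `REQUIRED` are hypothetical names that are not part of the notebook, and the lookup paths `tokenizers/punkt` and `taggers/averaged_perceptron_tagger_eng` are assumed to be where these two packages are installed. The truncated traceback above does not show which resource was actually missing.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK resource only if it is not already available.

    `resources` maps an NLTK lookup path to the package name passed to
    nltk.download(). Returns the list of package names that were downloaded.
    `find` and `download` default to nltk.data.find and nltk.download; they
    are parameters only so the helper can be exercised without network access.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the real NLTK calls are needed
        find = find or nltk.data.find
        download = download or nltk.download

    downloaded = []
    for lookup_path, package in resources.items():
        try:
            find(lookup_path)       # raises LookupError if the resource is missing
        except LookupError:
            download(package)       # fetch the missing resource
            downloaded.append(package)
    return downloaded


# Hypothetical mapping for the two resources named in the fix
# (lookup paths are an assumption, see the note above).
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "taggers/averaged_perceptron_tagger_eng": "averaged_perceptron_tagger_eng",
}
```

In the notebook, calling `ensure_nltk_resources(REQUIRED)` before the cell that calls extract_doc_text would then take the place of the two nltk.download lines.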