Module 1: Data - The Fuel for LLMs
This module focuses on understanding text data, tokenization, and preparing datasets for LLM training.
Resources:
- tiktoken documentation
- WikiText dataset
- PyTorch DataLoader tutorials
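As a rough illustration of the input/target preparation that Lesson 1.4 covers, the sketch below cuts a token stream into windows of `block_size` tokens and shifts each window by one position to get next-token targets. It uses only the standard library. The function name `make_batch` and the stand-in integer IDs are hypothetical and not part of the course material; in practice the IDs would come from a tokenizer such as tiktoken, and batching would typically be handled by a PyTorch DataLoader.

```python
import random


def make_batch(token_ids, block_size, batch_size, rng):
    """Sample `batch_size` (input, target) pairs, each `block_size` tokens long.

    Hypothetical helper for illustration: the target sequence is the input
    sequence shifted one position to the right (next-token prediction).
    """
    # Each example needs block_size + 1 consecutive tokens.
    max_start = len(token_ids) - block_size - 1
    if max_start < 0:
        raise ValueError("token stream is shorter than block_size + 1")

    inputs, targets = [], []
    for _ in range(batch_size):
        start = rng.randint(0, max_start)  # randint is inclusive on both ends
        chunk = token_ids[start:start + block_size + 1]
        inputs.append(chunk[:-1])   # tokens at positions 0 .. block_size-1
        targets.append(chunk[1:])   # tokens at positions 1 .. block_size
    return inputs, targets


# Stand-in token IDs (assumption): consecutive integers instead of real tokenizer output.
token_ids = list(range(100, 120))
x, y = make_batch(token_ids, block_size=4, batch_size=2, rng=random.Random(0))
```

Here `block_size` sets how much context each example carries and `batch_size` sets how many examples are processed together. Sampling random start offsets is one common choice; stepping through the stream in fixed strides is another.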
Tasks to Complete:
- Lesson 1.1 — Understanding Text & The Role of Tokenization
- Lesson 1.2 — Practical Tokenization with tiktoken
- Lesson 1.3 — Exploring Pretraining Datasets (WikiText / OpenWebText)
- Lesson 1.4 — Preparing Inputs & Targets (block_size and batch_size management)
- Lesson 1.5 — Efficient Data Handling with PyTorch DataLoader
Deliverables: