πŸ“¦ Module 1: Data β€” The Fuel for LLMsΒ #3

@malibayram

Description

This module focuses on understanding text data, tokenization, and preparing datasets for LLM training.

Tasks to Complete:

  • Lesson 1.1 β€” Understanding Text & The Role of Tokenization

    • Explain different text units: words, subwords, characters
    • Introduce Byte Pair Encoding (BPE)
    • Demonstrate tokenization concepts with examples
  • Lesson 1.2 β€” Practical Tokenization with tiktoken

    • Install and set up tiktoken
    • Implement encoding/decoding tokens
    • Work with vocabulary size and special tokens
    • Create hands-on tokenization examples
  • Lesson 1.3 β€” Exploring Pretraining Datasets (WikiText / OpenWebText)

    • Download and explore WikiText dataset
    • Analyze dataset structure and content
    • Compare different pretraining datasets
    • Apply data preprocessing techniques
  • Lesson 1.4 β€” Preparing Inputs & Targets

    • Create (input, target) pairs for next-token prediction
    • Implement block_size and batch_size management
    • Handle sequence padding and truncation
  • Lesson 1.5 β€” Efficient Data Handling with PyTorch DataLoader

    • Create custom Dataset class
    • Implement efficient batching logic
    • Set up the DataLoader with appropriate configuration
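The BPE idea from Lesson 1.1 can be sketched in a few lines: repeatedly find the most frequent adjacent pair of tokens and merge it into a single new token. This toy version (my own illustration, not a real tokenizer) learns merges from a single word starting at the character level:

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Learn BPE merges on one word, starting from characters."""
    tokens = list(word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with one merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_merges("banana", 1)
print(tokens)   # ['b', 'an', 'an', 'a'] -- ('a', 'n') was the most frequent pair
```

Real BPE learns merges over a whole corpus and applies the same merge list at encoding time, but the core loop is the same.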
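For Lesson 1.2, the whole encode/decode round trip with tiktoken fits in a few lines (assuming `pip install tiktoken`; the example text is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary

text = "Tokenization turns text into integers."
ids = enc.encode(text)       # str -> list of token ids
print(ids)
print(enc.decode(ids))       # round-trips back to the original string
print(enc.n_vocab)           # vocabulary size (50257 for gpt2)
```

The `decode(encode(text)) == text` round trip is the first sanity check worth showing students before moving on to vocabulary size and special tokens.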
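A typical preprocessing step from Lesson 1.3: raw WikiText marks section headings with `=` delimiters (e.g. ` = = History = = `), so a first cleaning pass drops headings and blank lines. The snippet below uses a made-up excerpt in that style:

```python
# A small made-up excerpt in WikiText's raw format.
raw = """ = Some Article = 

 Paragraph one of the article body .

 = = History = = 

 Paragraph two , inside the History section .
"""

def clean_wikitext(text):
    """Drop blank lines and '=' delimited section headings."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith("=") and stripped.endswith("="):
            continue  # section heading
        lines.append(stripped)
    return lines

for line in clean_wikitext(raw):
    print(line)
```

Whether you keep or drop headings is a modeling choice; the point is that every pretraining dataset has format quirks worth inspecting before tokenization.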
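The (input, target) construction in Lesson 1.4 reduces to a one-position shift: the target sequence is the input sequence moved one token to the right. A minimal sketch over a plain Python list of token ids:

```python
def make_pairs(tokens, block_size):
    """Slice a token stream into (input, target) pairs for
    next-token prediction; target = input shifted right by one."""
    pairs = []
    for i in range(0, len(tokens) - block_size, block_size):
        x = tokens[i : i + block_size]          # model input
        y = tokens[i + 1 : i + 1 + block_size]  # next token at each position
        pairs.append((x, y))
    return pairs

tokens = list(range(10))  # stand-in for real token ids
for x, y in make_pairs(tokens, block_size=4):
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Here non-overlapping blocks are used for simplicity; a stride smaller than `block_size` would give overlapping windows instead.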
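For Lesson 1.5, the same shift-by-one logic moves into a custom `torch.utils.data.Dataset`, and `DataLoader` handles shuffling and batching. A minimal sketch (the class name and toy token stream are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    """Serves fixed-length (input, target) windows from one long token stream."""
    def __init__(self, tokens, block_size):
        self.data = torch.tensor(tokens, dtype=torch.long)
        self.block_size = block_size

    def __len__(self):
        # Every start index that leaves room for a full (x, y) pair.
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx : idx + self.block_size]
        y = self.data[idx + 1 : idx + 1 + self.block_size]
        return x, y

ds = TokenDataset(list(range(100)), block_size=8)
dl = DataLoader(ds, batch_size=4, shuffle=True, drop_last=True)
xb, yb = next(iter(dl))
print(xb.shape, yb.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

`drop_last=True` keeps every batch the same shape, which simplifies the training loop; for real corpora you would feed the dataset tokenized ids rather than `range(100)`.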

Deliverables:

  • 5 video lectures (~25 minutes each)
  • Tokenization notebook with examples
  • Dataset preparation notebook
  • Custom Dataset and DataLoader implementation
  • Module quiz

Resources:

  • tiktoken documentation
  • WikiText dataset
  • PyTorch DataLoader tutorials
