diff --git a/PROGRAM_FLOW.md b/PROGRAM_FLOW.md
new file mode 100644
index 0000000..81cadd6
--- /dev/null
+++ b/PROGRAM_FLOW.md
@@ -0,0 +1,45 @@
+# Program Flow Overview
+
+This document outlines the high-level workflow of the project and shows how the main modules interact. The diagram summarizes the preprocessing, training and generation stages.
+
+```mermaid
+flowchart TD
+    subgraph Preprocessing
+        A1([Raw text files]) --> A2(train_bpe.py)
+        A2 --> A3[Trained BytePairTokenizer]
+        A1 --> A4(tokenize_dataset.py)
+        A4 --> A5[Tokenized dataset]
+    end
+
+    subgraph Training
+        B1[TokenIDDataset / TokenIDSubset] --> B2[DataLoader]
+        B2 --> B3[Trainer]
+        B3 --> B4[GPT Model]
+        B3 --> B5[Checkpoint]
+    end
+
+    subgraph Generation
+        C1[Checkpoint] --> C2[GPT Model]
+        C2 --> C3[Sequencer]
+        C3 --> C4(Output text)
+    end
+
+    A3 -->|used by| B1
+    A5 -->|used by| B1
+    B5 -->|loads| C2
+```
+
+## Stage descriptions
+
+- **Preprocessing**
+  - `train_bpe.py` trains a `BytePairTokenizer` from raw text files.
+  - `tokenize_dataset.py` uses the trained tokenizer to convert the corpus into sequences of token IDs.
+
+- **Training**
+  - `TokenIDDataset`/`TokenIDSubset` create iterable datasets from the tokenized files.
+  - `DataLoader` feeds batches to the `Trainer`.
+  - `Trainer` runs the training loop on the `GPT` model, saving checkpoints for later use.
+
+- **Generation**
+  - The saved checkpoint is loaded into the `GPT` model.
+  - The `Sequencer` performs autoregressive decoding to produce the final text output.
diff --git a/README.md b/README.md
index d75bc51..d26fc81 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ The model's performance was particularly notable in tasks requiring contextual u
 ## Implementation 🛠️
 This repository provides implementation details and resources for GPT-1. Users can utilize this model for various NLP tasks, adapting it to specific requirements and datasets.
 
-Additional documentation is available in [MODEL.md](model/MODEL.md) and [DATASET.md](model/DATASET.md).
+Additional documentation is available in [MODEL.md](model/MODEL.md), [DATASET.md](model/DATASET.md), and [PROGRAM_FLOW.md](PROGRAM_FLOW.md).
 
 ### Getting Started ⚡
 See [SETUP.md](SETUP.md) for environment setup and training instructions. Once dependencies are installed you can pretrain the model with:
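The Preprocessing stage in the new PROGRAM_FLOW.md centers on training a byte-pair tokenizer. One merge step of textbook BPE can be sketched as follows; this is an illustrative stand-in, not the repository's actual `BytePairTokenizer` or `train_bpe.py` code:

```python
# One BPE training step: count adjacent symbol pairs across the corpus and
# merge the most frequent pair into a single new symbol. A train_bpe.py-style
# script would repeat this until the desired vocabulary size is reached.
from collections import Counter

def most_frequent_pair(words):
    """words: list of symbol sequences; returns the most common adjacent pair."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)   # ('l', 'o') occurs in all three words
corpus = merge_pair(corpus, pair)   # e.g. ['lo', 'w'] for the word "low"
```

Repeating these two calls yields the ordered merge list that the trained tokenizer later replays when `tokenize_dataset.py` converts the corpus into token IDs.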
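The Generation stage amounts to a standard greedy autoregressive loop: feed the sequence so far to the model, append the most likely next token, and repeat. A minimal self-contained sketch follows; `toy_model` and its token arithmetic are invented for illustration and stand in for the checkpointed `GPT` model and `Sequencer`:

```python
# Greedy autoregressive decoding, the pattern a Sequencer-style class performs.
# toy_model is a deterministic stand-in: it assigns the highest logit to
# (last_token + 1) mod vocab_size, so the "generated" text is easy to predict.

def toy_model(ids):
    """Return fake next-token logits for a 16-token vocabulary."""
    vocab_size = 16
    last = ids[-1]
    return [1.0 if tok == (last + 1) % vocab_size else 0.0
            for tok in range(vocab_size)]

def generate(model, prompt_ids, max_new_tokens=5):
    """Repeatedly append the argmax next token to the running sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                                  # score every token
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)                                  # extend the context
    return ids

print(generate(toy_model, [1, 2, 3]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

A real model would return a logits tensor per position and might sample rather than take the argmax, but the loop structure is the same.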