llm-from-scratch

Build a Large Language Model from Scratch
This repository contains an implementation of GPT-2, built and trained from scratch on the TinyStories dataset, a collection of short, simple stories well suited to small-scale language-modeling experiments.

Project Structure

  • model.py
    Contains the GPT-2 model architecture.

  • tokenizer.py
    Implements a custom tokenizer using SentencePiece. This file includes:

    • The CustomTokenizer class which:
      • Initializes the tokenizer with a model prefix.
      • Trains the tokenizer on a subset of JSON files from the TinyStories dataset.
      • Loads the trained model.
      • Provides methods to encode text into token IDs and decode token IDs back into text (a minimal usage sketch appears after this list).
    • Utility functions:
      • process_dataset: Processes all JSON files in a data folder, tokenizes the stories, and saves them as binary files.
      • process_data: A helper function to train the tokenizer and process the dataset.
  • train.py
    The training script that uses the tokenized data to train the GPT-2 model from scratch.

  • text_generation.py
    Contains code for generating text with the trained model. The script implements three sampling techniques (temperature scaling, top-k sampling, and top-p/nucleus sampling) for flexible text generation.
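
Since tokenizer.py wraps SentencePiece, its train/encode/decode flow can be pictured with plain SentencePiece calls. The sketch below is illustrative only: the corpus file, model prefix, vocabulary size, and model type are placeholders, not the script's actual settings.

    import sentencepiece as spm

    # Train a SentencePiece model on a plain-text corpus extracted from the
    # dataset. Every setting below is illustrative, not the repository's
    # actual configuration.
    spm.SentencePieceTrainer.train(
        input="stories.txt",       # assumed: one story per line
        model_prefix="tokenizer",  # writes tokenizer.model / tokenizer.vocab
        vocab_size=8000,
        model_type="bpe",
    )

    # Load the trained model and round-trip a prompt.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    ids = sp.encode("Once upon a time", out_type=int)
    print(ids)             # list of token IDs
    print(sp.decode(ids))  # "Once upon a time"

The CustomTokenizer class adds project-specific conveniences (training on the JSON story files, saving tokenized output as binaries) on top of this core loop.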

How to Get Started

  1. Clone the Repository

    git clone https://github.com/SchoolofAI-Algiers/llm-from-scratch.git
    cd llm-from-scratch
  2. Install Dependencies
    Ensure you have Python 3.8+ installed, then install the required packages:

    pip install -r requirements.txt
  3. Prepare the Dataset
    Download the TinyStories dataset and place its JSON files in the data/ folder.

  4. Train the Tokenizer and Process the Data
    Run the tokenizer script to train the SentencePiece model and process the dataset:

    python tokenizer.py
  5. Train the Model
    Use the training script to train your GPT-2 model from scratch:

    python train.py
  6. Generate Text
    After training, generate text with a command like the following (a sketch of how the sampling flags work appears after these steps):

    python text_generation.py --prompt "Once upon a time" --max_length 500 --temperature 0.8 --top_k 40 --top_p 0.9 --weights ./weights/gpt_weights.pt
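
The --temperature, --top_k, and --top_p flags correspond to the three sampling techniques listed for text_generation.py. The sketch below shows one common way to combine them over a single logits vector; the function name and defaults are assumptions for illustration, not the repository's exact implementation.

    import torch

    def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.9):
        """Sample a token id from a 1-D logits tensor of shape (vocab_size,)."""
        # Temperature scaling: values < 1 sharpen the distribution,
        # values > 1 flatten it.
        logits = logits / temperature

        # Top-k: keep only the k highest-scoring tokens.
        if top_k is not None:
            kth_value = torch.topk(logits, top_k).values[-1]
            logits = logits.masked_fill(logits < kth_value, float("-inf"))

        # Top-p (nucleus): keep the smallest set of tokens whose cumulative
        # probability exceeds p, and mask out the rest.
        if top_p is not None:
            sorted_logits, sorted_idx = torch.sort(logits, descending=True)
            probs = torch.softmax(sorted_logits, dim=-1)
            cumulative = torch.cumsum(probs, dim=-1)
            # Mask tokens whose cumulative probability, excluding themselves,
            # already exceeds p.
            mask = cumulative - probs > top_p
            sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
            logits = torch.full_like(logits, float("-inf")).scatter(
                0, sorted_idx, sorted_logits
            )

        # Draw one token from the filtered distribution.
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()

With top_k=None and top_p=None this reduces to plain temperature sampling.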

Weights and Model Sharing

The trained model weights will be made available on our Hugging Face account soon; check there for updates and downloads.
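
Once a checkpoint is available (locally from train.py, or downloaded), loading it should look roughly like this; the GPT class name and its constructor are assumptions, since they depend on model.py:

    import torch
    from model import GPT  # class name assumed; check model.py for the real one

    model = GPT()  # constructor arguments depend on model.py's actual config
    state_dict = torch.load("./weights/gpt_weights.pt", map_location="cpu")
    model.load_state_dict(state_dict)
    model.eval()  # disable dropout for generation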

Contributors

  • Zaineb Boukhetala
  • Kamel Brouthen
  • Meriem Bouyahiaoui
  • Ouail Mohammed Oucherif
  • Lina Amdirt
  • Nesrine Abdelhak
  • Nada Djedjig
  • Larbi Saidchikh
