SparkNet is a from-scratch, GPT-2–style language model project that focuses on training compact (≈70M parameter) causal decoders with modern Hugging Face tooling. The repo contains everything needed to:
- stream and mix public Hugging Face datasets or consume a pre-packed 1B-token corpus
- train custom byte-level tokenizers
- launch full runs with instrumentation (TensorBoard throughput, gradient norm, periodic sample generation)
- export checkpoints that can be reused with standard 🤗 `AutoModelForCausalLM` APIs
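For example, an exported run directory can be reloaded with the stock 🤗 classes. A minimal sketch — the checkpoint path and generation settings are illustrative, not fixed by the repo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_sparknet(ckpt_dir: str):
    """Reload an exported SparkNet checkpoint with standard HF classes."""
    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
    model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
    return tokenizer, model

# Illustrative usage (assumes the run directory exists):
# tokenizer, model = load_sparknet("checkpoints/sparknet-70m-v5")
# ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
# print(tokenizer.decode(model.generate(ids, max_new_tokens=40)[0]))
```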
Special thanks to CodeLion for inspiring the One Billion Token Challenge, and for providing the high-quality datasets used in this training run.
- Linux or WSL2 with Python 3.10+
- NVIDIA GPU with at least 24 GB VRAM (32×1024-token batches with GradAccum=2 fit comfortably)
- CUDA 12.9 runtime and driver (matching the pinned `torch==2.9.0+cu129`)
- Hugging Face Hub credentials (datasets such as `codelion/finepdfs-1B` require auth)
- ~200 GB of free disk for cache + packed dataset + checkpoints
The training scripts enable TF32 kernels and Flash Attention automatically when available.
```bash
git clone https://github.com/<you>/sparknet.git
cd sparknet
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

Recommended environment variables (already set inside the scripts, but exporting them keeps CLI tools consistent):

```bash
export HF_DATASETS_CACHE=~/projects/sparknet/cache
export TOKENIZERS_PARALLELISM=false
```

Log in to the Hugging Face Hub once so the streaming scripts can read gated datasets:

```bash
huggingface-cli login
```

Training is driven by JSON configs under `configs/`. Each config defines:
```json
{
  "seed": 42,
  "context_length": 1024,
  "target_tokens": 1000000000,
  "mix": [
    {"name": "codelion/finepdfs-1B", "split": "train", "prob": 0.5},
    {"name": "json", "data_files": {"train": "data/diener_blog.jsonl"}, "prob": 0.002}
  ]
}
```

- `mix`: list of streaming datasets with sampling probabilities. Use `"json"` for local JSONL files.
- `context_length`: block size fed into the model.
- `target_tokens`: total training budget (used to derive the maximum number of steps).
Update or duplicate a config (e.g., configs/datasets_v5.json) and point the training script to it.
You can either (A) stream data on-the-fly (scripts/train_sparknet.py) or (B) build a static 1B-token dataset once and train on it (scripts/train_sparknet_v5.py). Both pipelines share the same callbacks/logging behavior.
```bash
python scripts/train_tokenizer_v5.py
```

The script uniformly samples lines from the configured datasets, trains a byte-level BPE tokenizer (GPT-2 compatible), and stores it under `tokenizer-v5/`. Edit the sources, `MAX_LINES`, or `VOCAB_SIZE` inside the script if you want a different mix.
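The core of that training step can be sketched with the `tokenizers` library's byte-level BPE class. Here a toy in-memory corpus and a tiny vocab stand in for the sampled dataset lines and the script's real `VOCAB_SIZE`:

```python
from tokenizers import ByteLevelBPETokenizer

# Toy corpus standing in for lines sampled from the dataset mix.
lines = ["hello world", "byte-level BPE handles any UTF-8 text", "καλημέρα"]

tok = ByteLevelBPETokenizer()  # GPT-2-compatible byte-level BPE
tok.train_from_iterator(lines, vocab_size=300, special_tokens=["<|endoftext|>"])

enc = tok.encode("hello world")
print(enc.tokens)  # byte-level pieces; decoding round-trips the input
```

Because the base alphabet covers all 256 byte values, a byte-level tokenizer never produces unknown tokens, which is why it is a safe default for mixed-domain corpora.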
`train_sparknet_v5.py` expects a packed dataset created by `scripts/build_dataset_v5.py`:

```bash
python scripts/build_dataset_v5.py
```

This script:
- loads the tokenizer from `tokenizer-v5/`
- samples text using the weighted mix in `DATA_SOURCES`
- tokenizes until `TARGET_TOKENS` (default 1B) are accumulated
- packs sequences into 1024-token blocks
- writes the result to `datasets/sparknet-v5-1b/`

Adjust `DATA_SOURCES`, `TARGET_TOKENS`, or `BLOCK_SIZE` in the script to suit your run.
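The packing step amounts to concatenating token ids and cutting fixed-size blocks. A minimal sketch — not the script's exact code, and note it drops a final partial block:

```python
def pack_blocks(token_stream, block_size=1024):
    """Greedily concatenate token-id lists and cut fixed-size blocks.
    A leftover tail shorter than block_size is discarded."""
    buffer, blocks = [], []
    for ids in token_stream:
        buffer.extend(ids)
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks

# Toy example with block_size=4:
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(pack_blocks(docs, block_size=4))  # → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing avoids padding entirely, so every position in every batch contributes to the loss.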
```bash
python scripts/train_sparknet.py
```

(Edit `RUN_NAME` or the config path inside the script if needed.) Key characteristics:
- Loads the mix defined in `configs/datasets_v1.json` (change the filename near the top).
- Streams datasets with `datasets.interleave_datasets`, lazily tokenizes, and packs batches on the fly.
- Useful when experimenting with proportions or when disk space is limited.
```bash
python scripts/train_sparknet_v5.py
```

- Consumes the packed dataset from `datasets/sparknet-v5-1b/`.
- Uses the custom tokenizer in `tokenizer-v5/`.
- Enables dropout, a cosine LR schedule, the fused AdamW optimizer, `load_best_model_at_end`, etc.
Both scripts will:
- write checkpoints to `checkpoints/<RUN_NAME>/` (model weights, tokenizer, metadata)
- log metrics and throughput to `logs/<RUN_NAME>/` (view with `tensorboard --logdir logs`)
- emit periodic text samples as `sample_stepXXXX.txt` inside the checkpoint directory
You can resume training by re-running the same script with the existing `output_dir`.
- `scripts/eval_perplexity.py`: compute perplexity on validation sets.
- `scripts/eval_generation.py`: run qualitative generations from prompts.
- `scripts/eval_benchmarks.py`: plug into downstream benchmark harnesses.

Each script accepts a checkpoint path (e.g., `checkpoints/sparknet-70m-v5`) and reuses the saved tokenizer and model.
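For reference, token-weighted perplexity can be computed like this. A sketch only — the repo's eval script may differ in stride handling and batching, and the checkpoint path is illustrative:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def perplexity(ckpt_dir: str, texts, block_size: int = 1024) -> float:
    """Mean token-level perplexity of a causal LM over a list of texts."""
    tok = AutoTokenizer.from_pretrained(ckpt_dir)
    model = AutoModelForCausalLM.from_pretrained(ckpt_dir).eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt",
                  truncation=True, max_length=block_size).input_ids
        out = model(input_ids=ids, labels=ids)  # HF shifts labels internally
        n_predicted = ids.size(1) - 1           # loss is averaged over these
        total_nll += out.loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / max(total_tokens, 1))

# Illustrative usage:
# print(perplexity("checkpoints/sparknet-70m-v5", ["The quick brown fox."]))
```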
- If you hit dataloader bottlenecks, reduce `dataloader_num_workers` inside the training script.
- Hugging Face streaming benefits from a warm cache; keep `HF_DATASETS_CACHE` on a fast SSD.
- When experimenting with batch sizes, keep `block_size * per_device_train_batch_size * gradient_accumulation_steps` within your GPU memory budget.
- For offline clusters without internet, set `HF_DATASETS_OFFLINE=1` and mirror the needed datasets beforehand.
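That batch-size product is also what converts the token budget into a step count. With the defaults mentioned above:

```python
# Tokens consumed per optimizer step with the default settings.
block_size = 1024
per_device_train_batch_size = 32
gradient_accumulation_steps = 2

tokens_per_step = (block_size
                   * per_device_train_batch_size
                   * gradient_accumulation_steps)
print(tokens_per_step)  # 65536

# Steps needed for the 1B-token budget (ceiling division).
target_tokens = 1_000_000_000
max_steps = -(-target_tokens // tokens_per_step)
print(max_steps)  # 15259
```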
Happy training!
