C++ transformer inference engine. Loads pretrained GPT-Neo weights, tokenizes with GPT-2 BPE, generates text autoregressively with KV-cached multi-head attention. No external dependencies — weights loaded via fread, math done with hand-written NEON intrinsics, memory managed with malloc/free.
Tested with TinyStories-33M (768d, 4 layers, 16 heads, 50257 vocab).
$ ./engine tinystories.bin --tokens "7454 2402 257 640" --tokenizer tokenizer.bin --temperature 0.8
Generated: , there was a little family lived in a big garden
in a house in a far far far, far, lived.. sunny world. [EOS]
124.53 tokens/sec | 8.03 ms/tok | Apple M2, single thread
Apple M2, 8 GB, single-threaded, -O2:
| Config | Params | tok/s | ms/tok |
|---|---|---|---|
| 128d / 2L / 4H | 460K | 17,627 | 0.06 |
| 256d / 4L / 8H | 3.28M | 3,156 | 0.32 |
| 512d / 6L / 8H | 26.7M | 519 | 1.93 |
| TinyStories-33M | 33M | 125 | 8.0 |
The small-model numbers are inflated because those weights fit entirely in the M2's L2 cache. The 33M number is the honest one: every generated token has to stream the full layer and lm_head weights from DRAM, so generation is memory-bandwidth-bound, which is typical for real inference.
g++ -O2 -std=c++17 -I include src/*.cpp -o engine

Tests:
g++ -O2 -std=c++17 -I include src/matmul.cpp src/utils.cpp src/transformer.cpp \
src/loader.cpp src/tokenizer.cpp tests/test_engine.cpp -o test_engine
./test_engine   # 49 tests

Requires transformers and torch for the one-time weight export. The engine itself has no Python dependency.
python3 scripts/export_tinystories.py
./engine models/tinystories.bin \
--tokens "7454 2402 257 640" \
--tokenizer models/tokenizer.bin \
--temperature 0.8 \
--gen_tokens 100

The --tokens flag takes pre-tokenized IDs from the GPT-2 tokenizer. This bypasses the C++ BPE encoder, which doesn't implement GPT-2's regex pre-tokenization and can produce slightly different splits. For demo accuracy, tokenize in Python first:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("roneneldan/TinyStories-33M")
print(tok.encode("Once upon a time"))  # [7454, 2402, 257, 640]

Pre-norm GPT-2/Neo. Each token goes through:
tok_embed[token] + pos_embed[pos]
-> N x { LN -> MHA(Q,K,V, KV cache) + residual -> LN -> FFN(GELU) + residual }
-> LN -> lm_head -> logits -> argmax or temperature sampling
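The final sampling step above can be sketched as follows. This is not the engine's actual code; the function name and the caller-supplied random `coin` are illustrative. Temperature divides the logits before the softmax, and temperature = 0 is assumed to be handled by the caller as plain argmax:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: sample a token index from logits at a given temperature.
// `coin` is a uniform random number in [0, 1) supplied by the caller.
int sample_logits(const std::vector<float>& logits, float temperature,
                  float coin) {
    std::vector<float> p(logits.size());
    float maxv = logits[0];
    for (float l : logits) maxv = std::max(maxv, l);
    // softmax of logits / temperature, shifted by the max for stability
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp((logits[i] - maxv) / temperature);
        sum += p[i];
    }
    // walk the CDF until the coin falls inside a bucket
    float cum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        cum += p[i] / sum;
        if (coin < cum) return (int)i;
    }
    return (int)logits.size() - 1; // guard against float rounding
}
```

Lower temperatures sharpen the distribution toward argmax; higher ones flatten it, which is why the demo's `--temperature 0.8` output stays mostly coherent but still varies.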
Attention uses implicit causal masking — the inner loop only iterates [0..pos], so no explicit mask tensor is allocated.
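A minimal single-head sketch of that loop (names and layout illustrative, not the engine's actual code): the score and value loops only visit cached positions `0..pos`, so causality falls out of the loop bounds with no mask tensor:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: attention output for the token at position `pos`, one head.
// k_cache / v_cache hold one row per cached position.
std::vector<float> attend_causal(const std::vector<float>& q,        // [head_dim]
                                 const std::vector<float>& k_cache,  // [seq, head_dim]
                                 const std::vector<float>& v_cache,  // [seq, head_dim]
                                 int pos, int head_dim) {
    std::vector<float> scores(pos + 1);
    float scale = 1.0f / std::sqrt((float)head_dim);
    for (int t = 0; t <= pos; ++t) {                  // only [0..pos]: implicit mask
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            dot += q[d] * k_cache[t * head_dim + d];
        scores[t] = dot * scale;
    }
    // softmax over the visible prefix
    float maxv = scores[0];
    for (float s : scores) maxv = std::max(maxv, s);
    float sum = 0.0f;
    for (float& s : scores) { s = std::exp(s - maxv); sum += s; }
    for (float& s : scores) s /= sum;
    // weighted sum of cached values
    std::vector<float> out(head_dim, 0.0f);
    for (int t = 0; t <= pos; ++t)
        for (int d = 0; d < head_dim; ++d)
            out[d] += scores[t] * v_cache[t * head_dim + d];
    return out;
}
```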
KV cache layout is [n_layers, seq_len, dim] contiguous. Each new token writes one row via memcpy. This avoids the quadratic recompute that makes naive attention unusable for generation.
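The cache write can be sketched like this (function name illustrative): with a contiguous [n_layers, seq_len, dim] layout, the row for (layer, pos) is a single offset computation plus one memcpy:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Sketch: append one token's K (or V) row into a contiguous
// [n_layers, seq_len, dim] float cache. Earlier rows are never touched
// again, which is what avoids the quadratic recompute.
void kv_cache_write(float* cache, const float* row,
                    int layer, int pos, int seq_len, int dim) {
    float* dst = cache + ((size_t)layer * seq_len + pos) * dim;
    std::memcpy(dst, row, dim * sizeof(float));
}
```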
Header (28 bytes, 7 x int32):
0x454E4731 magic
vocab_size 50257 for GPT-2
dim 768 for TinyStories-33M
hidden_dim 3072
n_layers 4
n_heads 16
seq_len 512
Body (float32, row-major, sequential):
token_embedding [vocab_size, dim]
pos_embedding [seq_len, dim]
per layer:
wq, wk, wv, wo [dim, dim] each
bo [dim]
w1, b1 [dim, hidden_dim], [hidden_dim]
w2, b2 [hidden_dim, dim], [dim]
ln1 gamma/beta [dim] each
ln2 gamma/beta [dim] each
ln_final gamma/beta [dim] each
lm_head [dim, vocab_size]
The v1 format (no magic, no pos embeddings, no biases) is auto-detected and still loads.
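The header above maps naturally onto a packed struct of seven int32s; a sketch of the layout and the magic check that distinguishes v2 from v1 (struct and function names are illustrative, not the loader's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch: the 28-byte v2 header, field order matching the spec above.
struct EngineHeader {
    int32_t magic;       // 0x454E4731
    int32_t vocab_size;  // 50257 for GPT-2
    int32_t dim;
    int32_t hidden_dim;
    int32_t n_layers;
    int32_t n_heads;
    int32_t seq_len;
};

// A v1 file has no magic, so the first word is vocab_size instead;
// failing this check would route the loader to the legacy path.
bool is_v2_header(const unsigned char* buf) {
    int32_t magic;
    std::memcpy(&magic, buf, sizeof(magic));
    return magic == 0x454E4731;
}
```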
- BPE encoder doesn't implement GPT-2's regex-based word splitting. Use --tokens for exact results.
- No quantization. Weights are float32. The 33M model is 404 MB on disk.
- Single-threaded. No batching.
- GELU uses the tanh approximation (gelu_new), not exact GELU.
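For reference, the gelu_new approximation is a one-liner. This is the standard tanh formula GPT-2/Neo use, sketched here rather than copied from the engine's source:

```cpp
#include <cassert>
#include <cmath>

// Tanh-based "gelu_new" approximation:
//   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
float gelu_new(float x) {
    const float c = 0.7978845608f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}
```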
MIT
