mojo-tokenizer

High-performance BPE tokenizer for Mojo — 144M tok/s decoding, 3.1x faster than tiktoken.

Performance

Implementation     Decoding       Encoding        vs tiktoken (decoding)
mojo-tokenizer     144 M tok/s    8.0 M tok/s     3.1x faster
rs-bpe (Rust)      121 M tok/s    10.0 M tok/s*   2.6x faster
tiktoken (Rust)    47 M tok/s     5.1 M tok/s     baseline

*The rs-bpe figure measures raw BPE only; the mojo-tokenizer figures include the full pretokenization pipeline.

Benchmarked on Apple Silicon (M3 Ultra), sherlock.txt (607KB, 143K tokens), 20 iterations.

Full benchmarks and methodology →

Quick Start

from mojo_tokenizer import Tokenizer

# Load OpenAI's o200k_base vocabulary (GPT-4o, gpt-oss)
var tokenizer = Tokenizer.from_tiktoken("o200k_base")

# Encode text to tokens
var tokens = tokenizer.encode("Hello, world!")
print(tokens)  # [13225, 11, 2375, 0]

# Decode tokens back to text (144M tok/s)
var text = tokenizer.decode(tokens)
print(text)  # "Hello, world!"

Installation

git clone https://github.com/atsentia/mojo-tokenizer.git
cd mojo-tokenizer
mojo run bench_comprehensive.mojo  # Verify performance

Supported Formats

Format             Status         Models
o200k_base         ✓ Verified     gpt-5.2, gpt-oss-120B/20B, GPT-4o
cl100k_base        ✓ Verified     GPT-4, ChatGPT
HuggingFace BPE    Experimental   Qwen, Llama, Mistral
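
The Quick Start entry point should also cover the other verified vocabulary. The snippet below assumes from_tiktoken accepts the "cl100k_base" name as well (only "o200k_base" is demonstrated above), so treat it as a sketch rather than documented usage.

from mojo_tokenizer import Tokenizer

# Assumption: from_tiktoken accepts "cl100k_base" in addition to "o200k_base".
var tokenizer = Tokenizer.from_tiktoken("cl100k_base")
var tokens = tokenizer.encode("The quick brown fox")
print(len(tokens))  # number of cl100k_base tokens produced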

How It Works

FlatTokenStorage — Decoding at 144M tok/s via contiguous byte array:

# Traditional layout: 100K individual pointer dereferences
# FlatTokenStorage: a series of memcpy() calls from one contiguous byte array
memcpy(dest, flat_data + offsets[token_id], lengths[token_id])
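
To make the layout concrete, here is a minimal decoding sketch of the same idea: every token's bytes sit back-to-back in one flat buffer, and two tables indexed by token id record where each token starts and how long it is. All names (decode_flat, flat_bytes, offsets, lengths) are illustrative, not the library's actual API.

from collections import List

fn decode_flat(
    tokens: List[Int], flat_bytes: List[UInt8],
    offsets: List[Int], lengths: List[Int]
) -> List[UInt8]:
    # Illustrative sketch of the flat-storage idea, not the real implementation.
    var out = List[UInt8]()
    for i in range(len(tokens)):
        var tid = tokens[i]
        var start = offsets[tid]
        # One contiguous copy per decoded token; the real code turns this
        # inner loop into a single memcpy over lengths[tid] bytes.
        for j in range(lengths[tid]):
            out.append(flat_bytes[start + j])
    return out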

O(n) Backtracking BPE — Single-pass encoding with precomputed merge tables (ported from rs-bpe).
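
The backtracking pass itself is intricate, so the sketch below only illustrates the rank-driven merging that the precomputed tables accelerate: repeatedly find the adjacent pair with the lowest merge rank and replace it with its merged token. The parallel-list merge table (left, right, new_id) and all function names are illustrative, and this naive loop is far from O(n); it just shows what the merge tables encode.

from collections import List

# Return the merge rank of pair (a, b), or -1 if no rule merges it.
# Rule k merges (left[k], right[k]) into new_id[k]; lower k = higher priority.
fn pair_rank(a: Int, b: Int, left: List[Int], right: List[Int]) -> Int:
    for k in range(len(left)):
        if left[k] == a and right[k] == b:
            return k
    return -1

# Repeatedly apply the lowest-ranked applicable merge until none remain.
fn bpe_merge(ids: List[Int], left: List[Int], right: List[Int], new_id: List[Int]) -> List[Int]:
    var seq = List[Int]()
    for i in range(len(ids)):
        seq.append(ids[i])
    var changed = True
    while changed:
        changed = False
        var best_rank = -1
        var best_pos = -1
        for pos in range(len(seq) - 1):
            var r = pair_rank(seq[pos], seq[pos + 1], left, right)
            if r != -1 and (best_rank == -1 or r < best_rank):
                best_rank = r
                best_pos = pos
        if best_pos != -1:
            # Rebuild the sequence with the winning pair collapsed to one id.
            var merged = List[Int]()
            for pos in range(len(seq)):
                if pos == best_pos:
                    merged.append(new_id[best_rank])
                elif pos != best_pos + 1:
                    merged.append(seq[pos])
            seq = merged^
            changed = True
    return seq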

PairCache1000 — O(1) array lookup for common token pairs (+21% encoding speedup).
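
One plausible shape for such a cache, assuming it maps pairs of small token ids directly to their merged id through a flat array (the actual PairCache1000 layout may differ; names below are illustrative):

from collections import List

alias CACHE_DIM = 1000  # only pairs whose token ids are both below 1000 are cached

fn build_pair_cache(left: List[Int], right: List[Int], new_id: List[Int]) -> List[Int]:
    # Flat CACHE_DIM x CACHE_DIM table: entry [a * CACHE_DIM + b] holds the
    # merged id for the pair (a, b), or -1 if that pair has no cached merge.
    var cache = List[Int]()
    for _ in range(CACHE_DIM * CACHE_DIM):
        cache.append(-1)
    for k in range(len(left)):
        if left[k] < CACHE_DIM and right[k] < CACHE_DIM:
            cache[left[k] * CACHE_DIM + right[k]] = new_id[k]
    return cache

fn cached_pair_lookup(a: Int, b: Int, cache: List[Int]) -> Int:
    # O(1) hit for frequent low-id pairs; -1 means "not cached" and the caller
    # falls back to the full merge-table lookup.
    if a < CACHE_DIM and b < CACHE_DIM:
        return cache[a * CACHE_DIM + b]
    return -1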

License

Apache 2.0
