Tokenizer for language models.

Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; on the web, on the client, and on any platform.
```python
from kitoken import Kitoken

encoder = Kitoken.from_web("hf:Qwen/Qwen3.5-9B")
tokens = encoder.encode("hello world!", True)
string = encoder.decode(tokens).decode("utf-8")
assert string == "hello world!"
```

Kitoken is a fast and versatile tokenizer for language models, compatible with SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken, and supporting BPE, Unigram and WordPiece tokenization.
- **Fast and efficient tokenization**
  Faster than most other tokenizers in both common and uncommon scenarios; see the benchmarks for comparisons with different datasets.
- **Runs in all environments**
  Native in Rust, with bindings for Web, Node and Python; see kitoken.dev for a web demo.
- **Supports input and output processing**
  Including unicode-aware normalization, pre-tokenization and post-processing options.
- **Compact data encoding**
  Definitions are stored in an efficient binary format and without a merge list.
See the main README for more information.