Tokenizer for language models.

Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; on the web, on the client, and on any platform.
```python
from kitoken import Kitoken

encoder = Kitoken.from_web("hf:Qwen/Qwen3.5-9B")
tokens = encoder.encode("hello world!", True)
string = encoder.decode(tokens).decode("utf-8")
assert string == "hello world!"
```

Kitoken is a fast and versatile tokenizer for language models, compatible with SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken, and supporting BPE, Unigram and WordPiece tokenization.
- **Fast and efficient tokenization**
  Faster than most other tokenizers in both common and uncommon scenarios; see the benchmarks for comparisons with different datasets.
- **Runs in all environments**
  Native in Rust, with bindings for Web, Node and Python; see kitoken.dev for a web demo.
- **Supports input and output processing**
  Including unicode-aware normalization, pre-tokenization and post-processing options.
- **Compact data encoding**
  Definitions are stored in an efficient binary format and without a merge list.
See the main README for more information.