Tokenizer for language models.

Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many other models: in the browser, on the client, and on any platform.
```js
import { Kitoken } from "kitoken/node"
import fs from "node:fs"

const model = fs.readFileSync("models/llama4.model")
const encoder = new Kitoken(model)

const tokens = encoder.encode("hello world!", true)
const string = new TextDecoder().decode(encoder.decode(tokens))
```

Kitoken is a fast and versatile tokenizer for language models. It is compatible with SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken, and supports BPE, Unigram and WordPiece tokenization.
- **Fast and efficient tokenization**
  Faster than most other tokenizers in both common and uncommon scenarios; see the benchmarks for comparisons with different datasets.
- **Runs in all environments**
  Native in Rust, with bindings for Web, Node and Python; see kitoken.dev for a web demo.
- **Supports input and output processing**
  Including Unicode-aware normalization, pre-tokenization and post-processing options.
- **Compact data encoding**
  Definitions are stored in an efficient binary format, without a merge list.
See the main README for more information.
The JavaScript package provides multiple exports:
| Export | Description |
|---|---|
| `kitoken` | The default export, importing the WebAssembly file directly. Usable with Webpack and other bundlers. |
| `kitoken/node` | Uses Node.js functions to read the WebAssembly file from the file system. Provides support for additional split strategies and regex optimizations. |
| `kitoken/web` | Can be used in web browsers without a bundler; uses `new URL(..., import.meta.url)` to load the WebAssembly file. |
| `kitoken/minimal` | Smallest file size. Similar to the default export, but only supports initialization from `.kit` definitions. |
| `kitoken/full` | Largest file size. Similar to the default export, but provides support for additional split strategies and regex optimizations. |
See also the Node test and the Web example.