
kitoken

Available on Crates.io, NPM and PyPI.

Tokenizer for language models.

Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many other models; in the browser, on the server and on any other platform.

from kitoken import Kitoken

# Fetch a tokenizer definition over the network
encoder = Kitoken.from_web("hf:Qwen/Qwen3.5-9B")

# Encode to token ids, then decode the resulting bytes back to text
tokens = encoder.encode("hello world!", True)
string = encoder.decode(tokens).decode("utf-8")

assert string == "hello world!"

Overview

Kitoken is a fast and versatile tokenizer for language models compatible with SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken, supporting BPE, Unigram and WordPiece tokenization.

  • Fast and efficient tokenization
    Faster than most other tokenizers in both common and uncommon scenarios; see the benchmarks for comparisons with different datasets.
  • Runs in all environments
    Native in Rust and with bindings for Web, Node and Python; see kitoken.dev for a web demo.
  • Supports input and output processing
    Including unicode-aware normalization, pre-tokenization and post-processing options.
  • Compact data encoding
Definitions are stored in an efficient binary format, without a merge list.
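Storing BPE definitions without an explicit merge list works because, when token ranks are assigned in merge order, the merge priorities can be recovered from the vocabulary itself: the adjacent pair whose concatenation has the lowest rank is merged first. This is a minimal sketch of that idea (as popularized by rank-based tokenizers such as tiktoken); the `ranks` vocabulary and `bpe_encode` helper are illustrative only and are not kitoken's API.

```python
def bpe_encode(text: str, ranks: dict[bytes, int]) -> list[int]:
    """Byte-level BPE using only a rank table, with no separate merge list."""
    # Start from individual bytes, then repeatedly merge the adjacent
    # pair whose concatenation has the lowest (earliest-learned) rank.
    parts = [bytes([b]) for b in text.encode("utf-8")]
    while len(parts) > 1:
        best = min(
            range(len(parts) - 1),
            key=lambda i: ranks.get(parts[i] + parts[i + 1], float("inf")),
        )
        if parts[best] + parts[best + 1] not in ranks:
            break  # no adjacent pair can be merged any further
        parts = parts[:best] + [parts[best] + parts[best + 1]] + parts[best + 2:]
    return [ranks[p] for p in parts]

# Toy vocabulary: all single bytes first, then merged tokens in merge order.
ranks = {bytes([b]): b for b in range(256)}
ranks[b"he"] = 256
ranks[b"ll"] = 257
ranks[b"hell"] = 258
ranks[b"hello"] = 259

print(bpe_encode("hello", ranks))  # collapses to the single token 259
```

Because the rank order encodes the merge order, shipping only the vocabulary is enough, which is part of what keeps the binary definition format compact.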

See the main README for more information.