kitoken

Tokenizer for language models.

Tokenize text for Llama, Gemini, GPT-4, DeepSeek, Mistral and many others; on the web, on the client and on any platform.

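import fs from "node:fs"
import { Kitoken } from "kitoken/node"

// Load a tokenizer definition from disk and initialize the encoder.
const model = fs.readFileSync("models/llama4.model")
const encoder = new Kitoken(model)

// Encode text to tokens, then decode the tokens back to a string.
const tokens = encoder.encode("hello world!", true)
const string = new TextDecoder().decode(encoder.decode(tokens))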
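
The snippet above passes the result of decode through a TextDecoder, which suggests that decode returns raw UTF-8 bytes rather than a string. If tokens are decoded in several batches (an assumption for illustration, not something this README describes), a multi-byte character may be split across two byte chunks. The following sketch uses only the standard TextEncoder and TextDecoder globals, with hand-made chunks standing in for decode output, to show why a single streaming decoder is safer than decoding each chunk on its own:

```javascript
// Stand-in for bytes returned by a tokenizer's decode(); no kitoken calls are made here.
const bytes = new TextEncoder().encode("héllo ✓")

// Split in the middle of the two-byte character "é" to simulate batched output.
const chunks = [bytes.slice(0, 2), bytes.slice(2)]

// Decode all chunks with one decoder so partial characters are buffered between calls.
function decodeChunks(parts) {
  const decoder = new TextDecoder("utf-8")
  let text = ""
  for (const part of parts) {
    text += decoder.decode(part, { stream: true })
  }
  // A final call without input flushes any buffered bytes.
  return text + decoder.decode()
}

// Decoding each chunk independently replaces the split character with U+FFFD.
const naive = chunks.map((part) => new TextDecoder().decode(part)).join("")
const streamed = decodeChunks(chunks)

console.log(naive === "héllo ✓")    // false
console.log(streamed === "héllo ✓") // true
```

When all tokens are decoded in one call, as in the snippet above, a plain new TextDecoder().decode(...) is sufficient.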

Overview

Kitoken is a fast and versatile tokenizer for language models compatible with SentencePiece, HuggingFace Tokenizers, OpenAI Tiktoken and Mistral Tekken, supporting BPE, Unigram and WordPiece tokenization.

  • Fast and efficient tokenization
    Faster than most other tokenizers in both common and uncommon scenarios; see the benchmarks for comparisons with different datasets.
  • Runs in all environments
    Native in Rust and with bindings for Web, Node and Python; see kitoken.dev for a web demo.
  • Supports input and output processing
    Including unicode-aware normalization, pre-tokenization and post-processing options.
  • Compact data encoding
    Definitions are stored in an efficient binary format, without a merge list.

See the main README for more information.

Usage

The JavaScript package provides multiple exports:

| Export | Description |
| --- | --- |
| kitoken | The default export, importing the WebAssembly file directly. Usable with Webpack and other bundlers. |
| kitoken/node | Uses Node.js functions to read the WebAssembly file from the file system. Provides support for additional split strategies and regex optimizations. |
| kitoken/web | Can be used in web browsers without a bundler; uses new URL(..., import.meta.url) to load the WebAssembly file. |
| kitoken/minimal | Smallest file size. Similar to the default export, but only supports initialization from .kit definitions. |
| kitoken/full | Largest file size. Similar to the default export, but provides support for additional split strategies and regex optimizations. |

See also the Node test and the Web example.
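
The export table above can be read as a small decision procedure. The sketch below encodes one possible reading of it; pickKitokenExport and its option names (runtime, bundler, kitOnly, extraSplit) are hypothetical and not part of the kitoken package, and the order of the checks is an assumption rather than an official recommendation.

```javascript
// Hypothetical helper, not part of kitoken: maps a deployment scenario to one
// of the export specifiers listed in the table above.
function pickKitokenExport({ runtime, bundler = false, kitOnly = false, extraSplit = false }) {
  // Node.js can read the WebAssembly file from the file system.
  if (runtime === "node") return "kitoken/node"
  // Browsers without a bundler load the file via new URL(..., import.meta.url).
  if (!bundler) return "kitoken/web"
  // With a bundler, trade file size against features.
  if (kitOnly) return "kitoken/minimal"
  if (extraSplit) return "kitoken/full"
  return "kitoken"
}

console.log(pickKitokenExport({ runtime: "node" }))                      // kitoken/node
console.log(pickKitokenExport({ runtime: "browser" }))                   // kitoken/web
console.log(pickKitokenExport({ runtime: "browser", bundler: true }))    // kitoken
console.log(pickKitokenExport({ runtime: "browser", bundler: true, kitOnly: true })) // kitoken/minimal
```

The returned specifier could then be used in a static import chosen at build time; the class and methods exposed by each export are as described in the usage snippet at the top of this README.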