1 change: 0 additions & 1 deletion .python-version

This file was deleted.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "splintr"
version = "0.3.0"
version = "0.4.0"
edition = "2021"
description = "Fast Rust BPE tokenizer with Python bindings"
license = "MIT"
43 changes: 28 additions & 15 deletions README.md
@@ -2,7 +2,7 @@

[![Crates.io](https://img.shields.io/crates/v/splintr.svg)](https://crates.io/crates/splintr) [![PyPI](https://img.shields.io/pypi/v/splintr-rs.svg)](https://pypi.org/project/splintr-rs/) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

-**A high-performance BPE tokenizer built in Rust with Python bindings, focused on speed, safety, and resource optimization.**
+**A high-performance BPE tokenizer written in Rust with Python bindings, focused on speed, safety, and resource optimization.**

## The Problem

@@ -35,9 +35,12 @@ pip install splintr-rs
```python
from splintr import Tokenizer

-# Load a pretrained vocabulary
+# Load a pretrained vocabulary (OpenAI)
tokenizer = Tokenizer.from_pretrained("cl100k_base")

+# Or load the Llama 3 tokenizer (Meta) - supports all versions up to Llama 3.3
+# tokenizer = Tokenizer.from_pretrained("llama3")

# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
print(tokens) # [9906, 11, 1917, 0]
@@ -56,7 +59,7 @@ print(batch_tokens) # [[9906, 11, 1917, 0], [4438, 527, 499, 30], ...]

```toml
[dependencies]
splintr = "0.3.0"
splintr = "0.4.0"
```

```rust
@@ -88,7 +91,7 @@ let batch_tokens = tokenizer.encode_batch(&texts);

**Built for production:**

-- **Compatible vocabularies** - Supports cl100k_base and o200k_base (OpenAI models), with a familiar API
+- **Compatible vocabularies** - Supports cl100k_base, o200k_base (OpenAI), and the Llama 3 family (Meta), with a familiar API
- **Streaming decoder** - Real-time LLM output display with proper UTF-8 handling; see the sketch after this list
- **54 agent tokens** - Built-in support for chat, CoT reasoning, ReAct agents, tool calling, RAG citations
- **Battle-tested algorithms** - PCRE2 with JIT, Aho-Corasick for special tokens, linked-list BPE
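A minimal sketch of what the streaming-decoder bullet describes, for orientation. The constructor and per-token method names here (`stream_decoder`, `step`) are assumptions for illustration; the only call this diff actually shows is `decoder.flush()`:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Hypothetical names: only decoder.flush() appears in this diff.
decoder = tokenizer.stream_decoder()

# Feed token IDs one at a time, as an LLM would emit them. A streaming
# decoder buffers partial multi-byte sequences, so a UTF-8 character
# split across two tokens is printed only once it is complete.
for token_id in tokenizer.encode("Hello, 世界!"):
    print(decoder.step(token_id), end="", flush=True)

print(decoder.flush())  # drain any bytes still buffered at end of stream
```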
@@ -242,7 +245,7 @@ print(decoder.flush())

```python
# Load pretrained model (includes vocabulary and special tokens)
-tokenizer = Tokenizer.from_pretrained("cl100k_base") # or "o200k_base"
+tokenizer = Tokenizer.from_pretrained("cl100k_base") # or "o200k_base", "llama3"

# Load from custom vocabulary file
tokenizer = Tokenizer(
@@ -295,32 +298,42 @@ See the [API documentation](https://docs.rs/splintr) for complete details.

## Supported Vocabularies

-| Vocabulary | Used By | Vocabulary Size | Special Tokens | Import Constant |
-| ------------- | -------------------- | --------------- | -------------- | --------------------- |
-| `cl100k_base` | GPT-4, GPT-3.5-turbo | ~100,000 | 5 + 54 agent | `CL100K_BASE_PATTERN` |
-| `o200k_base` | GPT-4o | ~200,000 | 2 + 54 agent | `O200K_BASE_PATTERN` |
+| Vocabulary | Used By | Vocabulary Size | Special Tokens | Import Constant |
+| ------------- | ----------------------------- | --------------- | -------------- | --------------------- |
+| `cl100k_base` | GPT-4, GPT-3.5-turbo | ~100,000 | 5 + 54 agent | `CL100K_BASE_PATTERN` |
+| `o200k_base` | GPT-4o | ~200,000 | 2 + 54 agent | `O200K_BASE_PATTERN` |
+| `llama3` | Llama 3, 3.1, 3.2, 3.3 (Meta) | ~128,000 | 11 + 54 agent | `LLAMA3_PATTERN` |

**OpenAI standard tokens:**

- **cl100k_base**: `<|endoftext|>`, `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`, `<|endofprompt|>`
- **o200k_base**: `<|endoftext|>`, `<|endofprompt|>`

+**Meta Llama 3 standard tokens:**
+
+- **llama3**: `<|begin_of_text|>`, `<|end_of_text|>`, `<|start_header_id|>`, `<|end_header_id|>`, `<|eot_id|>`, `<|eom_id|>` (3.1+), `<|python_tag|>` (3.1+), `<|step_id|>` (3.2-Vision), `<|image|>` (3.2-Vision)
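A quick way to see the three vocabularies side by side is to encode the same string with each pretrained tokenizer. This sketch uses only the `from_pretrained` and `encode` calls shown in the Quick Start:

```python
from splintr import Tokenizer

# Same input, three vocabularies: token counts and IDs differ because
# each vocabulary was trained with different BPE merges.
text = "Hello, world!"
for name in ("cl100k_base", "o200k_base", "llama3"):
    tok = Tokenizer.from_pretrained(name)
    ids = tok.encode(text)
    print(f"{name}: {len(ids)} tokens -> {ids}")
```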

### Agent Tokens (54 per model)

-Splintr extends both vocabularies with tokens for building agent systems. See [docs/special_tokens.md](docs/special_tokens.md) for complete documentation.
+Splintr extends all vocabularies with tokens for building agent systems. See [docs/special_tokens.md](docs/special_tokens.md) for complete documentation.

```python
-from splintr import Tokenizer, CL100K_AGENT_TOKENS
+from splintr import Tokenizer, CL100K_AGENT_TOKENS, LLAMA3_AGENT_TOKENS

+# OpenAI models
tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Encode with special tokens
text = "<|think|>Let me reason...<|/think|>The answer is 42."
tokens = tokenizer.encode_with_special(text)

# Access token IDs programmatically
print(CL100K_AGENT_TOKENS.THINK) # 100282
print(CL100K_AGENT_TOKENS.FUNCTION) # 100292

+# Llama 3 models (vocabulary includes all special tokens up to Llama 3.3)
+tokenizer = Tokenizer.from_pretrained("llama3")
+tokens = tokenizer.encode_with_special(text)
+print(LLAMA3_AGENT_TOKENS.THINK) # 128305
+print(LLAMA3_AGENT_TOKENS.FUNCTION) # 128315
+print(LLAMA3_AGENT_TOKENS.BEGIN_OF_TEXT) # 128000 (official Meta token)
+print(LLAMA3_AGENT_TOKENS.IMAGE) # 128256 (official Meta 3.2-Vision token)
```
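A common use of the think tokens is stripping hidden reasoning before showing output to a user. The following is a sketch, not the library's documented API: it assumes a `decode` method exists, and since this diff does not show a constant for the closing tag, it recovers that ID by encoding the literal `<|/think|>`:

```python
from splintr import Tokenizer, CL100K_AGENT_TOKENS

tokenizer = Tokenizer.from_pretrained("cl100k_base")

text = "<|think|>Let me reason...<|/think|>The answer is 42."
tokens = tokenizer.encode_with_special(text)

# Derive the closing-tag ID by encoding the literal tag, rather than
# guessing at a constant name this diff does not show.
think_open = CL100K_AGENT_TOKENS.THINK  # 100282
think_close = tokenizer.encode_with_special("<|/think|>")[0]

# Drop everything between the think tags, inclusive.
if think_open in tokens and think_close in tokens:
    start = tokens.index(think_open)
    end = tokens.index(think_close)
    tokens = tokens[:start] + tokens[end + 1:]

print(tokenizer.decode(tokens))  # "The answer is 42."
```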

| Category | Tokens | Purpose |