feat: add DeepSeek V3 tokenizer with ByteLevel BPE support #6

farhan-syah · 2025-11-26T19:55:16Z

Summary

Add DeepSeek V3 tokenizer with full ByteLevel BPE encoding support
Implement ByteLevelStreamingDecoder for streaming LLM output from ByteLevel tokenizers
Add comprehensive documentation and restructure README for better organization
Bump version to 0.6.0

DeepSeek V3 Tokenizer

128,000 BPE tokens with ByteLevel encoding (GPT-2 style byte-to-unicode mapping)
Native DeepSeek special tokens: <think>, </think>, <｜User｜>, <｜Assistant｜>, <|EOT|>, FIM tokens, tool calling tokens
54 Splintr agent tokens (128900-128953)
Full encode/decode roundtrip support for all UTF-8 text including Chinese, emoji, etc.
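
For readers unfamiliar with the GPT-2 style byte-to-unicode mapping mentioned above, here is a minimal Python sketch of the publicly known GPT-2 table. It is not the Rust code from this PR, and the names `bytes_to_unicode`, `byte_level_encode`, and `byte_level_decode` are illustrative only. Printable ASCII and Latin-1 bytes keep their own code point, and every other byte is shifted to U+0100 and above, which makes the mapping bijective.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build a GPT-2 style bijective byte -> printable character table."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in keep}
    # Remaining bytes (control chars, space, 0x7F-0xA0, 0xAD) move to U+0100 and up.
    shifted = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shifted)
            shifted += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level_encode(text: str) -> str:
    """Map each UTF-8 byte of `text` to its printable stand-in character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def byte_level_decode(mapped: str) -> str:
    """Invert byte_level_encode: stand-in characters -> raw bytes -> UTF-8 text."""
    return bytes(CHAR_TO_BYTE[c] for c in mapped).decode("utf-8")
```

Because the table covers all 256 byte values exactly once, any UTF-8 input (Chinese, emoji, control characters) survives an encode/decode roundtrip in this sketch.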

ByteLevel Streaming Decoder

New ByteLevelStreamingDecoder for token-by-token LLM output decoding
Two-stage decoding: ByteLevel → raw bytes → UTF-8 assembly
Handles incomplete UTF-8 sequences across token boundaries
Available via tokenizer.byte_level_streaming_decoder() in Python
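
The two-stage idea above can be illustrated with a small self-contained Python sketch. It is an assumption-laden illustration, not the Rust implementation added in this PR: the class name `SketchByteLevelStreamingDecoder`, the helper `build_char_to_byte`, and the toy `vocab` dictionary are hypothetical, and only the method names `add_token`, `flush`, and `reset` mirror those listed for the Python bindings. Stage one maps ByteLevel characters back to raw bytes; stage two uses the standard library's incremental UTF-8 decoder to hold back incomplete sequences.

```python
import codecs


def build_char_to_byte() -> dict[str, int]:
    """Inverse of the GPT-2 style byte-to-unicode table (assumed mapping)."""
    keep = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table, shifted = {}, 0
    for b in range(256):
        if b in keep:
            table[chr(b)] = b
        else:
            table[chr(256 + shifted)] = b
            shifted += 1
    return table


class SketchByteLevelStreamingDecoder:
    """Hypothetical token-by-token decoder over a toy id -> token-string vocab."""

    def __init__(self, vocab: dict[int, str]):
        self._vocab = vocab
        self._char_to_byte = build_char_to_byte()
        # The incremental decoder buffers trailing bytes of an unfinished code point.
        self._utf8 = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def add_token(self, token_id: int) -> str:
        # Stage 1: ByteLevel characters -> raw bytes.
        raw = bytes(self._char_to_byte[c] for c in self._vocab[token_id])
        # Stage 2: emit only the text that is complete so far.
        return self._utf8.decode(raw)

    def flush(self) -> str:
        # Any bytes still buffered are incomplete; they become U+FFFD here.
        return self._utf8.decode(b"", final=True)

    def reset(self) -> None:
        self._utf8.reset()
```

In this sketch, a token that ends in the middle of a multi-byte character yields an empty string, and the character appears once the token carrying its remaining bytes arrives.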

Documentation

New /docs/api_guide.md with comprehensive API reference and examples
New /docs/bytelevel_bpe.md explaining ByteLevel encoding
Updated /docs/special_tokens.md with DeepSeek V3 tokens
Streamlined README with links to detailed docs

Add complete DeepSeek V3 tokenizer implementation with 128,000 token vocabulary and 71 special tokens (17 native + 54 agent). DeepSeek V3 uses ByteLevel BPE encoding, requiring new infrastructure for handling arbitrary byte sequences through printable Unicode characters.

DeepSeek V3 features:
- Native thinking tokens (<think>, </think>) for CoT reasoning
- Native role tokens (<｜User｜>, <｜Assistant｜>, <|EOT|>)
- FIM tokens for code completion (<｜fim▁hole｜>, etc.)
- Tool calling tokens for function execution
- Compatible with DeepSeek V3 and DeepSeek R1 models

ByteLevel BPE infrastructure:
- Implement encoder/decoder with bijective byte-to-char mapping
- Extend Tokenizer with use_byte_level flag and constructors
- Preserve printable ASCII/Latin-1, map control chars to U+0100+
- Add comprehensive documentation in docs/bytelevel_bpe.md
- Lazy-initialized lookup tables for optimal performance

Additional improvements:
- Add "Exact Token ID Tests" to all tokenizers (cl100k, o200k, llama3)
- Verify specific token IDs to prevent encoding/vocabulary regressions
- Test cases cover: basic text, Chinese, emojis with expected IDs
- Vocabulary conversion script: scripts/convert_deepseek_vocab.py
- Update README and docs/special_tokens.md with DeepSeek V3 info

Add ByteLevelStreamingDecoder to handle streaming decode for tokenizers using ByteLevel BPE encoding (DeepSeek V3, GPT-2). The decoder performs two-stage decoding: first converts ByteLevel-encoded token bytes back to raw bytes, then assembles them into valid UTF-8 strings.

Changes:
- Implement ByteLevelStreamingDecoder in Rust core with UTF-8 buffering
- Add Python bindings with full API (add_token, add_tokens, flush, reset)
- Export from all module levels (src/core, src/lib, src/python, python)
- Add comprehensive test suite covering ASCII, Unicode, emoji, special tokens
- Document streaming decoder usage in bytelevel_bpe.md with examples

Create comprehensive API guide and streamline README for better maintainability. Move detailed API documentation, usage examples, and performance tips to a dedicated guide.

Changes:
- Add docs/api_guide.md with complete Python and Rust API reference
- Include detailed examples for encoding, decoding, and streaming
- Document both StreamingDecoder and ByteLevelStreamingDecoder usage
- Add performance optimization tips and best practices
- Streamline README to focus on quick start and key features
- Replace verbose inline docs with links to API guide

farhan-syah added 4 commits November 27, 2025 02:50

chore: bump version to 0.6.0

b1761be

farhan-syah force-pushed the feat/add-deepseek-vocab branch from ad47a9c to b1761be November 26, 2025 19:58

farhan-syah merged commit fd98ef3 into main Nov 26, 2025
5 checks passed

farhan-syah deleted the feat/add-deepseek-vocab branch December 2, 2025 17:50
