@farhan-syah
Collaborator

Summary

  • Add DeepSeek V3 tokenizer with full ByteLevel BPE encoding support
  • Implement ByteLevelStreamingDecoder for streaming LLM output from ByteLevel tokenizers
  • Add comprehensive documentation and restructure README for better organization
  • Bump version to 0.6.0

DeepSeek V3 Tokenizer

  • 128,000 BPE tokens with ByteLevel encoding (GPT-2 style byte-to-unicode mapping)
  • Native DeepSeek special tokens: <think>, </think>, <|User|>, <|Assistant|>, <|EOT|>, FIM tokens, tool calling tokens
  • 54 Splintr agent tokens (128900-128953)
  • Full encode/decode roundtrip support for all UTF-8 text including Chinese, emoji, etc.

ByteLevel Streaming Decoder

  • New ByteLevelStreamingDecoder for token-by-token LLM output decoding
  • Two-stage decoding: ByteLevel → raw bytes → UTF-8 assembly
  • Handles incomplete UTF-8 sequences across token boundaries
  • Available via tokenizer.byte_level_streaming_decoder() in Python

Documentation

  • New /docs/api_guide.md with comprehensive API reference and examples
  • New /docs/bytelevel_bpe.md explaining ByteLevel encoding
  • Updated /docs/special_tokens.md with DeepSeek V3 tokens
  • Streamlined README with links to detailed docs

Add complete DeepSeek V3 tokenizer implementation with 128,000 token
vocabulary and 71 special tokens (17 native + 54 agent). DeepSeek V3
uses ByteLevel BPE encoding, requiring new infrastructure for handling
arbitrary byte sequences through printable Unicode characters.

DeepSeek V3 features:
- Native thinking tokens (<think>, </think>) for CoT reasoning
- Native role tokens (<|User|>, <|Assistant|>, <|EOT|>)
- FIM tokens for code completion (<|fim▁hole|>, etc.)
- Tool calling tokens for function execution
- Compatible with DeepSeek V3 and DeepSeek R1 models

ByteLevel BPE infrastructure:
- Implement encoder/decoder with bijective byte-to-char mapping
- Extend Tokenizer with use_byte_level flag and constructors
- Preserve printable ASCII/Latin-1 bytes as-is; map every other byte (control chars, space, non-printable bytes) to U+0100 and above
- Add comprehensive documentation in docs/bytelevel_bpe.md
- Initialize lookup tables lazily so the mapping is built once, on first use
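
For readers unfamiliar with the mapping, here is a minimal Python sketch of the GPT-2 style byte-to-character table that the list above describes. It is an illustration only, not the Rust implementation from this PR; the names `bytes_to_unicode`, `byte_level_encode`, and `byte_level_decode` are hypothetical.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build a GPT-2 style table mapping each byte value to one printable character."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))   # printable ASCII without space
        + list(range(0xA1, 0xAC + 1))         # Latin-1, first printable run
        + list(range(0xAE, 0xFF + 1))         # Latin-1, second printable run
    )
    table = {b: chr(b) for b in keep}
    # Every other byte (control chars, space, DEL, 0x80-0xA0, 0xAD) is
    # shifted to U+0100 and above, in ascending byte order.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(0x100 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
# The mapping is bijective, so it can be inverted directly.
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level_encode(text: str) -> str:
    """UTF-8 encode the text, then replace each byte with its mapped character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def byte_level_decode(mapped: str) -> str:
    """Map each character back to its byte, then decode the bytes as UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in mapped).decode("utf-8")


if __name__ == "__main__":
    sample = "你好 🚀\n"
    mapped = byte_level_encode(sample)
    print(mapped)
    print(byte_level_decode(mapped) == sample)
```

Under this table a space (byte 0x20) becomes `Ġ` (U+0120), which is why ByteLevel vocabularies show tokens such as `Ġthe`.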

Additional improvements:
- Add "Exact Token ID Tests" to all tokenizers (cl100k, o200k, llama3)
- Verify specific token IDs to prevent encoding/vocabulary regressions
- Test cases cover: basic text, Chinese, emojis with expected IDs
- Vocabulary conversion script: scripts/convert_deepseek_vocab.py
- Update README and docs/special_tokens.md with DeepSeek V3 info

Add ByteLevelStreamingDecoder to handle streaming decode for tokenizers
using ByteLevel BPE encoding (DeepSeek V3, GPT-2). The decoder performs
two-stage decoding: first converts ByteLevel-encoded token bytes back to
raw bytes, then assembles them into valid UTF-8 strings.
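
The two stages can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the Rust code in this PR: the class name `SketchByteLevelStreamingDecoder`, the toy vocabulary, and the return types are hypothetical, and only the method names (`add_token`, `flush`, `reset`) echo the API listed in this PR. Stage 2 uses the standard library's incremental UTF-8 decoder to hold incomplete sequences until the remaining bytes arrive.

```python
import codecs


def _char_to_byte_table() -> dict[str, int]:
    """Inverse of the GPT-2 style byte-to-character table."""
    keep = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table, offset = {}, 0
    for b in range(256):
        if b in keep:
            table[chr(b)] = b
        else:
            table[chr(0x100 + offset)] = b
            offset += 1
    return table


class SketchByteLevelStreamingDecoder:
    """Hypothetical sketch: ByteLevel token text -> raw bytes -> UTF-8 text."""

    def __init__(self, vocab: dict[int, str]):
        self._vocab = vocab  # token id -> ByteLevel-encoded token text
        self._char_to_byte = _char_to_byte_table()
        self._utf8 = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def add_token(self, token_id: int) -> str:
        # Stage 1: undo the ByteLevel mapping to recover the raw bytes.
        raw = bytes(self._char_to_byte[c] for c in self._vocab[token_id])
        # Stage 2: emit only complete UTF-8 characters; buffer the rest.
        return self._utf8.decode(raw)

    def flush(self) -> str:
        # End of stream: any incomplete sequence is emitted as U+FFFD.
        return self._utf8.decode(b"", final=True)

    def reset(self) -> None:
        self._utf8.reset()


# Toy vocabulary (assumption): the rocket emoji's 4 bytes split over two tokens.
_BYTE_TO_CHAR = {b: c for c, b in _char_to_byte_table().items()}


def _tok(raw: bytes) -> str:
    return "".join(_BYTE_TO_CHAR[b] for b in raw)


TOY_VOCAB = {0: _tok(b"Hi"), 1: _tok(b" "), 2: _tok(b"\xf0\x9f"), 3: _tok(b"\x9a\x80")}

if __name__ == "__main__":
    decoder = SketchByteLevelStreamingDecoder(TOY_VOCAB)
    for token_id in (0, 1, 2, 3):
        # Token 2 yields an empty string; token 3 completes the emoji.
        print(repr(decoder.add_token(token_id)))
    print(repr(decoder.flush()))
```

Returning an empty string for a token that ends mid-character keeps the caller's loop simple: it can append every return value to the output without checking for partial sequences.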

Changes:
- Implement ByteLevelStreamingDecoder in Rust core with UTF-8 buffering
- Add Python bindings with full API (add_token, add_tokens, flush, reset)
- Export from all module levels (src/core, src/lib, src/python, python)
- Add comprehensive test suite covering ASCII, Unicode, emoji, special tokens
- Document streaming decoder usage in bytelevel_bpe.md with examples

Create comprehensive API guide and streamline README for better
maintainability. Move detailed API documentation, usage examples,
and performance tips to a dedicated guide.

Changes:
- Add docs/api_guide.md with complete Python and Rust API reference
- Include detailed examples for encoding, decoding, and streaming
- Document both StreamingDecoder and ByteLevelStreamingDecoder usage
- Add performance optimization tips and best practices
- Streamline README to focus on quick start and key features
- Replace verbose inline docs with links to API guide

@farhan-syah farhan-syah force-pushed the feat/add-deepseek-vocab branch from ad47a9c to b1761be Compare November 26, 2025 19:58
@farhan-syah farhan-syah merged commit fd98ef3 into main Nov 26, 2025
5 checks passed
@farhan-syah farhan-syah deleted the feat/add-deepseek-vocab branch December 2, 2025 17:50