Context
This issue tracks enhancements based on feedback from ReturningTarzan on Reddit:
> The main thing about tokenization is always correctness. Throughput is nice but secondary.
While splintr is already fast and functional, these enhancements will ensure production-grade correctness and robustness.
Status Summary
Already Implemented ✅
| Feature | Status | Reference |
|---|---|---|
| Control tokens (multiple BPE runs) | ✅ Implemented | src/core/tokenizer.rs:863-899 - encode_with_special() splits at special tokens and runs BPE on each segment separately |
| Decoding to byte strings | ✅ Implemented | src/core/tokenizer.rs:905 - decode_bytes(&[u32]) -> Vec<u8> |
| Buffer/queue for incomplete chars | ✅ Implemented | src/core/streaming.rs - StreamingDecoder and ByteLevelStreamingDecoder (commit 9e45c14); the underlying technique is sketched below the table |
| Basic test coverage | ✅ Implemented | ~1,537 lines in tests/ covering 4 tokenizers (commit d6d836f) |
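For context, the "buffer/queue for incomplete chars" row refers to the standard technique of holding back trailing bytes that do not yet form a complete UTF-8 code point. The following is a minimal, standalone sketch of that technique only; it is not splintr's actual `StreamingDecoder`, and the struct and method names are illustrative:

```rust
/// Minimal sketch of incomplete-character buffering for streaming decode.
/// Token bytes are appended to a pending buffer and only the longest valid
/// UTF-8 prefix is emitted; a trailing partial code point stays buffered
/// until later tokens complete it.
struct Utf8Buffer {
    pending: Vec<u8>,
}

impl Utf8Buffer {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Append raw token bytes (e.g. the output of decode_bytes for one token)
    /// and return whatever decodes cleanly so far.
    fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                // Everything buffered is valid UTF-8: emit it all.
                let out = s.to_owned();
                self.pending.clear();
                out
            }
            Err(e) => {
                // Emit the valid prefix; keep the (possibly incomplete) tail.
                // A production decoder would also use e.error_len() to detect
                // genuinely invalid bytes rather than waiting forever.
                let valid = e.valid_up_to();
                let out = String::from_utf8_lossy(&self.pending[..valid]).into_owned();
                self.pending.drain(..valid);
                out
            }
        }
    }
}
```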
Proposed Enhancements
1. Comprehensive Correctness Testing
Priority: HIGH | Complexity: MEDIUM
Current State: Integration tests cover basic functionality and exact token IDs (~1,537 lines).
Gaps:
- Add correctness validation suite comparing against tiktoken/HuggingFace on large text corpus
- Add property-based testing (fuzzing) with `cargo-fuzz` or `proptest` (see the proptest sketch at the end of this section)
- Add edge case tests: malformed UTF-8, empty strings, very long inputs, pathological BPE cases
- Document test coverage metrics and correctness validation methodology
- Add "silent degradation" detection: tests that catch minor tokenization differences
Why it matters: LLMs are robust to small differences, but those differences can accumulate and silently degrade performance over large datasets.
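As a concrete starting point for the property-based testing item, here is a minimal proptest sketch of a byte-level round-trip property. Only `decode_bytes(&[u32]) -> Vec<u8>` is taken from the documented API; the `Tokenizer::from_pretrained` constructor and `encode` method are placeholders for whatever splintr actually exposes:

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn encode_decode_roundtrip(text in "\\PC*") {
        // Hypothetical loader and encode method; substitute splintr's real API.
        // In practice, build the tokenizer once (e.g. via once_cell) instead
        // of once per generated case.
        let tokenizer = splintr::Tokenizer::from_pretrained("cl100k_base").unwrap();
        let ids = tokenizer.encode(&text);
        // decode_bytes is the documented byte-level decoder; decoding the
        // encoded ids must reproduce the original input bytes exactly.
        let bytes = tokenizer.decode_bytes(&ids);
        prop_assert_eq!(bytes, text.as_bytes());
    }
}
```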
2. Trimming/Padding/Normalization for Added Tokens
Priority: MEDIUM | Complexity: LOW-MEDIUM
Current State: No normalization layer exists.
Tasks:
- Research normalization requirements for tiktoken/HuggingFace implementations
- Determine if normalization (NFKC, whitespace trimming, etc.) is needed for correctness
- If needed: Add an optional `Normalizer` trait and implementations (one possible shape is sketched at the end of this section)
- Document normalization behavior (or lack thereof) in API docs
- Add tests for special token boundary cases (leading/trailing whitespace, etc.)
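If the research concludes a normalization layer is warranted, one possible shape for the trait is sketched below. This is a design sketch only, not an existing splintr API; the NFKC implementation assumes the `unicode-normalization` crate as a dependency:

```rust
use std::borrow::Cow;

/// One possible shape for an optional normalization hook applied to input
/// text (or to added/special token definitions) before encoding.
pub trait Normalizer: Send + Sync {
    /// Return the normalized text; `Cow` lets implementations avoid
    /// allocating when no change is needed.
    fn normalize<'a>(&self, text: &'a str) -> Cow<'a, str>;
}

/// Whitespace trimming, e.g. for added-token definitions.
pub struct TrimWhitespace;

impl Normalizer for TrimWhitespace {
    fn normalize<'a>(&self, text: &'a str) -> Cow<'a, str> {
        Cow::Borrowed(text.trim())
    }
}

/// NFKC normalization via the unicode-normalization crate.
pub struct Nfkc;

impl Normalizer for Nfkc {
    fn normalize<'a>(&self, text: &'a str) -> Cow<'a, str> {
        use unicode_normalization::UnicodeNormalization;
        Cow::Owned(text.nfkc().collect())
    }
}
```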
3. Preprocessing/Postprocessing Validation
Priority: HIGH | Complexity: LOW
Current State: The tokenizer already has PCRE2 regex splitting, ByteLevel preprocessing, and Aho-Corasick matching, but no formal validation against reference implementations.
Tasks:
- Add validation tests comparing regex split output to tiktoken for diverse examples (a fixture-based approach is sketched after this list)
- Validate ByteLevel encode/decode against HuggingFace tokenizers
- Test preprocessing edge cases: emoji, rare unicode, control characters
- Document preprocessing/postprocessing guarantees in API docs
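One low-friction way to tackle the first two tasks is a fixture-based comparison: generate a file of (text, expected ids) pairs offline with the Python tiktoken package, commit it under tests/, and assert that splintr reproduces the ids exactly. A hedged sketch follows; the fixture path, JSON shape, `from_pretrained` constructor, and `encode` method are all assumptions, and serde/serde_json would be dev-dependencies:

```rust
use serde::Deserialize;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// One reference case: input text plus the token ids tiktoken produced for it.
#[derive(Deserialize)]
struct Case {
    text: String,
    ids: Vec<u32>,
}

#[test]
fn matches_tiktoken_reference() {
    // Hypothetical fixture generated offline with the Python tiktoken package,
    // one JSON object per line.
    let file = File::open("tests/fixtures/tiktoken_cl100k.jsonl").unwrap();
    // Hypothetical constructor and encode method; substitute splintr's real API.
    let tokenizer = splintr::Tokenizer::from_pretrained("cl100k_base").unwrap();
    for line in BufReader::new(file).lines() {
        let case: Case = serde_json::from_str(&line.unwrap()).unwrap();
        let ids = tokenizer.encode(&case.text);
        assert_eq!(ids, case.ids, "tokenization mismatch for {:?}", case.text);
    }
}
```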
4. Ergonomic API Improvements for Single Token Decoding
Priority: LOW | Complexity: LOW
Current State: decode_bytes(&[u32]) -> Vec<u8> and streaming decoders work well, but single-token convenience methods could be added.
Tasks:
- Consider adding a `decode_token_bytes(u32) -> Option<Vec<u8>>` convenience method
- Consider adding a `decode_token(u32) -> Option<String>` method (both are sketched below)
- Document streaming and single-token decode patterns more prominently in the API guide
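A standalone sketch of what the two convenience methods could look like; the `Vocab` struct below merely stands in for the tokenizer's internal id-to-bytes table, and only the method signatures mirror the proposal:

```rust
use std::collections::HashMap;

/// Stand-in for the tokenizer's internal id -> bytes table; splintr's real
/// internals may differ.
struct Vocab {
    token_bytes: HashMap<u32, Vec<u8>>,
}

impl Vocab {
    /// Raw bytes for a single token id, or None for an unknown id.
    fn decode_token_bytes(&self, id: u32) -> Option<Vec<u8>> {
        self.token_bytes.get(&id).cloned()
    }

    /// UTF-8 string for a single token, or None if the id is unknown or the
    /// token's bytes are not valid UTF-8 on their own (e.g. half of a
    /// multi-byte code point), in which case callers fall back to
    /// decode_token_bytes or a streaming decoder.
    fn decode_token(&self, id: u32) -> Option<String> {
        self.decode_token_bytes(id)
            .and_then(|bytes| String::from_utf8(bytes).ok())
    }
}
```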
Success Criteria
- Correctness validated against reference implementations (tiktoken/HuggingFace) on large sample sets
- Zero tokenization differences from reference implementations for standard inputs
- Property-based testing catches edge cases automatically
- All preprocessing/postprocessing validated and documented
Resources
- Original Reddit feedback: https://www.reddit.com/r/LocalLLaMA/comments/1p71luf/comment/nqv7upz/
- Tiktoken (reference): https://github.com/openai/tiktoken
- HuggingFace tokenizers: https://github.com/huggingface/tokenizers