
[Enhancement] Tokenization Correctness & Robustness Improvements #7

@farhan-syah

Description


Context

This issue tracks enhancements based on feedback from ReturningTarzan on Reddit:

> The main thing about tokenization is always correctness. Throughput is nice but secondary.

While splintr is already fast and functional, these enhancements will ensure production-grade correctness and robustness.


Status Summary

Already Implemented ✅

| Feature | Status | Reference |
| --- | --- | --- |
| Control tokens (multiple BPE runs) | ✅ Implemented | `src/core/tokenizer.rs:863-899` - `encode_with_special()` splits at special tokens and runs BPE on each segment separately |
| Decoding to byte strings | ✅ Implemented | `src/core/tokenizer.rs:905` - `decode_bytes(&[u32]) -> Vec<u8>` |
| Buffer/queue for incomplete chars | ✅ Implemented | `src/core/streaming.rs` - `StreamingDecoder` and `ByteLevelStreamingDecoder` (commit 9e45c14) |
| Basic test coverage | ✅ Implemented | ~1,537 lines in `tests/` covering 4 tokenizers (commit d6d836f) |
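
For orientation, here is a minimal usage sketch of the implemented surface. `encode_with_special()`, `decode_bytes()`, and `StreamingDecoder` come from the references above; the `Tokenizer` type name, the constructor, and the streaming `push()` method are assumptions for illustration, not the confirmed API.

```rust
// Sketch only: constructor and streaming method names are assumed.
use splintr::{StreamingDecoder, Tokenizer};

fn demo(tokenizer: &Tokenizer) {
    // encode_with_special() splits at special tokens and runs BPE
    // on each plain-text segment separately.
    let ids: Vec<u32> = tokenizer.encode_with_special("<|user|>Hello, world!");

    // decode_bytes() is lossless at the byte level.
    let bytes: Vec<u8> = tokenizer.decode_bytes(&ids);
    assert_eq!(bytes, "<|user|>Hello, world!".as_bytes());

    // The streaming decoder buffers incomplete UTF-8 sequences across tokens,
    // emitting text only once a full character is available.
    let mut decoder = StreamingDecoder::new(tokenizer);
    for &id in &ids {
        if let Some(text) = decoder.push(id) {
            print!("{text}");
        }
    }
}
```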

Proposed Enhancements

1. Comprehensive Correctness Testing

Priority: HIGH | Complexity: MEDIUM

Current State: Integration tests cover basic functionality and exact token IDs (~1,537 lines).

Tasks:

  • Add a correctness validation suite comparing output against tiktoken/HuggingFace on a large text corpus
  • Add property-based testing (fuzzing) with cargo-fuzz or proptest (see the sketch at the end of this section)
  • Add edge case tests: malformed UTF-8, empty strings, very long inputs, pathological BPE cases
  • Document test coverage metrics and the correctness validation methodology
  • Add "silent degradation" detection: tests that catch minor tokenization differences

Why it matters: LLMs are robust to small differences, but those differences can accumulate and silently degrade performance over large datasets.
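
One possible shape for the property-based round-trip test, using proptest. The `Tokenizer` type, the `cl100k_base()` constructor, and `encode()` are assumptions about splintr's API; `decode_bytes()` matches the table above.

```rust
// Sketch of a round-trip property, assuming `encode(&str) -> Vec<u32>`
// and `decode_bytes(&[u32]) -> Vec<u8>` are available.
use proptest::prelude::*;
use splintr::Tokenizer;

proptest! {
    #[test]
    fn encode_decode_roundtrip(text in "\\PC*") {
        let tokenizer = Tokenizer::cl100k_base(); // assumed constructor
        let ids = tokenizer.encode(&text);
        let decoded = tokenizer.decode_bytes(&ids);
        // BPE tokenization must be lossless at the byte level for arbitrary input.
        prop_assert_eq!(decoded, text.as_bytes());
    }
}
```

A cargo-fuzz target would follow the same pattern, feeding arbitrary `&[u8]` (including malformed UTF-8) into the encoder instead of proptest-generated strings.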


2. Trimming/Padding/Normalization for Added Tokens

Priority: MEDIUM | Complexity: LOW-MEDIUM

Current State: No normalization layer exists.

Tasks:

  • Research normalization requirements for the tiktoken/HuggingFace implementations
  • Determine whether normalization (NFKC, whitespace trimming, etc.) is needed for correctness
  • If needed: add an optional `Normalizer` trait and implementations (see the sketch after this list)
  • Document normalization behavior (or lack thereof) in the API docs
  • Add tests for special token boundary cases (leading/trailing whitespace, etc.)
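
If normalization turns out to be needed, one possible shape for an optional layer is sketched below. Nothing here exists in splintr today; the trait name, method, and example implementations are illustrative only.

```rust
// Hypothetical optional normalization layer (not part of the current API).
use unicode_normalization::UnicodeNormalization;

pub trait Normalizer: Send + Sync {
    /// Returns the normalized text that will be fed into pre-tokenization/BPE.
    fn normalize(&self, text: &str) -> String;
}

/// NFKC normalization via the `unicode-normalization` crate.
pub struct Nfkc;

impl Normalizer for Nfkc {
    fn normalize(&self, text: &str) -> String {
        text.nfkc().collect()
    }
}

/// Trim leading/trailing whitespace, e.g. around added tokens.
pub struct TrimWhitespace;

impl Normalizer for TrimWhitespace {
    fn normalize(&self, text: &str) -> String {
        text.trim().to_string()
    }
}
```

Keeping the trait optional leaves the default encode path untouched, so existing byte-exact behavior is preserved when no normalizer is configured.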

3. Preprocessing/Postprocessing Validation

Priority: HIGH | Complexity: LOW

Current State: PCRE2-based regex splitting, ByteLevel preprocessing, and Aho-Corasick special-token matching are in place. There is no formal validation against reference implementations.

Tasks:

  • Add validation tests comparing regex split output with tiktoken for diverse examples (see the sketch after this list)
  • Validate ByteLevel encode/decode against HuggingFace tokenizers
  • Test preprocessing edge cases: emoji, rare Unicode, control characters
  • Document preprocessing/postprocessing guarantees in the API docs
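
A sketch of what the reference comparison could look like, assuming fixtures are exported from tiktoken/HuggingFace as JSON pairs of input text and expected token IDs. The fixture path, the `Case` struct, and the `cl100k_base()` constructor are illustrative assumptions.

```rust
// Sketch: compare splintr output against token IDs exported from a
// reference tokenizer (tiktoken / HuggingFace tokenizers).
use serde::Deserialize;
use splintr::Tokenizer;

#[derive(Deserialize)]
struct Case {
    text: String,
    expected_ids: Vec<u32>,
}

#[test]
fn matches_reference_tokenizer() {
    let data = std::fs::read_to_string("tests/fixtures/reference_cl100k.json").unwrap();
    let cases: Vec<Case> = serde_json::from_str(&data).unwrap();
    let tokenizer = Tokenizer::cl100k_base(); // assumed constructor

    for case in cases {
        let ids = tokenizer.encode(&case.text);
        // Report the offending input so small divergences are surfaced
        // instead of silently degrading quality downstream.
        assert_eq!(ids, case.expected_ids, "divergence on input: {:?}", case.text);
    }
}
```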

4. Ergonomic API Improvements for Single Token Decoding

Priority: LOW | Complexity: LOW

Current State: `decode_bytes(&[u32]) -> Vec<u8>` and the streaming decoders work well, but single-token convenience methods could be added.

Tasks:

  • Consider adding a `decode_token_bytes(u32) -> Option<Vec<u8>>` convenience method
  • Consider adding a `decode_token(u32) -> Option<String>` method (both sketched below)
  • Document streaming and single-token decode patterns more prominently in the API guide
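
One possible shape for the convenience methods. This is a sketch only; the `vocab` field is an assumption about splintr's internals, and the existing `decode_bytes(&[id])` path could be reused instead.

```rust
// Hypothetical convenience wrappers around the existing byte-level decode path.
impl Tokenizer {
    /// Raw bytes for a single token ID, or None if the ID is out of vocabulary.
    pub fn decode_token_bytes(&self, id: u32) -> Option<Vec<u8>> {
        self.vocab.get(&id).cloned() // assumed internal id -> bytes lookup
    }

    /// UTF-8 string for a single token ID, or None if the ID is unknown or its
    /// bytes are not valid UTF-8 on their own (e.g. half of a multi-byte char).
    pub fn decode_token(&self, id: u32) -> Option<String> {
        self.decode_token_bytes(id)
            .and_then(|bytes| String::from_utf8(bytes).ok())
    }
}
```

Returning `Option` rather than panicking keeps out-of-vocabulary IDs and partial UTF-8 sequences explicit for callers that decode token-by-token.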

Success Criteria

  • Correctness validated against reference implementations (tiktoken/HuggingFace) on large sample sets
  • Zero tokenization differences from reference implementations for standard inputs
  • Property-based testing catches edge cases automatically
  • All preprocessing/postprocessing validated and documented
