Conversation

@farhan-syah
Collaborator

feat: add encode_rayon method for very large text parallelization
chore: add benchmark scripts and performance visualization images
docs: add comprehensive performance benchmarks and design rationale
chore: bump version to 0.3.0

Add encode_rayon() method to enable Rayon-based parallel encoding for texts
larger than 1MB. This complements the existing encode() method, which uses
sequential processing optimized for typical text sizes.

Key changes:
- Refactor encode() to use sequential processing instead of Rayon
- Add encode_rayon() for parallel encoding within a single text
- Expose encode_rayon() to Python bindings
- Add comprehensive documentation explaining when to use each method
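
As a rough usage sketch (the method names come from this PR, but the import
path and constructor below are assumptions and may differ from splintr's
actual loading API):

```python
# Hedged sketch: encode()/encode_rayon() are the methods added in this PR;
# the Tokenizer import path, constructor, and vocabulary name are assumptions.
from splintr import Tokenizer  # assumed import path

tokenizer = Tokenizer("cl100k_base")  # assumed loading API

# Typical text: sequential encode() is the fast path below ~1MB.
tokens = tokenizer.encode("Hello, world!")

# Very large single text: encode_rayon() parallelizes within the text.
with open("large_document.txt", encoding="utf-8") as f:
    big_text = f.read()
tokens = tokenizer.encode_rayon(big_text)
```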

Performance characteristics:
- encode(): ~50 MB/s sequential (optimal for texts <1MB)
- encode_rayon(): ~47 MB/s parallel (beneficial only for texts >1MB)
- encode_batch(): ~110 MB/s (parallelizes across multiple texts)

Benchmarks show Rayon overhead is significant for typical text sizes. Sequential
processing is 2-12x faster for texts under 500KB. This change ensures optimal
performance for the most common use cases while providing a parallel option
for very large single texts.
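
A caller that handles mixed input sizes could route on the 1MB threshold
reported above; this is a hypothetical wrapper, not part of the library:

```python
# Hypothetical dispatch helper built on the methods added in this PR.
# The 1MB threshold mirrors the benchmark results; tune it for your workload.
RAYON_THRESHOLD_BYTES = 1_000_000

def encode_auto(tokenizer, text: str) -> list[int]:
    """Route to parallel encoding only when the input is large enough to
    amortize Rayon's overhead; otherwise use the sequential fast path."""
    if len(text.encode("utf-8")) >= RAYON_THRESHOLD_BYTES:
        return tokenizer.encode_rayon(text)
    return tokenizer.encode(text)
```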
Add comprehensive benchmark suite comparing splintr against tiktoken,
Hugging Face tokenizers, and TokenDagger across different workloads.

Benchmark scripts:
- benchmark_splintr.py: Internal comparison of sequential vs Rayon encoding
- benchmark_single.py: Single-text encoding vs other tokenizers
- benchmark_batch.py: Batch encoding throughput comparison
- compare_tokenizers.py: Multi-tokenizer comparative analysis
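
A minimal timing harness in the spirit of benchmark_splintr.py might look
like the following (a sketch, not the actual script; `tokenizer` is any
object exposing the encode()/encode_rayon() methods):

```python
# Sketch of a sequential-vs-Rayon throughput measurement, assuming a
# tokenizer with the encode()/encode_rayon() methods from this PR.
import time

def throughput_mb_s(encode_fn, text: str, runs: int = 5) -> float:
    """Average encoding throughput in MB/s over several runs."""
    size_mb = len(text.encode("utf-8")) / 1e6
    start = time.perf_counter()
    for _ in range(runs):
        encode_fn(text)
    elapsed = time.perf_counter() - start
    return size_mb * runs / elapsed

# Example sweep over text sizes (uncomment with a real tokenizer):
# for size in (64_000, 500_000, 2_000_000):
#     text = "lorem ipsum " * (size // 12)
#     print(size, throughput_mb_s(tokenizer.encode, text),
#           throughput_mb_s(tokenizer.encode_rayon, text))
```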

Performance visualizations:
- benchmark_splintr.png: Sequential vs Rayon performance by text size
- benchmark_single.png: Throughput comparison across tokenizers
- benchmark_single_latency.png: Latency comparison by content type
- benchmark_batch.png: Batch throughput by configuration
- benchmark_batch_speedup.png: Speedup factors vs tiktoken

These benchmarks demonstrate splintr's 3-4x speedup on single texts and
10-12x speedup on batch operations compared to tiktoken.

Update README with extensive benchmark data and an explanation of the
sequential-vs-parallel encoding design decision. Include performance
visualizations and expand the API documentation.

Changes:
- Add Performance section with benchmark charts and data tables
- Add "Design Decision: Sequential by Default" section explaining why
  encode() uses sequential processing instead of Rayon parallelization
- Expand Python and Rust API documentation to include encode_rayon()
- Rename "Supported Models" to "Supported Vocabularies" for clarity
- Add Citation section with BibTeX format
- Update benchmark result tables with latest data

Performance highlights:
- 3-4x faster than tiktoken for single-text encoding
- 10-12x faster than tiktoken for batch encoding
- Sequential encoding optimal for texts <1MB (typical use case)
- Rayon parallelization only beneficial for texts >1MB

The documentation now clearly explains when to use encode(), encode_rayon(),
and encode_batch() for different workloads.
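
For many-texts workloads, encode_batch() is the parallel path; a hedged
sketch (the exact signature and return type are assumptions):

```python
# encode_batch() is named in this PR; the list-in/list-out shape shown
# here is an assumption about its signature.
docs = ["first document", "second document", "third document"]
token_lists = tokenizer.encode_batch(docs)  # parallelizes across texts
```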
@farhan-syah merged commit 03d8942 into main on Nov 26, 2025
5 checks passed
@farhan-syah deleted the feat/encode-rayon-benchmarks branch on November 26, 2025 at 07:40