Conversation

@farhan-syah
Collaborator

feat: add encode_rayon method for very large text parallelization
chore: add benchmark scripts and performance visualization images
docs: add comprehensive performance benchmarks and design rationale
chore: bump version to 0.3.0

Add encode_rayon() method to enable Rayon-based parallel encoding for texts
larger than 1MB. This complements the existing encode() method, which uses
sequential processing optimized for typical text sizes.

Key changes:
- Refactor encode() to use sequential processing instead of Rayon
- Add encode_rayon() for parallel encoding within a single text
- Expose encode_rayon() to Python bindings
- Add comprehensive documentation explaining when to use each method
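
As a rough usage sketch (the method names come from this PR, but the import
path and constructor below are assumptions and may differ from splintr's
actual loading API):

```python
# Hedged sketch: encode()/encode_rayon() are the methods added in this PR;
# the Tokenizer import path, constructor, and vocabulary name are assumptions.
from splintr import Tokenizer  # assumed import path

tokenizer = Tokenizer("cl100k_base")  # assumed loading API

# Typical text: sequential encode() is the fast path below ~1MB.
tokens = tokenizer.encode("Hello, world!")

# Very large single text: encode_rayon() parallelizes within the text.
with open("large_document.txt", encoding="utf-8") as f:
    big_text = f.read()
tokens = tokenizer.encode_rayon(big_text)
```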

Performance characteristics:
- encode(): ~50 MB/s sequential (optimal for texts <1MB)
- encode_rayon(): ~47 MB/s parallel (beneficial only for texts >1MB)
- encode_batch(): ~110 MB/s (parallelizes across multiple texts)

Benchmarks show Rayon overhead is significant for typical text sizes. Sequential
processing is 2-12x faster for texts under 500KB. This change ensures optimal
performance for the most common use cases while providing a parallel option
for very large single texts.
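
A caller that handles mixed input sizes could route on the 1MB threshold
reported above; this is a hypothetical wrapper, not part of the library:

```python
# Hypothetical dispatch helper built on the methods added in this PR.
# The 1MB threshold mirrors the benchmark results; tune it for your workload.
RAYON_THRESHOLD_BYTES = 1_000_000

def encode_auto(tokenizer, text: str) -> list[int]:
    """Route to parallel encoding only when the input is large enough to
    amortize Rayon's overhead; otherwise use the sequential fast path."""
    if len(text.encode("utf-8")) >= RAYON_THRESHOLD_BYTES:
        return tokenizer.encode_rayon(text)
    return tokenizer.encode(text)
```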
Add comprehensive benchmark suite comparing splintr against tiktoken,
Hugging Face tokenizers, and TokenDagger across different workloads.

Benchmark scripts:
- benchmark_splintr.py: Internal comparison of sequential vs Rayon encoding
- benchmark_single.py: Single-text encoding vs other tokenizers
- benchmark_batch.py: Batch encoding throughput comparison
- compare_tokenizers.py: Multi-tokenizer comparative analysis
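
A minimal timing harness in the spirit of benchmark_splintr.py might look
like the following (a sketch, not the actual script; `tokenizer` is any
object exposing the encode()/encode_rayon() methods):

```python
# Sketch of a sequential-vs-Rayon throughput measurement, assuming a
# tokenizer with the encode()/encode_rayon() methods from this PR.
import time

def throughput_mb_s(encode_fn, text: str, runs: int = 5) -> float:
    """Average encoding throughput in MB/s over several runs."""
    size_mb = len(text.encode("utf-8")) / 1e6
    start = time.perf_counter()
    for _ in range(runs):
        encode_fn(text)
    elapsed = time.perf_counter() - start
    return size_mb * runs / elapsed

# Example sweep over text sizes (uncomment with a real tokenizer):
# for size in (64_000, 500_000, 2_000_000):
#     text = "lorem ipsum " * (size // 12)
#     print(size, throughput_mb_s(tokenizer.encode, text),
#           throughput_mb_s(tokenizer.encode_rayon, text))
```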

Performance visualizations:
- benchmark_splintr.png: Sequential vs Rayon performance by text size
- benchmark_single.png: Throughput comparison across tokenizers
- benchmark_single_latency.png: Latency comparison by content type
- benchmark_batch.png: Batch throughput by configuration
- benchmark_batch_speedup.png: Speedup factors vs tiktoken

These benchmarks demonstrate splintr's 3-4x speedup on single texts and
10-12x speedup on batch operations compared to tiktoken.

Update README with extensive benchmark data and an explanation of the
sequential-vs-parallel encoding design decision. Include performance
visualizations and expand the API documentation.

Changes:
- Add Performance section with benchmark charts and data tables
- Add "Design Decision: Sequential by Default" section explaining why
  encode() uses sequential processing instead of Rayon parallelization
- Expand Python and Rust API documentation to include encode_rayon()
- Rename "Supported Models" to "Supported Vocabularies" for clarity
- Add Citation section with BibTeX format
- Update benchmark result tables with latest data

Performance highlights:
- 3-4x faster than tiktoken for single-text encoding
- 10-12x faster than tiktoken for batch encoding
- Sequential encoding optimal for texts <1MB (typical use case)
- Rayon parallelization only beneficial for texts >1MB

The documentation now clearly explains when to use encode(), encode_rayon(),
and encode_batch() for different workloads.
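
For many-texts workloads, encode_batch() is the parallel path; a hedged
sketch (the exact signature and return type are assumptions):

```python
# encode_batch() is named in this PR; the list-in/list-out shape shown
# here is an assumption about its signature.
docs = ["first document", "second document", "third document"]
token_lists = tokenizer.encode_batch(docs)  # parallelizes across texts
```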
@farhan-syah merged commit 03d8942 into main on Nov 26, 2025
5 checks passed
@farhan-syah deleted the feat/encode-rayon-benchmarks branch on November 26, 2025 at 07:40