@farhan-syah

Summary

  • Add complete Llama 3 family tokenizer support (3.0, 3.1, 3.2, 3.3)
  • Include all official Meta special tokens (128000-128010, 128256)
  • Add 54 splintr agent tokens (128300-128353) for chat, reasoning, tool use
  • Align <|image|> token with official Meta Llama 3.2-Vision ID (128256)
  • Add comprehensive test coverage for all tokenizers (Rust + Python)
  • Update documentation with Llama 3 usage examples and token references

Changes

  • Implementation: Added LLAMA3_PATTERN, bundled vocabulary, LLAMA3_AGENT_TOKENS Python API
  • Tests: 53 Rust tests, 91 Python tests covering all three tokenizers
  • Docs: Updated README.md and docs/special_tokens.md with Llama 3 documentation
  • Version: Bumped to 0.4.0

Add comprehensive support for Meta's Llama 3 family (3.0 through 3.3),
including all official special tokens and extended agent vocabulary.

Implementation:
- Add LLAMA3_PATTERN constant and llama3.tiktoken vocabulary
- Support "llama3", "llama3.1", "llama3.2", "llama3.3" in from_pretrained()
- Add all Meta standard tokens (begin_of_text, eot_id, python_tag, etc.)
- Implement 54 agent tokens starting at 128300 (avoiding Meta's reserved range)
- Align <|image|> token with official Meta 3.2-Vision ID (128256)
- Export LLAMA3_AGENT_TOKENS Python API with all token constants
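The token ID layout above can be sanity-checked with plain arithmetic. The IDs below are taken from this PR's description (Meta's specials at 128000-128010 plus `<|image|>` at 128256, splintr agent tokens at 128300-128353); the constant names are illustrative, not splintr's API:

```python
# Token ID ranges from the PR description; names here are illustrative.
META_SPECIALS = range(128000, 128011)   # begin_of_text .. python_tag
IMAGE_TOKEN_ID = 128256                 # <|image|> in Llama 3.2-Vision
AGENT_TOKEN_START = 128300              # first splintr agent token
AGENT_TOKEN_COUNT = 54

agent_ids = range(AGENT_TOKEN_START, AGENT_TOKEN_START + AGENT_TOKEN_COUNT)

# 54 tokens starting at 128300 end at 128353, as the summary states.
assert agent_ids[-1] == 128353

# Agent tokens must not collide with Meta's reserved range (128000-128255)
# or the vision <|image|> token.
assert agent_ids[0] > 128255
assert IMAGE_TOKEN_ID not in agent_ids
```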

Documentation:
- Add Llama 3 to supported vocabularies table
- Document Meta's official tokens and chat format
- Add comprehensive special_tokens.md section for Llama 3
- Include Python usage examples for both OpenAI and Llama 3 models
- Document version-specific tokens (3.1 tool use, 3.2 vision)
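The chat format referenced above follows Meta's published Llama 3 prompt template, where each turn is wrapped in header tokens and terminated with `<|eot_id|>`. A minimal sketch built as a plain string (the helper is hypothetical, not part of splintr's API):

```python
# Sketch of Meta's Llama 3 chat format; format_llama3_turn is an
# illustrative helper, not a splintr function.
def format_llama3_turn(role: str, content: str) -> str:
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

prompt = (
    "<|begin_of_text|>"
    + format_llama3_turn("system", "You are a helpful assistant.")
    + format_llama3_turn("user", "Hello!")
    # The generation prompt ends with an open assistant header.
    + "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

assert prompt.startswith("<|begin_of_text|>")
assert prompt.count("<|eot_id|>") == 2
```

When such a prompt is encoded, the special-token strings must map to their single reserved IDs rather than being split by the BPE pattern.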

This maintains full compatibility with the Llama 3 token allocation while
extending the vocabulary with splintr's agent tokens for chat, reasoning,
tool calling, and RAG applications.

Add test coverage for cl100k, o200k, and llama3 tokenizers in both
Rust and Python implementations.

Rust tests (tests/):
- Encode/decode roundtrip validation
- Special token handling
- Batch encoding operations
- Streaming decoder functionality
- Edge case handling
- Agent token constant validation
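The streaming decoder tests guard against a classic pitfall: a token boundary can fall in the middle of a multi-byte UTF-8 character. A minimal sketch of the buffering idea using Python's standard library (not splintr's actual implementation):

```python
import codecs

# Incremental UTF-8 decoding: bytes arriving token-by-token may split a
# multi-byte character; the incremental decoder buffers partial bytes
# until the character is complete.
decoder = codecs.getincrementaldecoder("utf-8")()

# "é" is two bytes (0xC3 0xA9); feed them in separate chunks.
chunks = [b"caf", b"\xc3", b"\xa9", b"!"]
out = "".join(decoder.decode(chunk) for chunk in chunks)
out += decoder.decode(b"", final=True)

assert out == "café!"
```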

Python tests (python/tests/):
- Mirror Rust test coverage
- PyO3 binding validation
- Cross-language consistency checks
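The roundtrip and cross-language checks reduce to one property: `decode(encode(s)) == s` under the same vocabulary on both sides of the binding. A toy byte-level illustration of the property being tested (hypothetical helper names, not splintr's API):

```python
# Toy byte-level "tokenizer": each UTF-8 byte is its own token ID.
# Real BPE merges bytes into larger tokens, but the roundtrip property
# under test is identical.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

for sample in ["hello", "café", "日本語", ""]:
    assert decode(encode(sample)) == sample
```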

Also includes uv.lock for Python dependency management.

Update the version across all configuration files, and enhance the pre-commit
hook to automatically refresh uv.lock when pyproject.toml changes.
@farhan-syah farhan-syah merged commit 09b83e0 into main Nov 26, 2025
5 checks passed
@farhan-syah farhan-syah deleted the feat/add-llama3-support branch November 26, 2025 12:25