@farhan-syah

Summary

  • Add complete Llama 3 family tokenizer support (3.0, 3.1, 3.2, 3.3)
  • Include all official Meta special tokens (128000-128010, 128256)
  • Add 54 splintr agent tokens (128300-128353) for chat, reasoning, tool use
  • Align <|image|> token with official Meta Llama 3.2-Vision ID (128256)
  • Add comprehensive test coverage for all tokenizers (Rust + Python)
  • Update documentation with Llama 3 usage examples and token references

Changes

  • Implementation: Added LLAMA3_PATTERN, bundled vocabulary, LLAMA3_AGENT_TOKENS Python API
  • Tests: 53 Rust tests, 91 Python tests covering all three tokenizers
  • Docs: Updated README.md and docs/special_tokens.md with Llama 3 documentation
  • Version: Bumped to 0.4.0

Add comprehensive support for Meta's Llama 3 family (3.0 through 3.3),
including all official special tokens and extended agent vocabulary.

Implementation:
- Add LLAMA3_PATTERN constant and llama3.tiktoken vocabulary
- Support "llama3", "llama3.1", "llama3.2", "llama3.3" in from_pretrained()
- Add all Meta standard tokens (begin_of_text, eot_id, python_tag, etc.)
- Implement 54 agent tokens starting at 128300 (avoiding Meta's reserved range)
- Align <|image|> token with official Meta 3.2-Vision ID (128256)
- Export LLAMA3_AGENT_TOKENS Python API with all token constants
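The token ID layout above can be sanity-checked with plain arithmetic. The IDs below are taken from this PR's description (Meta's specials at 128000-128010 plus `<|image|>` at 128256, splintr agent tokens at 128300-128353); the constant names are illustrative, not splintr's API:

```python
# Token ID ranges from the PR description; names here are illustrative.
META_SPECIALS = range(128000, 128011)   # begin_of_text .. python_tag
IMAGE_TOKEN_ID = 128256                 # <|image|> in Llama 3.2-Vision
AGENT_TOKEN_START = 128300              # first splintr agent token
AGENT_TOKEN_COUNT = 54

agent_ids = range(AGENT_TOKEN_START, AGENT_TOKEN_START + AGENT_TOKEN_COUNT)

# 54 tokens starting at 128300 end at 128353, as the summary states.
assert agent_ids[-1] == 128353

# Agent tokens must not collide with Meta's reserved range (128000-128255)
# or the vision <|image|> token.
assert agent_ids[0] > 128255
assert IMAGE_TOKEN_ID not in agent_ids
```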

Documentation:
- Add Llama 3 to supported vocabularies table
- Document Meta's official tokens and chat format
- Add comprehensive special_tokens.md section for Llama 3
- Include Python usage examples for both OpenAI and Llama 3 models
- Document version-specific tokens (3.1 tool use, 3.2 vision)
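The chat format referenced above follows Meta's published Llama 3 prompt template, where each turn is wrapped in header tokens and terminated with `<|eot_id|>`. A minimal sketch built as a plain string (the helper is hypothetical, not part of splintr's API):

```python
# Sketch of Meta's Llama 3 chat format; format_llama3_turn is an
# illustrative helper, not a splintr function.
def format_llama3_turn(role: str, content: str) -> str:
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

prompt = (
    "<|begin_of_text|>"
    + format_llama3_turn("system", "You are a helpful assistant.")
    + format_llama3_turn("user", "Hello!")
    # The generation prompt ends with an open assistant header.
    + "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

assert prompt.startswith("<|begin_of_text|>")
assert prompt.count("<|eot_id|>") == 2
```

When such a prompt is encoded, the special-token strings must map to their single reserved IDs rather than being split by the BPE pattern.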

This maintains full compatibility with the Llama 3 token allocation while
extending the vocabulary with splintr's agent tokens for chat, reasoning,
tool calling, and RAG applications.

Add test coverage for cl100k, o200k, and llama3 tokenizers in both
Rust and Python implementations.

Rust tests (tests/):
- Encode/decode roundtrip validation
- Special token handling
- Batch encoding operations
- Streaming decoder functionality
- Edge case handling
- Agent token constant validation
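The streaming decoder tests guard against a classic pitfall: a token boundary can fall in the middle of a multi-byte UTF-8 character. A minimal sketch of the buffering idea using Python's standard library (not splintr's actual implementation):

```python
import codecs

# Incremental UTF-8 decoding: bytes arriving token-by-token may split a
# multi-byte character; the incremental decoder buffers partial bytes
# until the character is complete.
decoder = codecs.getincrementaldecoder("utf-8")()

# "é" is two bytes (0xC3 0xA9); feed them in separate chunks.
chunks = [b"caf", b"\xc3", b"\xa9", b"!"]
out = "".join(decoder.decode(chunk) for chunk in chunks)
out += decoder.decode(b"", final=True)

assert out == "café!"
```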

Python tests (python/tests/):
- Mirror Rust test coverage
- PyO3 binding validation
- Cross-language consistency checks
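The roundtrip and cross-language checks reduce to one property: `decode(encode(s)) == s` under the same vocabulary on both sides of the binding. A toy byte-level illustration of the property being tested (hypothetical helper names, not splintr's API):

```python
# Toy byte-level "tokenizer": each UTF-8 byte is its own token ID.
# Real BPE merges bytes into larger tokens, but the roundtrip property
# under test is identical.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

for sample in ["hello", "café", "日本語", ""]:
    assert decode(encode(sample)) == sample
```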

Also includes uv.lock for Python dependency management.

Update the version across all configuration files, and enhance the pre-commit
hook to automatically refresh uv.lock when pyproject.toml changes.
@farhan-syah farhan-syah merged commit 09b83e0 into main Nov 26, 2025
5 checks passed
@farhan-syah farhan-syah deleted the feat/add-llama3-support branch November 26, 2025 12:25