
@eureka928
Contributor

Summary

This PR adds token-based chunking support to chunk_by_title() and chunk_elements() using tiktoken, allowing users to specify max_tokens instead of max_characters for better alignment with LLM token limits.

Closes #4127

Changes

New Parameters

| Parameter | Description |
| --- | --- |
| `max_tokens` | Hard maximum chunk token count (mutually exclusive with `max_characters`) |
| `new_after_n_tokens` | Soft maximum: start a new chunk after this many tokens |
| `tokenizer` | Tokenizer name; accepts encoding names (`"cl100k_base"`) or model names (`"gpt-4"`) |

Implementation Details

  • TokenCounter class: Lazy tiktoken integration - only imports tiktoken when token counting is first used
  • Measurement abstraction: Added measure() method to ChunkingOptions that returns chars or tokens based on mode
  • Mutual exclusivity: max_tokens and max_characters cannot be used together
  • Token-based text splitting: New _split_by_tokens() method uses separator preferences with binary search fallback
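The lazy `TokenCounter`, the mutual-exclusivity check, and the `measure()` abstraction described above could look roughly like the following. This is a minimal sketch, not the merged implementation: the class and method names come from this PR description, but the exact signatures and the `ValueError`/fallback handling are assumptions.

```python
# Sketch only: names (TokenCounter, ChunkingOptions, measure, use_token_counting)
# follow the PR description; signatures and error handling are assumptions.


class TokenCounter:
    """Counts tokens with tiktoken, importing it only on first use."""

    def __init__(self, tokenizer: str = "cl100k_base"):
        self._tokenizer_name = tokenizer
        self._encoding = None  # tiktoken is not imported until count() is called

    def count(self, text: str) -> int:
        if self._encoding is None:
            import tiktoken  # lazy import: only needed in token mode

            try:
                # encoding name, e.g. "cl100k_base"
                self._encoding = tiktoken.get_encoding(self._tokenizer_name)
            except ValueError:
                # model name, e.g. "gpt-4"
                self._encoding = tiktoken.encoding_for_model(self._tokenizer_name)
        return len(self._encoding.encode(text))


class ChunkingOptions:
    """Minimal stand-in showing mutual exclusivity and measure()."""

    def __init__(self, max_characters=None, max_tokens=None, tokenizer="cl100k_base"):
        if max_characters is not None and max_tokens is not None:
            raise ValueError("max_characters and max_tokens are mutually exclusive")
        self.max_tokens = max_tokens
        self._counter = TokenCounter(tokenizer) if max_tokens is not None else None

    @property
    def use_token_counting(self) -> bool:
        return self.max_tokens is not None

    def measure(self, text: str) -> int:
        """Size of `text` in the active unit: tokens or characters."""
        return self._counter.count(text) if self.use_token_counting else len(text)
```

Because the import happens inside `count()`, character-based chunking never touches tiktoken, so the dependency stays optional.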

Files Changed

  • requirements/extra-chunking-tokens.in - New tiktoken dependency
  • setup.py - Added chunking-tokens extra
  • unstructured/chunking/base.py - Core token-based chunking logic
  • unstructured/chunking/title.py - Updated chunk_by_title() signature
  • unstructured/chunking/basic.py - Updated chunk_elements() signature
  • test_unstructured/chunking/test_base.py - Unit tests
  • test_unstructured/chunking/test_title.py - Integration tests

Usage

```python
from unstructured.chunking.title import chunk_by_title

# Token-based chunking (new)
chunks = chunk_by_title(
    elements,
    max_tokens=512,
    new_after_n_tokens=400,
    tokenizer="gpt-4",  # or "cl100k_base"
)

# Character-based chunking (unchanged)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
)
```

Installation

To use token-based chunking, install with the new extra:

```shell
pip install "unstructured[chunking-tokens]"
```

Commits

- Add TokenCounter class for lazy tiktoken integration
- Add max_tokens, new_after_n_tokens, tokenizer parameters
- Add use_token_counting property and measure() method
- Update PreChunkBuilder to use measurement abstraction
- Update _TextSplitter with _split_by_tokens() method
- Update _TableChunker to use measure() for size comparison
- Add max_tokens, new_after_n_tokens, tokenizer parameters
- Update docstring with new parameter descriptions
- Add max_tokens, new_after_n_tokens, tokenizer parameters
- Update docstring with new parameter descriptions
- Add TokenCounter unit tests
- Add ChunkingOptions token validation tests
- Add _TextSplitter token mode tests
- Add chunk_by_title integration tests for token-based chunking
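The "Update PreChunkBuilder to use measurement abstraction" commit can be illustrated with a small sketch. `PreChunkBuilder` is the name from the commit list, but the `will_fit`/`add`/`flush` methods and their behavior here are assumptions; `measure` is pluggable so the sketch runs with `len()` in place of a token counter.

```python
# Sketch only: the real builder also accounts for separators, metadata,
# and element boundaries; this shows just the measurement abstraction.


class PreChunkBuilder:
    def __init__(self, maxsize: int, measure=len):
        self._maxsize = maxsize
        self._measure = measure  # len() for chars, TokenCounter.count for tokens
        self._texts = []
        self._size = 0

    def will_fit(self, text: str) -> bool:
        # Separator cost (e.g. "\n\n") is ignored in this sketch.
        return self._size + self._measure(text) <= self._maxsize

    def add(self, text: str) -> None:
        self._texts.append(text)
        self._size += self._measure(text)

    def flush(self) -> str:
        chunk, self._texts, self._size = "\n\n".join(self._texts), [], 0
        return chunk
```

A caller accumulates texts while `will_fit()` holds and flushes otherwise; swapping `measure` switches the whole pipeline between character and token budgets without touching the accumulation logic.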
@eureka928 force-pushed the feat/token-based-chunking branch from 40a820c to c0ab0fe on January 21, 2026 at 22:09.
@eureka928
Contributor Author

@ryannikolaidis @jer @badGarnet @qued
Could you please review this PR? Thank you.
