
@eureka928
Contributor

Summary

This PR adds token-based chunking support to chunk_by_title() and chunk_elements() using tiktoken, allowing users to specify max_tokens instead of max_characters for better alignment with LLM token limits.

Closes #4127

Changes

New Parameters

| Parameter | Description |
| --- | --- |
| `max_tokens` | Hard maximum chunk token count (mutually exclusive with `max_characters`) |
| `new_after_n_tokens` | Soft maximum: start a new chunk after this many tokens |
| `tokenizer` | Tokenizer name; accepts encoding names (`"cl100k_base"`) or model names (`"gpt-4"`) |

Implementation Details

  • TokenCounter class: Lazy tiktoken integration - only imports tiktoken when token counting is first used
  • Measurement abstraction: Added measure() method to ChunkingOptions that returns chars or tokens based on mode
  • Mutual exclusivity: max_tokens and max_characters cannot be used together
  • Token-based text splitting: New _split_by_tokens() method uses separator preferences with binary search fallback
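The lazy `TokenCounter`, the mutual-exclusivity check, and the `measure()` abstraction described above could look roughly like the following. This is a minimal sketch, not the merged implementation: the class and method names come from this PR description, but the exact signatures and the `ValueError`/fallback handling are assumptions.

```python
# Sketch only: names (TokenCounter, ChunkingOptions, measure, use_token_counting)
# follow the PR description; signatures and error handling are assumptions.


class TokenCounter:
    """Counts tokens with tiktoken, importing it only on first use."""

    def __init__(self, tokenizer: str = "cl100k_base"):
        self._tokenizer_name = tokenizer
        self._encoding = None  # tiktoken is not imported until count() is called

    def count(self, text: str) -> int:
        if self._encoding is None:
            import tiktoken  # lazy import: only needed in token mode

            try:
                # encoding name, e.g. "cl100k_base"
                self._encoding = tiktoken.get_encoding(self._tokenizer_name)
            except ValueError:
                # model name, e.g. "gpt-4"
                self._encoding = tiktoken.encoding_for_model(self._tokenizer_name)
        return len(self._encoding.encode(text))


class ChunkingOptions:
    """Minimal stand-in showing mutual exclusivity and measure()."""

    def __init__(self, max_characters=None, max_tokens=None, tokenizer="cl100k_base"):
        if max_characters is not None and max_tokens is not None:
            raise ValueError("max_characters and max_tokens are mutually exclusive")
        self.max_tokens = max_tokens
        self._counter = TokenCounter(tokenizer) if max_tokens is not None else None

    @property
    def use_token_counting(self) -> bool:
        return self.max_tokens is not None

    def measure(self, text: str) -> int:
        """Size of `text` in the active unit: tokens or characters."""
        return self._counter.count(text) if self.use_token_counting else len(text)
```

Because the import happens inside `count()`, character-based chunking never touches tiktoken, so the dependency stays optional.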

Files Changed

  • requirements/extra-chunking-tokens.in - New tiktoken dependency
  • setup.py - Added chunking-tokens extra
  • unstructured/chunking/base.py - Core token-based chunking logic
  • unstructured/chunking/title.py - Updated chunk_by_title() signature
  • unstructured/chunking/basic.py - Updated chunk_elements() signature
  • test_unstructured/chunking/test_base.py - Unit tests
  • test_unstructured/chunking/test_title.py - Integration tests

Usage

```python
from unstructured.chunking.title import chunk_by_title

# Token-based chunking (new)
chunks = chunk_by_title(
    elements,
    max_tokens=512,
    new_after_n_tokens=400,
    tokenizer="gpt-4",  # or "cl100k_base"
)

# Character-based chunking (unchanged)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
)
```

Installation

To use token-based chunking, install with the new extra:

```shell
pip install "unstructured[chunking-tokens]"
```

Commits

- Add TokenCounter class for lazy tiktoken integration
- Add max_tokens, new_after_n_tokens, tokenizer parameters
- Add use_token_counting property and measure() method
- Update PreChunkBuilder to use measurement abstraction
- Update _TextSplitter with _split_by_tokens() method
- Update _TableChunker to use measure() for size comparison
- Add max_tokens, new_after_n_tokens, tokenizer parameters
- Update docstring with new parameter descriptions
- Add max_tokens, new_after_n_tokens, tokenizer parameters
- Update docstring with new parameter descriptions
- Add TokenCounter unit tests
- Add ChunkingOptions token validation tests
- Add _TextSplitter token mode tests
- Add chunk_by_title integration tests for token-based chunking
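The "Update PreChunkBuilder to use measurement abstraction" commit can be illustrated with a small sketch. `PreChunkBuilder` is the name from the commit list, but the `will_fit`/`add`/`flush` methods and their behavior here are assumptions; `measure` is pluggable so the sketch runs with `len()` in place of a token counter.

```python
# Sketch only: the real builder also accounts for separators, metadata,
# and element boundaries; this shows just the measurement abstraction.


class PreChunkBuilder:
    def __init__(self, maxsize: int, measure=len):
        self._maxsize = maxsize
        self._measure = measure  # len() for chars, TokenCounter.count for tokens
        self._texts = []
        self._size = 0

    def will_fit(self, text: str) -> bool:
        # Separator cost (e.g. "\n\n") is ignored in this sketch.
        return self._size + self._measure(text) <= self._maxsize

    def add(self, text: str) -> None:
        self._texts.append(text)
        self._size += self._measure(text)

    def flush(self) -> str:
        chunk, self._texts, self._size = "\n\n".join(self._texts), [], 0
        return chunk
```

A caller accumulates texts while `will_fit()` holds and flushes otherwise; swapping `measure` switches the whole pipeline between character and token budgets without touching the accumulation logic.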
@eureka928 force-pushed the feat/token-based-chunking branch from 40a820c to c0ab0fe on January 21, 2026 at 22:09.
@eureka928
Contributor Author

@ryannikolaidis @jer @badGarnet @qued
Could you please review this PR? Thank you.
