chunk

the fastest text chunking library — up to 1 TB/s throughput

you know how every chunking library claims to be fast? yeah, we actually meant it.

chunk splits text at semantic boundaries (periods, newlines, the usual suspects) and does it stupid fast. we're talking "chunk the entire english wikipedia in 120ms" fast.

want to know how? read the blog post where we nerd out about SIMD instructions and lookup tables.
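
if you just want the gist of the lookup table part, here's a minimal sketch. it is not the crate's actual code: `build_table` and `last_delimiter` are hypothetical names made up for illustration, and it scans one byte at a time with no SIMD at all.

```rust
/// Build a 256-entry table where table[b] is true when byte b is a delimiter.
/// Hypothetical helper for illustration; not part of the chunk API.
fn build_table(delimiters: &[u8]) -> [bool; 256] {
    let mut table = [false; 256];
    for &d in delimiters {
        table[d as usize] = true;
    }
    table
}

/// Return the index of the last delimiter byte in `window`, if there is one.
/// Each byte costs a single table load, however many delimiters are configured.
fn last_delimiter(window: &[u8], table: &[bool; 256]) -> Option<usize> {
    window.iter().rposition(|&b| table[b as usize])
}
```

calling `last_delimiter(text, &build_table(b"\n.?"))` gives you the position of the last newline, period, or question mark in `text`, which is the kind of answer a chunker needs when it looks for a place to cut.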

See benches/ for detailed benchmarks.

📦 Installation

cargo add chunk

looking for python or javascript?

🚀 Usage

use chunk::chunk;

let text = b"Hello world. How are you? I'm fine.\nThanks for asking.";

// With defaults (4KB chunks, split at \n . ?)
let chunks: Vec<&[u8]> = chunk(text).collect();

// With custom size
let chunks: Vec<&[u8]> = chunk(text).size(1024).collect();

// With custom delimiters
let chunks: Vec<&[u8]> = chunk(text).delimiters(b"\n.?!").collect();

// With multi-byte pattern (e.g., metaspace ▁ for SentencePiece tokenizers)
let metaspace = "▁".as_bytes();
let chunks: Vec<&[u8]> = chunk(text).pattern(metaspace).prefix().collect();

// With consecutive pattern handling (split at START of runs, not middle)
let chunks: Vec<&[u8]> = chunk(b"word   next")
    .pattern(b" ")
    .consecutive()
    .collect();

// With forward fallback (search forward if no pattern in backward window)
let chunks: Vec<&[u8]> = chunk(text)
    .pattern(b" ")
    .forward_fallback()
    .collect();
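
to make the backward search concrete, here's a rough std-only sketch of the idea: take a window of at most `size` bytes, look backward for the last delimiter, and cut right after it. `split_chunks` is a hypothetical function written for this README, not how the crate is implemented, and the hard cut it does when a window has no delimiter is an assumption of the sketch.

```rust
/// Split `text` into chunks of at most `size` bytes, cutting just after the
/// last delimiter found in each window.
/// Hypothetical helper for illustration; not the chunk crate's implementation.
fn split_chunks<'a>(text: &'a [u8], size: usize, delimiters: &[u8]) -> Vec<&'a [u8]> {
    assert!(size > 0, "chunk size must be non-zero");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = (start + size).min(text.len());
        let cut = if end == text.len() {
            // The remainder fits in one chunk.
            end
        } else {
            // Search backward from the end of the window for a delimiter.
            match text[start..end].iter().rposition(|b| delimiters.contains(b)) {
                // Keep the delimiter with the chunk on its left.
                Some(i) => start + i + 1,
                // Assumption of this sketch: no delimiter means a hard cut.
                None => end,
            }
        };
        chunks.push(&text[start..cut]);
        start = cut;
    }
    chunks
}
```

with the sample text from above, `split_chunks(text, 20, b"\n.?")` returns slices that borrow from `text`, never exceed 20 bytes, and concatenate back to the original input.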

📝 Citation

If you use chunk in your research, please cite it as follows:

@software{chunk2025,
  author = {Minhas, Bhavnick},
  title = {chunk: The fastest text chunking library},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/chonkie-inc/chunk}},
}

📄 License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
