"I like them big, I like them CONTRIBUTING" ~ Moto Moto, probably
Welcome fellow CHONKer! We're thrilled you want to contribute to Chonkie. Every contributionβwhether fixing bugs, adding features, or improving documentationβmakes Chonkie better for everyone.
- Check existing issues or open a new one to start a discussion
- Read Chonkie's documentation and core concepts
- Set up your development environment using the guide below
# 1. Fork and clone the repository
git clone https://github.com/chonkie-inc/chonkie.git
cd chonkie
# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies (choose one)
pip install -e ".[dev]" # Base development setup
pip install -e ".[dev,semantic]" # If working on semantic features
pip install -e ".[dev,all]" # For all featurespytest # Run all tests
pytest tests/test_token_chunker.py # Run specific test file
pytest --cov=chonkie # Run tests with coverageWe use ruff for formatting and linting:
ruff check . # Check code quality
ruff check --fix . # Auto-fix issues where possibleOur style configuration enforces:
- Code formatting (
F) - Import sorting (
I) - Documentation style (
D) - Docstring coverage (
DOC)
We follow Google-style docstrings:
def chunk_text(text: str, chunk_size: int = 512) -> List[str]:
"""Split text into chunks of specified size.
Args:
text: Input text to chunk
chunk_size: Maximum size of each chunk
Returns:
List of text chunks
Raises:
ValueError: If chunk_size <= 0
"""
passsrc/
βββ chonkie/
βββ chunker/ # Chunking implementations
βββ embeddings/ # Embedding implementations
βββ refinery/ # Refinement utilities
Start with issues labeled good-first-issue
- Improve existing docs
- Add examples or tutorials
- Fix typos
- Implement new chunking strategies
- Add tokenizer support
- Optimize existing chunkers
- Improve test coverage
- Profile and optimize code
- Add benchmarks
- Improve memory usage
Look for issues with [FEAT] labels, especially those from Chonkie Maintainers
feature/descriptionfor new featuresfix/descriptionfor bug fixesdocs/descriptionfor documentation changes
Write clear, descriptive commit messages:
feat: add batch processing to WordChunker
- Implement batch_process method
- Add tests for batch processing
- Update documentation
- Core dependencies go in
project.dependencies - Optional features go in
project.optional-dependencies - Development tools go in the
devoptional dependency group
- Make sure your PR is for the
developmentbranch - All PRs need at least one review
- Maintainers will review for:
- Code quality (via ruff)
- Test coverage
- Performance impact
- Documentation completeness
Chonkie does not follow strict semantic versioning. We follow the following rules:
- 'MAJOR' version when we refactor/rewrite large parts of the codebase
- 'MINOR' version when we add breaking changes (e.g. changing a public API)
- 'PATCH' version when we add non-breaking features (e.g. adding a new chunker) or fix bugs
Current development dependencies (as of April 9, 2025):
[project.optional-dependencies]
dev = [
"tiktoken>=0.5.0",
"datasets>=1.14.0",
"transformers>=4.0.0",
"pytest>=6.2.0",
"pytest-cov>=4.0.0",
"pytest-xdist>=2.5.0",
"coverage",
"ruff>=0.0.265",
"mypy>=1.11.0"
]model2vec: For model2vec embeddingsst: For sentence-transformersopenai: For OpenAI embeddingscohere: For Cohere embeddingssemantic: For semantic featuresall: All optional dependencies
- Questions? Open an issue or ask in Discord
- Bugs? Open an issue or report in Discord
- Chat? Join our Discord!
- Email? Contact support@chonkie.ai
Every contribution helps make Chonkie better! We appreciate your time and effort in helping make Chonkie the CHONKiest it can be!
Remember:
"A journey of a thousand CHONKs begins with a single commit" ~ Ancient Proverb, probably