ccontext (collect-context) is a cross-platform utility designed to streamline the process of gathering and sending the context of a directory to large language models (LLMs) like ChatGPT-4o. Our mission is to make collecting and sending context to an LLM as easy as possible.
ccontext_demo.mov
Features
- 🌟 Easy Setup: Quick installation and configuration.
- 🌍 Cross-Platform Support: Supports Windows, macOS, and Linux.
- 💾 Binary File Support: Handle various binary files including PDFs, Word documents, images, audio, and video files.
- 📄 Markdown and PDF Generation: Generate detailed Markdown and PDF files of the directory structure and file contents.
- 🌐 Crawling of (documentation) Sites: Crawl and gather data from multiple sites using a specified list of URLs.
- ✂️ Tokenization and Chunking: Automatically handles tokenization and chunking to stay within LLM token limits.
- 🔧 Configurable Exclusions and Inclusions: Flexibly specify which files and directories to include or exclude.
- 🗣️ Verbose Output: Optional verbose mode for detailed output and debugging.
- 📝 Prompt Templates (Upcoming): Create and use custom templates for different types of prompts.
- Installation
- Usage
- Configuration
- Binary File Handling
- Document Crawling
- Use Cases and Examples
- Troubleshooting
- Development Guide
We recommend installing ccontext using pipx. pipx is a tool that lets you install and run Python applications in isolated environments, ensuring clean installation and easy management of CLI applications.
-
First, install pipx if you haven't already:
# On macOS brew install pipx pipx ensurepath # On Ubuntu/Debian sudo apt install pipx pipx ensurepath # On Windows python -m pip install --user pipx python -m pipx ensurepath # or read https://pipx.pypa.io/stable/installation/#on-windows
-
Install ccontext using pipx:
pipx install ccontext
Why use pipx?
- Isolated Environment: Each application runs in its own virtual environment
- No Dependency Conflicts: Avoids conflicts with other Python packages
- Easy Updates: Simple command to upgrade (
pipx upgrade ccontext) - Clean Uninstallation: Remove everything with one command (
pipx uninstall ccontext) - Global Access: Installed applications are available system-wide
If you prefer to install from source:
-
Clone the repository:
git clone https://github.com/oxillix/ccontext.git cd ccontext -
Set up a virtual environment:
python3 -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate`
-
Install dependencies:
pip install -r requirements.txt
-
Install the package:
pip install .
-
Run
ccontextin the folder to ccollect with default settings defined in~/.ccontext/config.json:ccontext
-
Specify a root path, exclusions, and inclusions:
ccontext -p /path/to/directory -e ".git|node_modules" -i "important_file.txt|docs"
-h, --help: Show help message.-p, --root_path: The root path to start the directory tree (default: current directory).-e, --excludes: Additional files or directories to exclude, separated by|, e.g.,node_modules|.git.-i, --includes: Files or directories to include, separated by|, e.g.,important_file.txt|docs.-m, --max_tokens: Maximum number of tokens allowed before chunking.-c, --config: Path to a custom configuration file.-v, --verbose: Enable verbose output to stdout.-ig, --ignore_gitignore: Ignore the.gitignorefile for exclusions.-g, --generate-pdf: Generate a PDF of the directory tree and file contents.-gm, --generate-md: Generate a Markdown file of the directory tree and file contents.--crawl: Crawls the sites specified in the config.
ccontext -p /home/user/project -e ".git|build" -i "README.md|src"ccontext looks for configuration in the following order:
- Custom config file specified via
-cargument .ccontext-config.jsonin the current directory- If present, ccontext will automatically detect and use this local configuration file
- Create this file in the same directory where you run the ccontext command
~/.ccontext/config.json(default user configuration)
{
"verbose": false, // Enable detailed output
"max_tokens": 115000, // Maximum tokens before chunking
"model_type": "gpt-4o", // LLM model type for tokenization
"buffer_size": 0.05, // Token buffer size (0-1)
// System prompt for LLM context
"context_prompt": "[[SYSTEM INSTRUCTIONS]] The following output represents...",
// Web crawler configuration
"urls_to_crawl": [
{
"url": "https://www.django-rest-framework.org/",
"match": ["https://www.django-rest-framework.org/**"],
"exclude": ["https://www.django-rest-framework.org/community/**"],
"selector": "",
"maxPagesToCrawl": 100,
"outputFileName": "django-rest-framework.org.json",
"maxTokens": 10000000
}
],
// Files/folders to explicitly include
"included_folders_files": [],
// Files/folders to exclude (supports glob patterns)
"excluded_folders_files": [
"**/.git",
"**/bin",
"**/build",
"**/node_modules/**",
"**/venv",
"**/__pycache__",
"**/package-lock.json",
"**/ccontext.egg-info",
"**/dist",
"**/__tests__",
"**/coverage",
"**/.next",
"**/pnpm-lock.yaml",
"**/poetry.lock",
"**/ccontext-output.pdf",
"**/ccontext-output.md",
"**/*.phpstorm.meta.php",
"**/*.min.js",
"**/composer.lock",
"**/*.lock",
"**/vendor",
"**/laravel_access.log",
"**/*.DS_Store",
"**/*.tox"
],
// File extensions that can be uploaded to LLMs
"uploadable_extensions": [
// Documents
".pdf",
".doc",
".docx",
".xls",
".xlsx",
".ppt",
".pptx",
// Images
".jpg",
".jpeg",
".png",
".gif",
".bmp",
".tiff",
".webp",
".heic",
// Audio
".mp3",
".wav",
".ogg",
".flac",
".aac",
".m4a",
// Video
".mp4",
".mkv",
".avi",
".mov",
".wmv",
".webm",
// Archives
".zip",
".rar",
".7z",
".tar",
".gz",
// Binary/System
".exe",
".dll",
".iso",
".dmg",
".bin",
".dat",
".apk",
".img",
".so",
".swf",
".psd"
]
}ccontext uses the wcmatch library for glob pattern matching, which gives you powerful but easy-to-use file matching capabilities. Here's a simple guide to using glob patterns:
-
Important Wildcards Explained:
-
*(single star): Matches anything in the current folder only"*.txt" # Matches: a.txt, b.txt (in current folder) "*.txt" # Won't match: sub/a.txt, deep/sub/b.txt -
**(double star): Matches any number of folders"**/temp" # Matches: temp, sub/temp, deep/sub/temp "**/temp" # Won't match: temp/file.txt -
**/*(double star slash star): Matches everything in all folders"**/*.txt" # Matches: a.txt, sub/b.txt, very/deep/c.txt "**/*" # Matches everything, everywhere -
?matches any single character -
.txtmatches exact file extension
-
-
Simple Examples:
{ "excluded_folders_files": [ // Basic matching "temp.txt", // Matches exact file temp.txt "*.txt", // Matches all .txt files in root folder "**/*.txt", // Matches all .txt files in any folder // Folder matching "temp/*", // Matches everything in temp folder "**/temp", // Matches temp folder anywhere "**/temp/**", // Matches everything in any temp folder // Common use cases "**/node_modules", // Matches node_modules folders anywhere "**/__pycache__", // Matches Python cache folders "**/*.pyc", // Matches Python compiled files "build/*" // Matches everything in build folder ] } -
Tips for Beginners:
- Start simple! Use
*.extfor file extensions - Use
**/when you want to match in any folder - Test your patterns with a small folder first
- When in doubt, be more specific
- Remember, patterns are case-sensitive
- Start simple! Use
The glob system is very forgiving - if you make a mistake, it usually just won't match anything rather than causing errors. Feel free to experiment!
| Option | Description | Default |
|---|---|---|
| verbose | Enable detailed output | false |
| max_tokens | Maximum tokens before chunking | 115000 |
| model_type | LLM model type for tokenization | "gpt-4o" |
| buffer_size | Token buffer size (0-1) | 0.05 |
| excluded_folders_files | Glob patterns for exclusion | [".git", ...] |
| included_folders_files | Glob patterns for inclusion | [] |
| uploadable_extensions | File extensions to upload | [".pdf", ...] |
ccontext supports handling binary files through the uploadable_extensions configuration.
- Documents:
.pdf,.doc,.docx,.xls,.xlsx,.ppt,.pptx - Images:
.jpg,.jpeg,.png,.gif,.bmp,.tiff,.webp,.heic - Audio:
.mp3,.wav,.ogg,.flac,.aac,.m4a - Video:
.mp4,.mkv,.avi,.mov,.wmv,.webm - Archives:
.zip,.rar,.7z,.tar,.gz - Binary/System:
.exe,.dll,.iso,.dmg,.bin,.dat,.apk,.img,.so,.swf,.psd
- Binary files matching
uploadable_extensionsare prepared for upload to LLMs - File references are automatically copied to clipboard
- Most LLM providers limit maximum of X binary files per prompt
- Rate limits may apply based on your LLM provider
Example configuration for handling specific file types:
{
"uploadable_extensions": [".pdf", ".jpg", ".png", ".xlsx"]
}The crawling feature allows you to gather documentation from websites for context.
{
"urls_to_crawl": [
{
"url": "https://docs.example.com",
"match": ["https://docs.example.com/**"],
"exclude": ["https://docs.example.com/internal/**"],
"selector": "",
"maxPagesToCrawl": 100,
"outputFileName": "docs.json",
"maxTokens": 2000000
}
]
}- url: Starting URL for crawling
- match: Glob patterns for URLs to include
- exclude: Glob patterns for URLs to exclude
- selector: CSS selector for content extraction
- maxPagesToCrawl: Limit on pages to crawl
- outputFileName: Name of output file
- maxTokens: Maximum tokens to collect
- Use specific
matchpatterns - Respect robots.txt and site policies
- Analyzing a Python Project
ccontext -p /path/to/project -e "venv|__pycache__|*.pyc"- Processing Documentation
ccontext -p ./docs --crawl -gm- Including Specific Files
ccontext -i "README.md|docs/*|*.py"- Generating PDF and Markdown
ccontext -g -gm # Generates both PDF and Markdown- With GitHub Copilot
ccontext -p . -e "node_modules|dist" -i "src/**/*.ts"- **With ChatGPT (webapp has max 32k) **
ccontext -p . --max_tokens 32000-
Clipboard Issues in SSH
- Issue: Cannot copy to clipboard in SSH session
- Solution:
- Use SSH with X11 forwarding (
ssh -X user@host), test using xeyes - On Mac, install XQuartz (
brew install --cask xquartz)
- Use SSH with X11 forwarding (
-
Token Limit Exceeded
- Issue: Content too large for LLM
- Solution: Adjust
max_tokensor use chunking feature
-
Binary File Handling
- Issue: Binary files not being processed
- Solution: Check
uploadable_extensionsconfiguration
Otherwise:
- Issue: Path separators in configuration
- Solution: Use forward slashes or escaped backslashes
- Issue: X11 clipboard support
- Solution: Install xclip or xsel
- Issue: Clipboard permissions
- Solution: Grant terminal app accessibility permissions
ccontext/
├── ccontext/ # Main package directory
│ ├── __init__.py
│ ├── main.py # Entry point
│ ├── file_tree.py # Tree operations
│ └── ...
├── tests/ # Test directory
├── docs/ # Documentation
└── examples/ # Example configurations
- Clone the repository
- Create a virtual environment
- Install development dependencies
- Run tests
git clone https://github.com/oxillix/ccontext.git
# or
git clone git@github.com:NicolasArnouts/ccontext.git
cd ccontext
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
pip3 install -e .- Fork the repository
- Create a feature branch
- Write tests for new features
- Submit a pull request
- Follow PEP 8 guidelines
- use isort and black
- Use type hints
- Keep functions focused and small
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to all contributors! 😊
- Inspired by the need for better context handling in AI interactions.
- Built with love and passion for the developer community! 💖
Feel free to raise issues or contribute to the project. We appreciate your support!
Happy coding adventures! 🚀 Nicolas Arnouts
Looking for a skilled freelancer? I'm available for hire! Let's collaborate — reach out to me at: arnouts.software@gmail.com
When using ccontext's web crawling feature in WSL2, you can use crawl4ai for reliable web content extraction:
- First, set up the WSL2 environment:
python -m ccontext.fix_wsl- Then run the crawler as usual:
python -m ccontext --crawlThe crawler will automatically detect WSL2 and configure the environment appropriately. If you prefer to use the crawler directly:
python -m ccontext.run_crawlers --url https://example.com --output example.mdIf you encounter issues with the crawler in WSL2:
- Ensure Python and dependencies are properly installed
- Try running with explicit parameters:
python -m ccontext.run_crawlers --url https://example.com --output example.md --max-pages 10
- Check that any security software isn't blocking the network connections
- For more detailed logging, add the --verbose flag