Clean AI agent logs for safe dataset sharing
Turn your Claude Code, Cursor, Continue, or Aider logs into publishable training datasets with one command.
No installation required! Run with uvx:
# Interactive mode (recommended for first time)
uvx agent-sanitizer
# Specify input/output
uvx agent-sanitizer --input ~/.claude/projects --output ./my-dataset
# Dry run to see what would change
uvx agent-sanitizer --dry-run
# Upload directly to HuggingFace
uvx agent-sanitizer --upload username/my-dataset
Agent Sanitizer automatically removes:
- Credentials: API keys, passwords, tokens
- PII: Names, emails, phone numbers
- Crypto: Wallet keys, seed phrases
- Paths: Identifying file paths and project names
While preserving:
- Code examples: Including test data and development configs
- Tool usage: File operations, commands, workflows
- Thinking traces: Multi-step reasoning
- Context: Real coding patterns
Supported log sources:
- Claude Code (.claude/projects)
- Cursor (.cursor/logs)
- Continue (.continue/sessions)
- Aider (.aider/history)
- Any JSONL-based agent logs
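To preview which files under one of these directories might be picked up, a small script can list the JSONL files it contains. This is a rough sketch, not the tool's own discovery logic: the helper name `find_jsonl_logs` is hypothetical, and it assumes logs are stored as `.jsonl` files somewhere below the given directory.

```python
from pathlib import Path


def find_jsonl_logs(root):
    """Return all .jsonl files below root, sorted by path.

    Hypothetical helper for illustration; agent-sanitizer may discover
    logs differently.
    """
    root = Path(root).expanduser()
    return sorted(p for p in root.rglob("*.jsonl") if p.is_file())


if __name__ == "__main__":
    for path in find_jsonl_logs("~/.claude/projects"):
        print(path)
```

Running this before sanitizing gives a quick sense of how many log files exist and where they live.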
Before cleaning, scans for:
- Credentials (API keys, passwords, tokens)
- Personal information (names, emails, phones)
- Cryptocurrency (wallets, keys, seeds)
- Sensitive data (SSNs, private keys)
Sanitization approach:
- Pattern-based detection with low false positives
- Context-aware replacements
- Preserves test data and examples
- Maintains dataset value
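The idea of pattern-based detection that still preserves safe examples can be sketched roughly as follows. The regular expressions, the placeholder strings, and the `redact` function are all assumptions made for illustration; they are not the patterns agent-sanitizer actually ships.

```python
import re

# Hypothetical patterns for illustration; not the tool's actual rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY_RE = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b")

# Domains reserved for documentation, treated here as safe to keep.
SAFE_EMAIL_DOMAINS = {"example.com", "example.org", "example.net"}


def redact(text: str) -> str:
    """Replace likely secrets with placeholders, keeping safe example emails."""

    def replace_email(match: re.Match) -> str:
        domain = match.group(0).rsplit("@", 1)[1].lower()
        return match.group(0) if domain in SAFE_EMAIL_DOMAINS else "[EMAIL]"

    text = API_KEY_RE.sub("[API_KEY]", text)
    return EMAIL_RE.sub(replace_email, text)


if __name__ == "__main__":
    print(redact("contact bob@corp.io or user@example.com"))
```

The allowlist is what keeps tutorial-style content such as user@example.com intact while real addresses are replaced.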
Upload directly after sanitization:
uvx agent-sanitizer \
--input ~/.claude/projects \
--upload username/my-coding-dataset \
--private # optional: make dataset private
See exactly what would change:
uvx agent-sanitizer --dry-run
No installation needed:
uvx agent-sanitizer
Or install with pip:
pip install agent-sanitizer
agent-sanitizer --help
Or install from source:
git clone https://github.com/zenlm/agent-sanitizer
cd agent-sanitizer
pip install -e .
# Interactive - walks you through the process
uvx agent-sanitizer
# Specify directories
uvx agent-sanitizer \
--input ~/.claude/projects \
--output ./clean-dataset
# Non-interactive mode
uvx agent-sanitizer \
--input ~/.claude/projects \
--output ./clean-dataset \
--no-interactive
# Login first
huggingface-cli login
# Sanitize and upload in one command
uvx agent-sanitizer \
--input ~/.claude/projects \
--upload myusername/my-coding-dataset
# Make it private
uvx agent-sanitizer \
--input ~/.claude/projects \
--upload myusername/my-dataset \
--private
After sanitization, use the dataset:
from datasets import load_dataset
# If uploaded to HuggingFace
dataset = load_dataset("username/my-coding-dataset")
# Or load from local directory
import json
data = []
with open("clean-dataset/splits/train.jsonl") as f:
    for line in f:
        data.append(json.loads(line))
print(f"Loaded {len(data)} examples")
Creates a structured dataset:
clean-dataset/
├── splits/
│   ├── train.jsonl
│   ├── val.jsonl
│   └── test.jsonl
├── audit_report.json
└── sanitization_summary.json
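As a quick sanity check on this layout, the number of examples in each split can be counted with a few lines of Python. This is only a sketch: `count_split_examples` is a hypothetical helper, and it assumes one JSON record per non-empty line in each split file.

```python
from pathlib import Path


def count_split_examples(dataset_dir):
    """Count non-empty lines in each split file that exists.

    Hypothetical helper; assumes the splits/ layout shown above.
    """
    counts = {}
    for name in ("train", "val", "test"):
        path = Path(dataset_dir) / "splits" / f"{name}.jsonl"
        if path.exists():
            with path.open(encoding="utf-8") as f:
                counts[name] = sum(1 for line in f if line.strip())
    return counts


if __name__ == "__main__":
    print(count_split_examples("clean-dataset"))
```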
Each JSONL line contains:
- timestamp: When the interaction occurred
- model: Which AI model was used
- tokens: Token usage breakdown
- content: Array of content blocks (thinking, tool use, text)
- cwd: Working directory (anonymized)
- git_branch: Git context
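A record with these fields can be inspected along the following lines. The field names come from the list above, but the shape of each content block (a dict with a `type` key) is an assumption, and `summarize` is a hypothetical helper rather than part of the tool.

```python
import json
from collections import Counter


def summarize(lines):
    """Tally models and content block types across JSONL lines."""
    models = Counter()
    block_types = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        models[record.get("model", "unknown")] += 1
        for block in record.get("content", []):
            # Assumption: each content block is a dict with a "type" key.
            if isinstance(block, dict):
                block_types[block.get("type", "unknown")] += 1
    return models, block_types


if __name__ == "__main__":
    with open("clean-dataset/splits/train.jsonl", encoding="utf-8") as f:
        models, block_types = summarize(f)
    print(models.most_common())
    print(block_types.most_common())
```

Because `summarize` accepts any iterable of strings, it works on an open file as well as on a list of lines held in memory.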
What gets removed:
- Credentials
  - API keys (OpenAI, Anthropic, etc.)
  - Passwords and auth tokens
  - GitHub personal access tokens
  - SSH/PGP private keys
- Personal Information
  - Email addresses (except safe ones like user@example.com)
  - Phone numbers
  - SSNs
  - Personal names
- Cryptocurrency
  - Private keys (Ethereum, Bitcoin, etc.)
  - Seed phrases
  - Wallet addresses (when not test data)
What is preserved:
- Test Data
  - Hardhat/Ganache test accounts
  - BIP-39 test seed phrases
  - Localhost configurations
- Examples
  - Code snippets with placeholders
  - Documentation
  - Tutorial content
- Technical Context
  - Usernames (demonstrates workflows)
  - Tool usage patterns
  - Error handling flows
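The split between redacted wallet addresses and preserved test accounts could be implemented with an allowlist, roughly as sketched below. The two addresses are the first two publicly documented default accounts of a local Hardhat node; the regular expression, the placeholder, and the `redact_wallets` function are assumptions for illustration, not the tool's actual logic.

```python
import re

# Matches a 20-byte hex address such as an Ethereum account.
ETH_ADDRESS_RE = re.compile(r"\b0x[0-9a-fA-F]{40}\b")

# Well-known Hardhat default accounts (#0 and #1), stored lowercase.
KNOWN_TEST_ADDRESSES = {
    "0xf39fd6e51aad88f6f4ce6ab8827279cfffb92266",
    "0x70997970c51812dc3a010c7d01b50e0d17dc79c8",
}


def redact_wallets(text):
    """Replace wallet addresses unless they are known test accounts."""

    def replace(match):
        address = match.group(0)
        if address.lower() in KNOWN_TEST_ADDRESSES:
            return address
        return "[WALLET_ADDRESS]"

    return ETH_ADDRESS_RE.sub(replace, text)
```

Comparing in lowercase means checksummed and non-checksummed spellings of the same test account are both preserved.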
Contributions welcome! Please:
- Check existing issues
- Create a feature branch
- Add tests for new patterns
- Submit a pull request
Share your sanitized dataset:
- Upload to HuggingFace with --upload
- Tag with agent-coding-dataset
- Link to the original tool in the dataset card
Example datasets:
- zenlm/agent-coding-dataset - 161k Claude Code interactions
BSD-3-Clause - See LICENSE
- GitHub: https://github.com/zenlm/agent-sanitizer
- PyPI: https://pypi.org/project/agent-sanitizer/
- Example Dataset: https://huggingface.co/datasets/zenlm/agent-coding-dataset
- Issues: https://github.com/zenlm/agent-sanitizer/issues
Made with ❤️ for the AI coding community