zenlm/agent-sanitizer

🧹 Agent Sanitizer

Clean AI agent logs for safe dataset sharing

Turn your Claude Code, Cursor, Continue, or Aider logs into publishable training datasets with one command.

Quick Start

No installation required! Run with uvx:

# Interactive mode (recommended for first time)
uvx agent-sanitizer

# Specify input/output
uvx agent-sanitizer --input ~/.claude/projects --output ./my-dataset

# Dry run to see what would change
uvx agent-sanitizer --dry-run

# Upload directly to HuggingFace
uvx agent-sanitizer --upload username/my-dataset

What It Does

Agent Sanitizer automatically removes:

  • ✅ Credentials: API keys, passwords, tokens
  • ✅ PII: Names, emails, phone numbers
  • ✅ Crypto: Wallet keys, seed phrases
  • ✅ Paths: Identifying file paths and project names

While preserving:

  • ✅ Code examples: Including test data and development configs
  • ✅ Tool usage: File operations, commands, workflows
  • ✅ Thinking traces: Multi-step reasoning
  • ✅ Context: Real coding patterns

Supported Agents

  • 🤖 Claude Code (.claude/projects)
  • 🔮 Cursor (.cursor/logs)
  • ⭐️ Continue (.continue/sessions)
  • 🔧 Aider (.aider/history)
  • 📝 Any JSONL-based agent logs

Features

🔒 Comprehensive Security Audit

Before cleaning, the tool scans the logs for:

  • Credentials (API keys, passwords, tokens)
  • Personal information (names, emails, phones)
  • Cryptocurrency (wallets, keys, seeds)
  • Sensitive data (SSNs, private keys)
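As a rough illustration of what a read-only audit pass like this could look like, the sketch below counts matches per category without changing the text. It is not the tool's actual implementation: the `AUDIT_PATTERNS` table, the `audit_text` helper, and the simplified regular expressions are all assumptions made for this example, and a real audit would need far more patterns.

```python
import re
from collections import Counter

# Hypothetical, deliberately simplified patterns (assumptions for illustration).
AUDIT_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def audit_text(text: str) -> Counter:
    """Count matches per category; the input text is never modified."""
    counts = Counter()
    for category, pattern in AUDIT_PATTERNS.items():
        counts[category] += len(pattern.findall(text))
    return counts


# Made-up sample input, not real data.
sample = "Contact jane@corp.io, SSN 123-45-6789.\n-----BEGIN RSA PRIVATE KEY-----"
report = audit_text(sample)
print(dict(report))
```

Keeping the audit separate from the cleaning step means the counts can be reviewed before anything is rewritten.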

🎯 Smart Cleaning

  • Pattern-based detection designed to keep false positives low
  • Context-aware replacements
  • Preserves test data and examples
  • Maintains dataset value
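One way to picture pattern-based cleaning with context-aware replacements is a list of (pattern, placeholder) pairs applied in order, where some patterns key on surrounding context such as a `password=` prefix. This is only a sketch under stated assumptions: the `REPLACEMENTS` table, the `sanitize` helper, the placeholder strings, and the simplified token shapes are hypothetical and do not describe how Agent Sanitizer works internally.

```python
import re

# Hypothetical (pattern, replacement) pairs; token shapes are simplified guesses.
REPLACEMENTS = [
    # A "ghp_" prefix followed by 36 alphanumerics, as a stand-in for a GitHub token.
    (re.compile(r"\bghp_[A-Za-z0-9]{36}\b"), "[REDACTED_GITHUB_TOKEN]"),
    # An "sk-" prefix followed by a long opaque string, as a stand-in for an API key.
    (re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"), "[REDACTED_API_KEY]"),
    # Context-aware: keep the "password=" or "password:" key, replace only the value.
    (re.compile(r"(?i)(password\s*[=:]\s*)\S+"), r"\1[REDACTED_PASSWORD]"),
]


def sanitize(text: str) -> str:
    """Apply each replacement in order and return the cleaned text."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


# Made-up inputs, not real credentials.
print(sanitize("export OPENAI_API_KEY=sk-" + "a" * 24))
print(sanitize("password: hunter2"))
```

Typed placeholders keep the shape of the original line, so the surrounding command or config stays readable in the dataset.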

📤 HuggingFace Integration

Upload directly after sanitization:

uvx agent-sanitizer \
    --input ~/.claude/projects \
    --upload username/my-coding-dataset \
    --private  # optional: make dataset private

๐Ÿ” Dry Run Mode

See exactly what would change:

uvx agent-sanitizer --dry-run

Installation

Option 1: Run with uvx (Recommended)

No installation needed:

uvx agent-sanitizer

Option 2: Install with pip

pip install agent-sanitizer
agent-sanitizer --help

Option 3: Install from source

git clone https://github.com/zenlm/agent-sanitizer
cd agent-sanitizer
pip install -e .

Usage Examples

Basic Usage

# Interactive - walks you through the process
uvx agent-sanitizer

# Specify directories
uvx agent-sanitizer \
    --input ~/.claude/projects \
    --output ./clean-dataset

# Non-interactive mode
uvx agent-sanitizer \
    --input ~/.claude/projects \
    --output ./clean-dataset \
    --no-interactive

Upload to HuggingFace

# Login first
huggingface-cli login

# Sanitize and upload in one command
uvx agent-sanitizer \
    --input ~/.claude/projects \
    --upload myusername/my-coding-dataset

# Make it private
uvx agent-sanitizer \
    --input ~/.claude/projects \
    --upload myusername/my-dataset \
    --private

Use Sanitized Dataset

After sanitization, use the dataset:

import json

from datasets import load_dataset

# Option A: load from the HuggingFace Hub (if you uploaded with --upload)
dataset = load_dataset("username/my-coding-dataset")

# Option B: load a split from the local output directory
data = []
with open("clean-dataset/splits/train.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines
            data.append(json.loads(line))

print(f"Loaded {len(data)} examples")

Output Format

Creates a structured dataset:

clean-dataset/
├── splits/
│   ├── train.jsonl
│   ├── val.jsonl
│   └── test.jsonl
├── audit_report.json
└── sanitization_summary.json

Each JSONL line contains:

  • timestamp: When the interaction occurred
  • model: Which AI model was used
  • tokens: Token usage breakdown
  • content: Array of content blocks (thinking, tool use, text)
  • cwd: Working directory (anonymized)
  • git_branch: Git context
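A quick sanity check on the output is to confirm that each line carries the fields listed above. The sketch below does that for a single line. The field names come from the list, but the `missing_fields` helper and the sample record (including the shapes of `tokens` and `content`) are assumptions for illustration, since the exact schema is not spelled out here.

```python
import json

# Field names taken from the list above; the value shapes below are guesses.
EXPECTED_FIELDS = {"timestamp", "model", "tokens", "content", "cwd", "git_branch"}


def missing_fields(line: str) -> set:
    """Return the expected top-level fields absent from one JSONL line."""
    record = json.loads(line)
    return EXPECTED_FIELDS - set(record)


# Hypothetical record, for illustration only.
sample_line = json.dumps({
    "timestamp": "2025-01-01T00:00:00Z",
    "model": "example-model",
    "tokens": {"input": 10, "output": 20},
    "content": [{"type": "text", "text": "hello"}],
    "cwd": "/home/user/project",
    "git_branch": "main",
})

print(missing_fields(sample_line))
print(missing_fields('{"model": "x"}'))
```

Running this over every line of `splits/train.jsonl` before training is a cheap way to catch truncated or malformed records.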

Security

What Gets Removed

  1. Credentials

    • API keys (OpenAI, Anthropic, etc.)
    • Passwords and auth tokens
    • GitHub personal access tokens
    • SSH/PGP private keys
  2. Personal Information

    • Email addresses (except safe ones like user@example.com)
    • Phone numbers
    • SSNs
    • Personal names
  3. Cryptocurrency

    • Private keys (Ethereum, Bitcoin, etc.)
    • Seed phrases
    • Wallet addresses (when not test data)

What Gets Preserved

  1. Test Data

    • Hardhat/Ganache test accounts
    • BIP-39 test seed phrases
    • Localhost configurations
  2. Examples

    • Code snippets with placeholders
    • Documentation
    • Tutorial content
  3. Technical Context

    • Usernames (demonstrates workflows)
    • Tool usage patterns
    • Error handling flows

Contributing

Contributions welcome! Please:

  1. Check existing issues
  2. Create a feature branch
  3. Add tests for new patterns
  4. Submit a pull request

Community Datasets

Share your sanitized dataset:

  1. Upload to HuggingFace with --upload
  2. Tag it with agent-coding-dataset
  3. Link to the original tool in the dataset card

Example datasets:

License

BSD-3-Clause - See LICENSE

Links


Made with ❤️ for the AI coding community
