claude-mining ⛏️

LLM-powered extraction of insights from your Claude conversation history.

⚠️ PRIVACY FIRST - READ THIS: This repo contains scripts only. Your exported data should stay in a private location (Google Drive, local folder) and NEVER be committed to this repository. All examples in this README use obviously fake data.

This is NOT regex matching. The scripts use LLMs to interpret relationships, context, and importance, much as a human assistant would when reading your conversations.

⚡ TL;DR - New Year Contacts in 5 Minutes

Export: claude.ai → Settings → Privacy → Export data
Install: pip install google-genai rapidfuzz
Extract: python scripts/intelligent_contacts.py your_export.json --limit 10
Clean: python scripts/deduplicate_contacts.py contacts.json --auto-merge 0.95
Review: Open contacts.txt → prioritized list for New Year greetings!

Cost: ~$5 for 740 conversations (Gemini 3 Flash)

🎯 Use Cases

| Script | Purpose | Method |
|---|---|---|
| `intelligent_contacts.py` | Extract & classify people from conversations | Multi-LLM (Gemini/Claude) 🧠 |
| `deduplicate_contacts.py` | Entity resolution & contact cleanup | Fuzzy matching + Human review |

See ROADMAP.md for planned features and research.

📊 Model Comparison (December 2025)

Contact extraction is a needle-in-haystack problem - finding names, relationships, and context scattered across hundreds of conversations. In a side-by-side benchmark on the same 10 conversations, Gemini 3 Flash found more contacts than either Claude model tested, in less time and at a fraction of the cost.

Benchmark Results (Same 10 Conversations)

| Model | Contacts Found | Time | Cost (10 convs) | Cost (740 convs) |
|---|---|---|---|---|
| Claude Sonnet | 7 | ~2 min | ~$1.50 | ~$111 |
| Claude Opus | 45 | ~5 min | ~$6.00 | ~$444 |
| Gemini 3 Flash | 58 | 74 sec | ~$0.10 | ~$7.40 |

Note: Costs are estimates based on API pricing. The actual cost for 740 conversations was $4.45, lower than the estimate because pre-filtering skips conversations without contact mentions and real token consumption came in under the projection. See ROADMAP.md for real-world results.

Key Findings

Gemini 3 Flash found 29% more contacts than Opus (58 vs 45)
60x cheaper than Claude Opus ($7.40 vs $444 for full extraction)
About 4x faster than Claude Opus (74 sec vs ~5 min)
Gemini excels at needle-in-haystack tasks (a pattern I first observed during my time at NASDAQ)
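
The headline numbers can be re-derived from the benchmark table. Here is a quick arithmetic check, assuming costs scale linearly with conversation count and taking Opus's "~5 min" as 300 seconds:

```python
# Benchmark figures copied from the results table (same 10 conversations).
opus = {"contacts": 45, "seconds": 5 * 60, "cost_10": 6.00}
flash = {"contacts": 58, "seconds": 74, "cost_10": 0.10}

TOTAL_CONVERSATIONS = 740


def extrapolate(cost_for_10: float, total: int = TOTAL_CONVERSATIONS) -> float:
    """Scale the 10-conversation cost linearly to the full export."""
    return cost_for_10 / 10 * total


more_contacts_pct = (flash["contacts"] / opus["contacts"] - 1) * 100
cost_ratio = extrapolate(opus["cost_10"]) / extrapolate(flash["cost_10"])
speedup = opus["seconds"] / flash["seconds"]

print(f"{more_contacts_pct:.0f}% more contacts")  # 29% more contacts
print(f"{cost_ratio:.0f}x cheaper")               # 60x cheaper
print(f"{speedup:.1f}x faster")                   # 4.1x faster
```

Linear scaling is only a rough guide; real costs depend on how long each conversation is.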

Why Gemini Wins Here

This task requires scanning large amounts of text to find scattered mentions of people - exactly the "needle in haystack" problem Gemini models are optimized for. Claude's strength in nuanced reasoning doesn't provide much advantage when the task is fundamentally about recall and pattern detection.

Model Pricing (December 2025)

| Model | Input | Output | Best For |
|---|---|---|---|
| Gemini 3 Flash | $0.10/1M | $0.40/1M | Recommended - best value |
| Gemini 3 Pro | $0.50/1M | $2.00/1M | Higher quality, still cheap |
| Claude Haiku | $0.25/1M | $1.25/1M | Not recommended for this task |
| Claude Sonnet | $3.00/1M | $15.00/1M | Good general purpose |
| Claude Opus | $15.00/1M | $75.00/1M | Best reasoning, overkill here |
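
To turn per-token prices into a budget, multiply token counts by the rates above. This is a minimal sketch: the dictionary keys are illustrative labels (only `gemini-3-flash` and `opus` appear as `-m` values in this README), and the 20M/1M token workload is a made-up example, not a measurement.

```python
# Prices in USD per 1M tokens (input, output), copied from the pricing table.
PRICING = {
    "gemini-3-flash": (0.10, 0.40),
    "gemini-3-pro": (0.50, 2.00),
    "claude-haiku": (0.25, 1.25),
    "claude-sonnet": (3.00, 15.00),
    "claude-opus": (15.00, 75.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token volume."""
    price_in, price_out = PRICING[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 20M input tokens and 1M output tokens.
for model in PRICING:
    print(f"{model:15s} ${estimate_cost(model, 20_000_000, 1_000_000):8.2f}")
```

Running the script with `--dry-run` first shows what would be processed before any money is spent.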

🧠 What Makes This Different

Regex approach (dumb):

Found: "Alex", "Rob", "Sarah"

LLM approach (intelligent):

• Alexandra Chen (TechVentures Inc) ⭐
  Relationship: Engineering Manager - key project sponsor
  Context: Collaborated on microservices migration
  Sentiment: positive

• Robert Martinez (DataFlow Systems)
  Relationship: Former colleague, now at different company
  Context: Occasional meetups to discuss industry trends
  Sentiment: neutral

• Dr. Sarah Williams (Academic Consulting)
  Relationship: Mentor - career guidance advisor
  Context: Quarterly check-ins on professional development
  Importance: high
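
One way to picture the structured record behind each entry is a small dataclass. This is a hypothetical sketch: the `Contact` class and its field names simply mirror the labels in the example above and are not the script's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    """Hypothetical record; field names mirror the labels in the example."""
    name: str
    organization: Optional[str] = None
    relationship: Optional[str] = None
    context: Optional[str] = None
    sentiment: Optional[str] = None
    importance: Optional[str] = None
    starred: bool = False

    def render(self) -> str:
        """Format the contact like an entry in the text report."""
        header = f"• {self.name}"
        if self.organization:
            header += f" ({self.organization})"
        if self.starred:
            header += " ⭐"
        lines = [header]
        # Only fields that were actually extracted are printed.
        for label in ("relationship", "context", "sentiment", "importance"):
            value = getattr(self, label)
            if value:
                lines.append(f"  {label.capitalize()}: {value}")
        return "\n".join(lines)


print(Contact(
    name="Robert Martinez",
    organization="DataFlow Systems",
    relationship="Former colleague, now at different company",
    sentiment="neutral",
).render())
```

A regex can only fill in `name`; every other field needs something that reads the surrounding conversation.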

🚀 Quick Start

1. Export Your Claude Data

Go to claude.ai
Click profile → Settings → Privacy
Click Export data
Save the JSON to your private location (Google Drive, etc.)

2. Clone This Repo

git clone https://github.com/dzivkovi/claude-mining.git
cd claude-mining

3. Install Dependencies & Set API Key

# For Gemini (recommended - best value)
pip install google-genai
export GOOGLE_API_KEY="your-key-from-aistudio.google.com"

# For Claude (optional - if you prefer Anthropic models)
pip install anthropic
export ANTHROPIC_API_KEY="your-key-from-console.anthropic.com"

4. Run Intelligent Extraction

# Recommended: Gemini 3 Flash (best value, ~$7 for 740 conversations)
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json -m gemini-3-flash

# Preview what will be processed (no API calls)
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json --dry-run

# Claude Opus (strongest reasoning, ~$444 for 740 conversations)
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json -m opus

# Test on a small batch first
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json --limit 10

5. Deduplicate & Clean Contacts

After extraction, clean up duplicates and relationship field clutter:

# Install fuzzy matching library
pip install rapidfuzz

# Dry run - preview what will be merged (no changes)
python scripts/deduplicate_contacts.py contacts.json --dry-run --remove-celebrities --remove-self

# Interactive mode - review each merge candidate
python scripts/deduplicate_contacts.py contacts.json --remove-celebrities --remove-self

# Auto-merge high confidence (>0.97), human review for lower scores
python scripts/deduplicate_contacts.py contacts.json --auto-merge 0.97 --remove-celebrities --remove-self

What it does:

Removes celebrity references (Gordon Ramsay, Elon Musk, etc.)
Removes self-references (your own name as "author")
Merges duplicates: "Sarah" + "Sara" + "Sarah Z." → one contact
Cleans relationship fields: 100+ concatenated roles → top 5
Fixes category errors (business contacts in "Family")
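
The merge step can be pictured as scoring name pairs and routing them by threshold. The sketch below uses the standard library's `difflib` as a stand-in for rapidfuzz so it runs without extra installs; the function names and the 0.80 review cutoff are assumptions, and only the 0.97 auto-merge value comes from the command above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; stand-in for a rapidfuzz score."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def classify_pair(a: str, b: str, auto_merge: float = 0.97, review: float = 0.80) -> str:
    """Decide what to do with a candidate duplicate pair.

    `auto_merge` mirrors the --auto-merge flag; `review` is an assumed lower
    bound below which the names are treated as different people.
    """
    score = similarity(a, b)
    if score > auto_merge:
        return "merge"      # confident enough to merge without asking
    if score >= review:
        return "review"     # close, so a human decides
    return "distinct"


for pair in [("Sarah", "sarah "), ("Sarah", "Sara"), ("Sarah", "Robert")]:
    print(pair, classify_pair(*pair))
```

Here "Sarah" vs "Sara" scores about 0.89, so it goes to human review rather than being merged automatically. That is the point of keeping the auto-merge threshold high.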

6. Review Output

✅ Report saved: claude_export_contacts_report.txt
✅ JSON saved: claude_export_contacts.json

👥 For Other Users

The --remove-self feature has defaults for the original author's name. Override with your own:

python scripts/deduplicate_contacts.py contacts.json \
  --user-name "YourFirstName" \
  --user-last-name "YourLastName" \
  --remove-self

All other features work without modification.

📁 Project Structure

claude-mining/
├── README.md                    # You are here
├── ROADMAP.md                   # Planned features and research
├── .gitignore                   # Protects your data from commits
├── scripts/
│   ├── intelligent_contacts.py  # 🧠 LLM-powered extraction
│   ├── deduplicate_contacts.py  # 🔗 Entity resolution & cleanup
│   ├── holiday_contacts.py      # (Deprecated) Regex fallback
│   └── common.py                # Shared utilities
├── docs/
│   ├── data_format.md           # Claude export format reference
│   └── adr/                     # Architecture Decision Records
└── work/                        # Session notes (gitignored)

💡 How It Works

Load your Claude export (JSON with all conversations)
Filter conversations likely to contain contact mentions
Extract contacts one conversation at a time using tool/function calling
LLM understands context, relationships, sentiment (Gemini or Claude)
Deduplicate and categorize contacts
Output both human-readable report (.txt) and structured data (.json)
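
The steps above can be sketched as a small pipeline. Everything here is illustrative: the cue words, the `chat_messages`/`text` field names, and the `fake_extract` stand-in for the LLM tool call are assumptions, not the script's real code.

```python
import json
from typing import Callable, Iterable

# Hypothetical cue words for the pre-filter; the real heuristic is not shown in this README.
CONTACT_HINTS = ("met with", "colleague", "manager", "friend", "email")


def conversation_text(conv: dict) -> str:
    """Join message texts; assumes a 'chat_messages' list with 'text' fields."""
    return "\n".join(m.get("text", "") for m in conv.get("chat_messages", []))


def likely_has_contacts(conv: dict) -> bool:
    """Cheap pre-filter so conversations without people never reach the LLM."""
    text = conversation_text(conv).lower()
    return any(hint in text for hint in CONTACT_HINTS)


def run_pipeline(conversations: Iterable[dict],
                 extract: Callable[[str], list]) -> list:
    """Filter, extract one conversation at a time, then drop exact-name repeats."""
    merged = {}
    for conv in conversations:
        if not likely_has_contacts(conv):
            continue
        for contact in extract(conversation_text(conv)):
            merged.setdefault(contact["name"].strip().lower(), contact)
    return list(merged.values())


def fake_extract(text: str) -> list:
    """Placeholder for the LLM tool call; returns a fixed fake contact."""
    return [{"name": "Alexandra Chen", "relationship": "Engineering Manager"}]


demo = [
    {"chat_messages": [{"text": "I met with my manager Alexandra today."}]},
    {"chat_messages": [{"text": "How do I reverse a list in Python?"}]},
]
print(json.dumps(run_pipeline(demo, fake_extract), indent=2))
```

Passing the extractor in as a function is what lets the same loop work with either provider, and it makes the pipeline testable without API calls.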

🔒 Security Model

┌─────────────────────────────────────────────────────────┐
│  PUBLIC (GitHub)              PRIVATE (Google Drive)    │
│  ─────────────────            ──────────────────────    │
│  • Python scripts             • claude_export.json      │
│  • Documentation              • Output reports          │
│  • .gitignore                 • Personal data           │
└─────────────────────────────────────────────────────────┘

The .gitignore file prevents accidental commits of:

*.json (your export files)
*_contacts.txt (output with names)
data/ folder
Any file with "export" in the name
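
For a rough sanity check of which filenames these patterns cover, the rules can be approximated with `fnmatch`. The pattern strings below are paraphrased from the list above and the helper name is made up; real gitignore matching has more rules, so `git check-ignore -v <path>` remains the authoritative test.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns paraphrased from the list above; the actual .gitignore may differ.
FILE_PATTERNS = ["*.json", "*_contacts.txt", "*export*"]
IGNORED_DIRS = {"data"}


def would_be_ignored(path: str) -> bool:
    """Approximate check: does any listed pattern cover this path?"""
    p = PurePosixPath(path)
    # Anything inside an ignored folder is covered regardless of its name.
    if IGNORED_DIRS & set(p.parts[:-1]):
        return True
    return any(fnmatchcase(p.name, pattern) for pattern in FILE_PATTERNS)


for candidate in ["claude_export.json", "my_contacts.txt", "data/notes.md", "scripts/common.py"]:
    print(f"{candidate:22s} ignored={would_be_ignored(candidate)}")
```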

🛠️ Development

Adding a New Script

Create scripts/your_script.py
Use shared utilities from common.py
Follow the pattern in existing scripts
Update this README

Using Claude Code

# Start Claude Code in the repo
cd claude-mining
claude

# Ask Claude to help
> Add a script that extracts all the technical topics I discussed

📊 Example Output

Note: The examples below use obviously fake data. Your actual output will contain contacts extracted from your conversations.

============================================================
🎄 INTELLIGENT CONTACT EXTRACTION REPORT 🎄
Generated: 2025-12-30 15:42
Total contacts found: 23
============================================================

## Work Colleagues (8)
  • Alexandra Chen (TechVentures Inc) ⭐
    Relationship: senior developer
    Context: Collaborated on microservices architecture
    Sentiment: positive
    Contact: alex.chen@techventures.example

  • Michael Rodriguez (DataFlow Systems)
    Relationship: project manager
    Importance: medium

## Family (4)
  • Mom ⭐
    Relationship: mother
    Importance: high

  • Jamie
    Relationship: younger sibling
    Context: Planning family reunion

## Professional Network (6)
  • Dr. Sarah Williams ⭐
    Relationship: mentor
    Context: Career guidance and technical discussions
    Importance: high

## Recruiters (3)
  • Jennifer Walsh (TalentSearch Partners)
    Relationship: technical recruiter
    Contact: jwalsh@talentsearch.example

============================================================
## 🎄 HOLIDAY GREETING LIST (High + Medium)
============================================================
  ☐ Alexandra Chen - TechVentures Inc
  ☐ Mom
  ☐ Dr. Sarah Williams
  ☐ Michael Rodriguez - DataFlow Systems
  ☐ Jamie

💡 Review and add anyone you remember!
============================================================

Total: 23 contacts across 5 categories
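
The greeting list at the end of the report is a filter-and-sort over the extracted contacts. Below is a minimal sketch, assuming each contact carries an explicit `importance` value; the dictionary layout and the fake entries are illustrative only.

```python
# Fake contacts in the same spirit as the report above.
contacts = [
    {"name": "Michael Rodriguez", "organization": "DataFlow Systems", "importance": "medium"},
    {"name": "Jennifer Walsh", "organization": "TalentSearch Partners", "importance": "low"},
    {"name": "Mom", "organization": None, "importance": "high"},
]

RANK = {"high": 0, "medium": 1}  # anything else is left off the list


def greeting_list(items: list) -> list:
    """Return checklist lines for high and medium contacts, high first."""
    selected = [c for c in items if c.get("importance") in RANK]
    # sorted() is stable, so input order is kept within each rank.
    selected = sorted(selected, key=lambda c: RANK[c["importance"]])
    lines = []
    for c in selected:
        suffix = f" - {c['organization']}" if c.get("organization") else ""
        lines.append(f"  ☐ {c['name']}{suffix}")
    return lines


print("\n".join(greeting_list(contacts)))
```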

💻 Code Quality

This project emphasizes professional code quality:

Type hints throughout for better IDE support and type safety
Structured logging with configurable verbosity
Comprehensive error handling for file I/O, JSON parsing, and API errors
Multi-provider support - same CLI works with Gemini or Claude
Tool/function calling - structured extraction, no JSON parsing failures
Checkpoint/resume - interrupted runs can continue where they left off
Argparse CLI with helpful flags (-m, -o, --limit, --start, -v)
Proper exit codes for scripting and automation
Privacy-first design with robust .gitignore patterns
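
Checkpoint/resume is the item that matters most on a long, paid run. Here is a minimal sketch of the idea, assuming progress is stored as a JSON file with a `next_index` counter; the file layout and function names are hypothetical, and the LLM call is replaced by a stand-in string.

```python
import json
import os
import tempfile
from typing import Optional


def load_checkpoint(path: str) -> dict:
    """Return saved progress, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {"next_index": 0, "contacts": []}


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def process(conversations: list, path: str, stop_after: Optional[int] = None) -> dict:
    """Process conversations from the checkpointed index, saving after each one."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(conversations)):
        if stop_after is not None and i >= stop_after:
            break  # simulate an interrupted run
        state["contacts"].append(f"contact-from-{conversations[i]}")  # stand-in for the LLM call
        state["next_index"] = i + 1
        save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "checkpoint.json")
    convs = ["a", "b", "c", "d"]
    first = process(convs, ckpt, stop_after=2)  # interrupted after two conversations
    resumed = process(convs, ckpt)              # picks up at index 2
    print(first["next_index"], resumed["contacts"])
```

Saving after every conversation costs a little I/O but means an interruption never re-bills conversations that were already processed.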

🤝 Contributing

PRs welcome! Ideas for new mining scripts:

Meeting/appointment extractor
Code snippet collector
Decision log builder
Learning journal generator

📜 License

MIT - Do whatever you want, just don't blame me.

Created by Daniel Zivkovic / Magma Inc. Powered by Gemini 🚀 and Claude 🤖
