LLM-powered extraction of insights from your Claude conversation history.
⚠️ PRIVACY FIRST - READ THIS: This repo contains scripts only. Your exported data should stay in a private location (Google Drive, local folder) and NEVER be committed to this repository. All examples in this README use obviously fake data.
This is NOT regex matching. It uses LLMs to understand relationships, context, and importance, much as a human assistant would when reading your conversations.
- Export: claude.ai → Settings → Privacy → Export data
- Install: `pip install google-genai rapidfuzz`
- Extract: `python scripts/intelligent_contacts.py your_export.json --limit 10`
- Clean: `python scripts/deduplicate_contacts.py contacts.json --auto-merge 0.95`
- Review: Open `contacts.txt` → prioritized list for New Year greetings!
Cost: ~$5 for 740 conversations (Gemini 3 Flash)
| Script | Purpose | Method |
|---|---|---|
| `intelligent_contacts.py` | Extract & classify people from conversations | Multi-LLM (Gemini/Claude) 🧠 |
| `deduplicate_contacts.py` | Entity resolution & contact cleanup | Fuzzy matching + human review |
See ROADMAP.md for planned features and research.
Contact extraction is a needle-in-haystack problem - finding names, relationships, and context scattered across hundreds of conversations. After extensive benchmarking, we discovered that Gemini 3 Flash dramatically outperforms Claude models for this specific task at a fraction of the cost.
| Model | Contacts Found | Time | Cost (10 convs) | Cost (740 convs) |
|---|---|---|---|---|
| Claude Sonnet | 7 | ~2 min | ~$1.50 | ~$111 |
| Claude Opus | 45 | ~5 min | ~$6.00 | ~$444 |
| Gemini 3 Flash | 58 | 74 sec | ~$0.10 | ~$7.40 |
Note: Costs are estimates based on API pricing. Actual cost for 740 conversations was $4.45 - lower due to pre-filtering (conversations without contact mentions) and actual token consumption. See ROADMAP.md for real-world results.
- Gemini 3 Flash found 29% more contacts than Opus (58 vs 45)
- 60x cheaper than Claude Opus ($7.40 vs $444 for full extraction)
- About 4x faster than Opus (74 sec vs ~5 min)
- Gemini excels at needle-in-haystack tasks (a pattern I first observed during my time at NASDAQ)
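The ratios above follow from the benchmark table. Here is a minimal sketch that recomputes them, assuming the 10-conversation costs scale linearly to 740 conversations and reading "~2 min" and "~5 min" as 120 and 300 seconds. The `BENCHMARK` dictionary and both function names are illustrative and not part of the repo's scripts.

```python
# Figures copied from the benchmark table above (10-conversation run).
# All names here are illustrative; nothing is imported from the repo.
BENCHMARK = {
    "Claude Sonnet": {"contacts": 7, "seconds": 120, "cost_10": 1.50},
    "Claude Opus": {"contacts": 45, "seconds": 300, "cost_10": 6.00},
    "Gemini 3 Flash": {"contacts": 58, "seconds": 74, "cost_10": 0.10},
}


def extrapolate_cost(cost_10: float, conversations: int = 740) -> float:
    """Scale the 10-conversation cost linearly to a larger export."""
    return cost_10 * conversations / 10


def compare(candidate: str, baseline: str) -> dict:
    """Return recall gain, cost ratio, and speedup of candidate vs baseline."""
    c, b = BENCHMARK[candidate], BENCHMARK[baseline]
    return {
        "more_contacts_pct": round(100 * (c["contacts"] - b["contacts"]) / b["contacts"]),
        "times_cheaper": round(b["cost_10"] / c["cost_10"]),
        "times_faster": round(b["seconds"] / c["seconds"], 1),
    }


print(compare("Gemini 3 Flash", "Claude Opus"))
for model, row in BENCHMARK.items():
    print(f"{model}: ~${extrapolate_cost(row['cost_10']):.2f} for 740 conversations")
```

Linear scaling is an upper-bound style estimate: as the note under the table says, pre-filtering brought the real bill below the extrapolated figure.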
This task requires scanning large amounts of text to find scattered mentions of people - exactly the "needle in haystack" problem Gemini models are optimized for. Claude's strength in nuanced reasoning doesn't provide much advantage when the task is fundamentally about recall and pattern detection.
| Model | Input | Output | Best For |
|---|---|---|---|
| Gemini 3 Flash | $0.10/1M | $0.40/1M | Recommended - best value |
| Gemini 3 Pro | $0.50/1M | $2.00/1M | Higher quality, still cheap |
| Claude Haiku | $0.25/1M | $1.25/1M | Not recommended for this task |
| Claude Sonnet | $3.00/1M | $15.00/1M | Good general purpose |
| Claude Opus | $15.00/1M | $75.00/1M | Best reasoning, overkill here |
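To turn the per-token prices into a budget, multiply them by an expected token volume. The sketch below uses the prices from the table above; the dictionary keys are illustrative labels (not necessarily valid `-m` values), and the 40M input / 2M output token workload is a made-up assumption, not a measurement from this project.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
# The keys are illustrative labels, not necessarily valid `-m` values.
PRICES = {
    "gemini-3-flash": (0.10, 0.40),
    "gemini-3-pro": (0.50, 2.00),
    "claude-haiku": (0.25, 1.25),
    "claude-sonnet": (3.00, 15.00),
    "claude-opus": (15.00, 75.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token volume."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 40M input tokens and 2M output tokens.
for model in PRICES:
    print(f"{model:15s} ${estimate_cost(model, 40_000_000, 2_000_000):,.2f}")
```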
```
Regex approach (dumb):
  Found: "Alex", "Rob", "Sarah"

LLM approach (intelligent):
  • Alexandra Chen (TechVentures Inc) ⭐
      Relationship: Engineering Manager - key project sponsor
      Context: Collaborated on microservices migration
      Sentiment: positive
  • Robert Martinez (DataFlow Systems)
      Relationship: Former colleague, now at different company
      Context: Occasional meetups to discuss industry trends
      Sentiment: neutral
  • Dr. Sarah Williams (Academic Consulting)
      Relationship: Mentor - career guidance advisor
      Context: Quarterly check-ins on professional development
      Importance: high
```
- Go to claude.ai
- Click profile → Settings → Privacy
- Click Export data
- Save the JSON to your private location (Google Drive, etc.)
```bash
git clone https://github.com/dzivkovi/claude-mining.git
cd claude-mining
```
```bash
# For Gemini (recommended - best value)
pip install google-genai
export GOOGLE_API_KEY="your-key-from-aistudio.google.com"

# For Claude (optional - if you prefer Anthropic models)
pip install anthropic
export ANTHROPIC_API_KEY="your-key-from-console.anthropic.com"
```
```bash
# Recommended: Gemini 3 Flash (best value, ~$7 for 740 conversations)
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json -m gemini-3-flash

# Preview what will be processed (no API calls)
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json --dry-run

# Claude Opus (best reasoning, ~$444 for 740 conversations)
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json -m opus

# Test on a small batch first
python scripts/intelligent_contacts.py ~/GoogleDrive/claude_export.json --limit 10
```
After extraction, clean up duplicates and relationship field clutter:
```bash
# Install fuzzy matching library
pip install rapidfuzz

# Dry run - preview what will be merged (no changes)
python scripts/deduplicate_contacts.py contacts.json --dry-run --remove-celebrities --remove-self

# Interactive mode - review each merge candidate
python scripts/deduplicate_contacts.py contacts.json --remove-celebrities --remove-self

# Auto-merge high confidence (>0.97), human review for lower scores
python scripts/deduplicate_contacts.py contacts.json --auto-merge 0.97 --remove-celebrities --remove-self
```
What it does:
- Removes celebrity references (Gordon Ramsay, Elon Musk, etc.)
- Removes self-references (your own name as "author")
- Merges duplicates: "Sarah" + "Sara" + "Sarah Z." β one contact
- Cleans relationship fields: 100+ concatenated roles β top 5
- Fixes category errors (business contacts in "Family")
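The split between auto-merge and human review can be illustrated with a small sketch. It is not the logic of `deduplicate_contacts.py`: it swaps in the standard library's `difflib.SequenceMatcher` for rapidfuzz so it runs without extra installs, and the `review` threshold of 0.75 and the function names are assumptions. Only the 0.97 auto-merge value comes from the command above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (difflib stand-in for rapidfuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(names: list[str], auto_merge: float = 0.97, review: float = 0.75):
    """Split name pairs into auto-merge and human-review candidates."""
    auto, needs_review = [], []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = similarity(a, b)
            if score >= auto_merge:
                auto.append((a, b, round(score, 2)))
            elif score >= review:
                needs_review.append((a, b, round(score, 2)))
    return auto, needs_review


# Obviously fake names, in the spirit of the rest of this README.
auto, needs_review = triage(["Sarah", "sarah", "Sara", "Sarah Z.", "Robert Martinez"])
print("auto-merge:", auto)
print("review:", needs_review)
```

Character-level similarity alone cannot tell whether "Sarah" and "Sarah Z." are the same person, which is why pairs below the auto-merge threshold go to a human.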
```
✅ Report saved: claude_export_contacts_report.txt
✅ JSON saved: claude_export_contacts.json
```
The `--remove-self` feature has defaults for the original author's name. Override with your own:
```bash
python scripts/deduplicate_contacts.py contacts.json \
  --user-name "YourFirstName" \
  --user-last-name "YourLastName" \
  --remove-self
```
All other features work without modification.
```
claude-mining/
├── README.md                     # You are here
├── ROADMAP.md                    # Planned features and research
├── .gitignore                    # Protects your data from commits
├── scripts/
│   ├── intelligent_contacts.py   # 🧠 LLM-powered extraction
│   ├── deduplicate_contacts.py   # Entity resolution & cleanup
│   ├── holiday_contacts.py       # (Deprecated) Regex fallback
│   └── common.py                 # Shared utilities
├── docs/
│   ├── data_format.md            # Claude export format reference
│   └── adr/                      # Architecture Decision Records
└── work/                         # Session notes (gitignored)
```
- Load your Claude export (JSON with all conversations)
- Filter conversations likely to contain contact mentions
- Extract contacts one conversation at a time using tool/function calling
- LLM understands context, relationships, sentiment (Gemini or Claude)
- Deduplicate and categorize contacts
- Output both human-readable report (.txt) and structured data (.json)
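The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the code in `intelligent_contacts.py`: the export is assumed to be a JSON list of conversations with `chat_messages` entries that carry a `text` field, the cue-word filter is hypothetical, and the LLM call is passed in as a plain function so the sketch runs offline.

```python
import json
import re
from pathlib import Path
from typing import Callable

# Hypothetical cue words; the real pre-filter may use different signals.
CONTACT_CUES = re.compile(r"\b(colleague|manager|mentor|recruiter|friend|mom|dad)\b", re.I)


def load_export(path: str) -> list[dict]:
    """Step 1: load the export (assumed to be a JSON list of conversations)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def conversation_text(conv: dict) -> str:
    """Join message texts (assumed field names: chat_messages, text)."""
    return "\n".join(m.get("text", "") for m in conv.get("chat_messages", []))


def likely_has_contacts(conv: dict) -> bool:
    """Step 2: cheap pre-filter so irrelevant chats never reach the LLM."""
    return bool(CONTACT_CUES.search(conversation_text(conv)))


def run_pipeline(conversations: list[dict], extract: Callable[[str], list[dict]]) -> list[dict]:
    """Steps 3-5: extract per conversation, then merge by lowercased name."""
    merged: dict[str, dict] = {}
    for conv in filter(likely_has_contacts, conversations):
        for contact in extract(conversation_text(conv)):
            merged.setdefault(contact["name"].strip().lower(), contact)
    return list(merged.values())


def fake_llm(text: str) -> list[dict]:
    """Stand-in for the tool-calling LLM step; returns fake data only."""
    return [{"name": "Alexandra Chen"}] if "alexandra" in text.lower() else []


demo = [
    {"chat_messages": [{"text": "My manager Alexandra Chen approved the plan"}]},
    {"chat_messages": [{"text": "How do I sort a list in Python?"}]},
]
print(run_pipeline(demo, fake_llm))
```

Passing the extractor in as a function mirrors the multi-provider idea: the surrounding loop stays the same whether Gemini or Claude does the extraction.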
```
┌─────────────────────────────────────────────────────────┐
│  PUBLIC (GitHub)              PRIVATE (Google Drive)    │
│  ───────────────              ──────────────────────    │
│  • Python scripts             • claude_export.json      │
│  • Documentation              • Output reports          │
│  • .gitignore                 • Personal data           │
└─────────────────────────────────────────────────────────┘
```
The .gitignore file prevents accidental commits of:
- `*.json` (your export files)
- `*_contacts.txt` (output with names)
- `data/` folder
- Any file with "export" in the name
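As a quick sanity check before committing, a filename can be tested against patterns like the ones listed above. This is a rough sketch only: the patterns are paraphrased from this README rather than read from the actual `.gitignore`, `fnmatch` does not implement full gitignore semantics, and `looks_private` is a hypothetical helper. `git check-ignore -v <path>` remains the authoritative check.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Paraphrased from the list above; the real .gitignore may differ.
PRIVATE_PATTERNS = ["*.json", "*_contacts.txt", "*export*"]
PRIVATE_DIRS = {"data"}


def looks_private(path: str) -> bool:
    """Return True if the path resembles something that must stay out of git."""
    p = PurePosixPath(path)
    if PRIVATE_DIRS & set(p.parts[:-1]):  # any parent folder named data/
        return True
    return any(fnmatch(p.name, pattern) for pattern in PRIVATE_PATTERNS)


for candidate in ["claude_export.json", "out/alex_contacts.txt", "data/notes.md", "scripts/common.py"]:
    print(f"{candidate:25s} private={looks_private(candidate)}")
```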
- Create `scripts/your_script.py`
- Use shared utilities from `common.py`
- Follow the pattern in existing scripts
- Update this README
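A starting point for a new script might look like the sketch below. It is deliberately standalone: the helpers in `common.py` are not documented in this README, so none are imported, and the `chat_messages` and `sender` field names are assumptions about the export format (see `docs/data_format.md` for the real reference). The counting step is a toy placeholder for your own mining logic.

```python
import argparse
import json
import sys
from collections import Counter
from pathlib import Path
from typing import Optional


def count_messages_by_sender(conversations: list[dict]) -> Counter:
    """Toy mining step: count messages per sender across the export."""
    counts: Counter = Counter()
    for conv in conversations:
        for msg in conv.get("chat_messages", []):      # assumed field name
            counts[msg.get("sender", "unknown")] += 1  # assumed field name
    return counts


def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Example mining script")
    parser.add_argument("export", help="Path to the Claude export JSON")
    parser.add_argument("--limit", type=int, default=None,
                        help="Process only the first N conversations")
    args = parser.parse_args(argv)

    try:
        conversations = json.loads(Path(args.export).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        print(f"error: cannot read {args.export}: {exc}", file=sys.stderr)
        return 1  # non-zero exit code so shell scripts can detect failure

    for sender, n in count_messages_by_sender(conversations[: args.limit]).most_common():
        print(f"{sender}: {n}")
    return 0
```

To run it as a script, finish the file with `raise SystemExit(main())` under the usual `if __name__ == "__main__":` guard.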
```bash
# Start Claude Code in the repo
cd claude-mining
claude

# Ask Claude to help
> Add a script that extracts all the technical topics I discussed
```
Note: The examples below use obviously fake data. Your actual output will contain contacts extracted from your conversations.
```
============================================================
INTELLIGENT CONTACT EXTRACTION REPORT
Generated: 2025-12-30 15:42
Total contacts found: 23
============================================================

## Work Colleagues (8)
• Alexandra Chen (TechVentures Inc) ⭐
    Relationship: senior developer
    Context: Collaborated on microservices architecture
    Sentiment: positive
    Contact: alex.chen@techventures.example
• Michael Rodriguez (DataFlow Systems)
    Relationship: project manager
    Importance: medium

## Family (4)
• Mom ⭐
    Relationship: mother
    Importance: high
• Jamie
    Relationship: younger sibling
    Context: Planning family reunion

## Professional Network (6)
• Dr. Sarah Williams ⭐
    Relationship: mentor
    Context: Career guidance and technical discussions
    Importance: high

## Recruiters (3)
• Jennifer Walsh (TalentSearch Partners)
    Relationship: technical recruiter
    Contact: jwalsh@talentsearch.example

============================================================
## HOLIDAY GREETING LIST (High + Medium)
============================================================
☐ Alexandra Chen - TechVentures Inc
☐ Mom
☐ Dr. Sarah Williams
☐ Michael Rodriguez - DataFlow Systems
☐ Jamie

💡 Review and add anyone you remember!
============================================================
Total: 23 contacts across 5 categories
```
This project emphasizes professional code quality:
- Type hints throughout for better IDE support and type safety
- Structured logging with configurable verbosity
- Comprehensive error handling for file I/O, JSON parsing, and API errors
- Multi-provider support - same CLI works with Gemini or Claude
- Tool/function calling - structured extraction, no JSON parsing failures
- Checkpoint/resume - interrupted runs can continue where they left off
- Argparse CLI with helpful flags (`-m`, `-o`, `--limit`, `--start`, `-v`)
- Proper exit codes for scripting and automation
- Privacy-first design with robust .gitignore patterns
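The checkpoint/resume idea from the list above can be sketched in a few lines. This shows one common pattern and is not the repo's implementation: the checkpoint layout, the `uuid` field used to identify conversations, and all function names are assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_checkpoint(path: Path) -> dict:
    """Return saved state, or a fresh one if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"done": [], "contacts": []}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file first so a crash cannot leave a half-written file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(path)


def process_all(conversations: Iterable[dict],
                extract: Callable[[dict], list],
                checkpoint: Path) -> list:
    """Process each conversation once, saving progress after every item."""
    state = load_checkpoint(checkpoint)
    done = set(state["done"])
    for conv in conversations:
        if conv["uuid"] in done:  # assumed identifier field; already processed
            continue
        state["contacts"].extend(extract(conv))
        state["done"].append(conv["uuid"])
        save_checkpoint(checkpoint, state)
    return state["contacts"]
```

Saving after every conversation costs a little I/O but means an interrupted run never pays twice for the same API call.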
PRs welcome! Ideas for new mining scripts:
- Meeting/appointment extractor
- Code snippet collector
- Decision log builder
- Learning journal generator
MIT - Do whatever you want, just don't blame me.
Created by Daniel Zivkovic / Magma Inc. Powered by Gemini and Claude.