
📊 Wayback URL Extractor

Extract all archived URLs from any domain using the Wayback Machine CDX API


🎯 What Does This Do?

Extract every URL ever archived for any domain:

  • 🔗 Complete URL Inventory - All pages, posts, categories
  • 📄 Export Formats - CSV, JSON, TXT
  • 🎯 Smart Filtering - By file type, date range, status codes
  • 📈 SEO Gold Mine - Find old content for recovery
  • ⚡ Fast Extraction - Parallel processing

Perfect for:

  • πŸ” SEO Agencies - Content audits & recovery
  • πŸ“ Content Strategists - Historical content mapping
  • πŸ’Ό Business Owners - Recovering lost pages
  • πŸŽ“ Researchers - URL dataset creation
  • βš–οΈ Legal Teams - Evidence collection

🚀 Quick Start

For Non-Technical Users

  1. Download this repository
  2. Run extractor.py by double-clicking it or launching it from a terminal (requires Python 3)
  3. Enter your domain when prompted
  4. Get your URLs in a CSV file!

For Technical Users

# Basic extraction
python extractor.py example.com

# With filters
python extractor.py example.com --format csv --filter "*.html"

# Date range
python extractor.py example.com --from 2020 --to 2023

💻 Installation

# Clone this repository
git clone https://github.com/waybackrevive/wayback-url-extractor.git
cd wayback-url-extractor

# Install dependencies
pip install -r requirements.txt

📖 Usage Examples

Basic URL Extraction

python extractor.py example.com

Filter by File Type

# Only HTML pages
python extractor.py example.com --filter "*.html"

# Only images
python extractor.py example.com --filter "*.jpg,*.png,*.gif"

# Only PDFs
python extractor.py example.com --filter "*.pdf"
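Comma-separated glob patterns like the ones above can be matched against URL paths with Python's standard fnmatch module. The matches_filter helper below is an illustrative sketch of that logic, not the actual code in extractor.py:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def matches_filter(url: str, patterns: str) -> bool:
    """Hypothetical helper: True if the URL path matches any of the
    comma-separated glob patterns (e.g. "*.jpg,*.png,*.gif")."""
    path = urlparse(url).path
    return any(fnmatch(path, p.strip()) for p in patterns.split(","))

print(matches_filter("http://example.com/about.html", "*.html"))           # True
print(matches_filter("http://example.com/logo.png", "*.jpg,*.png,*.gif"))  # True
print(matches_filter("http://example.com/doc.pdf", "*.html"))              # False
```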

Date Range Extraction

# From specific year
python extractor.py example.com --from 2020

# Date range
python extractor.py example.com --from 2018 --to 2022

Export Formats

# CSV (default - best for Excel)
python extractor.py example.com --format csv

# JSON (for developers)
python extractor.py example.com --format json

# Plain text (simple list)
python extractor.py example.com --format txt
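All three formats can be produced with the standard library alone. The export function below is a sketch of what such an exporter might look like; the real extractor.py may be organized differently:

```python
import csv
import json

def export(rows, path, fmt="csv"):
    # rows: list of dicts with url / timestamp / status_code / mime_type keys.
    if fmt == "csv":
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    elif fmt == "json":
        with open(path, "w") as f:
            json.dump(rows, f, indent=2)
    elif fmt == "txt":
        # Plain text: one URL per line.
        with open(path, "w") as f:
            f.write("\n".join(r["url"] for r in rows) + "\n")
    else:
        raise ValueError(f"unknown format: {fmt}")
```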

Status Code Filtering

# Only successful pages (200 OK)
python extractor.py example.com --status 200

# Include redirects
python extractor.py example.com --status "200,301,302"

📊 Output Example

CSV Output: (opens in Excel/Google Sheets)

url,timestamp,status_code,mime_type
http://example.com/,19961231235959,200,text/html
http://example.com/about,19970115120000,200,text/html
http://example.com/contact,19970203093000,200,text/html
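The timestamp column is the Wayback Machine's 14-digit YYYYMMDDhhmmss capture time, so each row parses with nothing more than the csv module and datetime.strptime. A short sketch using the sample rows above:

```python
import csv
import io
from datetime import datetime

sample = """url,timestamp,status_code,mime_type
http://example.com/,19961231235959,200,text/html
http://example.com/about,19970115120000,200,text/html
"""

for row in csv.DictReader(io.StringIO(sample)):
    # Wayback timestamps are YYYYMMDDhhmmss (UTC).
    captured = datetime.strptime(row["timestamp"], "%Y%m%d%H%M%S")
    print(captured.date(), row["url"])
# 1996-12-31 http://example.com/
# 1997-01-15 http://example.com/about
```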

Statistics Report:

πŸ” Extracting URLs from: example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Extraction Complete!

📊 Summary:
   Total URLs: 12,456
   Unique URLs: 8,234
   Duplicates: 4,222
   Date Range: 1996-2026
   
📁 File Types:
   HTML:  5,234 (63.6%)
   Images: 2,101 (25.5%)
   CSS:    567 (6.9%)
   JS:     332 (4.0%)
   
💾 Saved to: example_com_urls.csv

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️ Free Version Limitations

This tool is powerful, but the free version has limits:

  • ✅ Extract up to 50,000 URLs per domain
  • ✅ Basic filtering and export
  • ❌ No bulk domain processing
  • ❌ No content recovery
  • ❌ No advanced deduplication
  • ❌ No automatic content fetching
  • ❌ No database reconstruction
  • ❌ No broken link fixing

🚀 Need Professional Recovery?

Extracting URLs is just the first step. We can restore everything.

Our Professional Services Include:

✨ Complete Website Restoration

  • Full content recovery (10,000+ pages)
  • All assets (images, videos, documents)
  • Database reconstruction
  • Working contact forms

✨ SEO-Optimized Migration

  • 301 redirects setup
  • Metadata preservation
  • Search engine resubmission
  • Sitemap regeneration

✨ Advanced Recovery

  • Dynamic content restoration
  • Custom functionality
  • E-commerce recovery
  • Membership site restoration

🎯 Perfect for SEO Agencies:

  • White-label services available
  • Bulk domain processing
  • Priority support
  • Custom solutions

📧 Email: support@waybackrevive.com
💬 Chat: Available on website


πŸ› οΈ Advanced Features

Command-Line Options

python extractor.py DOMAIN [OPTIONS]

Options:
  --format FORMAT      Output format: csv, json, txt (default: csv)
  --output FILE        Output filename (auto-generated if not specified)
  --filter PATTERN     Filter by pattern: *.html, *.pdf, etc.
  --from YEAR          Start year (e.g., 2020)
  --to YEAR            End year (e.g., 2023)
  --status CODES       Filter by status codes: 200, 301, etc.
  --limit NUMBER       Max URLs to extract (default: 50000)
  --no-duplicates      Remove duplicate URLs
  --verbose            Show detailed progress
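Wiring these flags up with argparse is straightforward. The parser below mirrors the option table as a sketch; names such as build_parser are illustrative and may not match the real extractor.py:

```python
import argparse

def build_parser():
    # Illustrative parser mirroring the options table above.
    p = argparse.ArgumentParser(prog="extractor.py")
    p.add_argument("domain", help="Domain to extract, e.g. example.com")
    p.add_argument("--format", choices=["csv", "json", "txt"], default="csv")
    p.add_argument("--output", help="Output filename (auto-generated if omitted)")
    p.add_argument("--filter", dest="filter_pattern", help='e.g. "*.html"')
    p.add_argument("--from", dest="from_year", type=int)  # "from" is a Python keyword
    p.add_argument("--to", dest="to_year", type=int)
    p.add_argument("--status", help='e.g. "200,301,302"')
    p.add_argument("--limit", type=int, default=50000)
    p.add_argument("--no-duplicates", action="store_true")
    p.add_argument("--verbose", action="store_true")
    return p

args = build_parser().parse_args(["example.com", "--from", "2020", "--to", "2023"])
print(args.domain, args.from_year, args.to_year)  # example.com 2020 2023
```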

Programmatic Usage

from extractor import WaybackExtractor

# Initialize
extractor = WaybackExtractor('example.com')

# Extract URLs
urls = extractor.extract(
    limit=10000,
    filter_pattern='*.html',
    from_year=2020
)

# Export
extractor.export_csv('output.csv')

🔧 Technical Details

API Integration

  • Uses Wayback Machine CDX Server API
  • Respectful rate limiting
  • Automatic retry on failures
  • Progress tracking
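For reference, the CDX Server API is a plain HTTP endpoint: a GET request to web.archive.org/cdx/search/cdx with url, from, to, and limit parameters returns capture records. The sketch below only builds the query URL (the endpoint and parameter names come from the public CDX API documentation; the helper name is hypothetical):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain, limit=1000, from_year=None, to_year=None):
    # Build a CDX Server API query for every capture under the domain.
    params = {
        "url": f"{domain}/*",          # match all paths under the domain
        "output": "json",
        "fl": "original,timestamp,statuscode,mimetype",
        "limit": limit,
    }
    if from_year is not None:
        params["from"] = from_year     # CDX timestamps may be abbreviated to a year
    if to_year is not None:
        params["to"] = to_year
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

print(cdx_query_url("example.com", limit=10, from_year=2020))
```

Fetching that URL (for example with urllib.request.urlopen) returns a JSON array whose first row is the field header.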

Data Processing

  • Streaming for memory efficiency
  • Duplicate detection
  • URL normalization
  • Pattern matching
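Duplicate detection and URL normalization can be combined: normalize each URL to a canonical key and keep the first capture seen for each key. This is one plausible scheme, sketched here; the real extractor may normalize differently:

```python
from urllib.parse import urlsplit

def normalize(url):
    # Canonical key: lowercase scheme and host, drop the fragment,
    # trim the trailing slash, keep any query string.
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

def dedupe(urls):
    # Keep the first occurrence of each normalized URL, preserving order.
    seen, unique = set(), []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

print(dedupe([
    "http://Example.com/a/",
    "http://example.com/a",   # duplicate of the line above after normalization
    "http://example.com/b",
]))
# ['http://Example.com/a/', 'http://example.com/b']
```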

Privacy & Security

  • ✅ No data stored on our servers
  • ✅ All processing happens locally
  • ✅ Open source & auditable
  • ✅ No tracking or analytics

📈 Use Cases

For SEO Agencies

  1. Content Audit - Find all historical content
  2. Competitor Analysis - See their old pages
  3. Recovery Planning - Identify valuable content
  4. Client Reports - Show archive coverage

For Business Owners

  1. Lost Content - Find deleted pages
  2. Historical URLs - For redirect planning
  3. Archive Inventory - Know what's saved
  4. Recovery Assessment - Plan restoration

For Developers

  1. Data Mining - Extract URL datasets
  2. Archive Research - Historical analysis
  3. Automated Workflows - Bulk processing
  4. Integration - Use as library

🤝 Contributing

We love contributions!

  1. Fork this repository
  2. Create feature branch (git checkout -b feature/amazing)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing)
  5. Open a Pull Request

📜 License

MIT License - Free to use and modify

⭐ Support This Project

If this tool saved you hours of work:

  • ⭐ Star this repository
  • 🐦 Share on social media
  • 💼 Hire us for professional recovery

💡 Tips & Tricks

For Large Sites

  • Use --limit to start small
  • Filter by file type first
  • Process by year ranges
  • Export to JSON for later analysis

For SEO Work

  • Focus on 200 status codes
  • Filter for *.html pages
  • Look for 404s in old archives
  • Compare with current sitemap

For Research

  • Export to JSON for analysis
  • Use date ranges strategically
  • Combine with other datasets
  • Consider API rate limits

Made with ❤️ by WaybackRevive Team
waybackrevive.com | GitHub

Need Help? Contact us at support@waybackrevive.com
