Extract all archived URLs from any domain using the Wayback Machine CDX API
Extract every URL ever archived for any domain:
- Complete URL Inventory - All pages, posts, categories
- Export Formats - CSV, JSON, TXT
- Smart Filtering - By file type, date range, status codes
- SEO Gold Mine - Find old content for recovery
- Fast Extraction - Parallel processing
Perfect for:
- SEO Agencies - Content audits & recovery
- Content Strategists - Historical content mapping
- Business Owners - Recovering lost pages
- Researchers - URL dataset creation
- Legal Teams - Evidence collection
- Download this repository
- Double-click `extractor.py` (or run it in a terminal)
- Enter your domain when prompted
- Get your URLs in a CSV file!
```bash
# Basic extraction
python extractor.py example.com

# With filters
python extractor.py example.com --format csv --filter "*.html"

# Date range
python extractor.py example.com --from 2020 --to 2023
```

Installation:

```bash
# Clone this repository
git clone https://github.com/waybackrevive/wayback-url-extractor.git
cd wayback-url-extractor

# Install dependencies
pip install -r requirements.txt
```

Basic usage:

```bash
python extractor.py example.com
```

Filter by file type:

```bash
# Only HTML pages
python extractor.py example.com --filter "*.html"

# Only images
python extractor.py example.com --filter "*.jpg,*.png,*.gif"

# Only PDFs
python extractor.py example.com --filter "*.pdf"
```

Filter by date:

```bash
# From a specific year
python extractor.py example.com --from 2020

# Date range
python extractor.py example.com --from 2018 --to 2022
```

Choose an export format:

```bash
# CSV (default - best for Excel)
python extractor.py example.com --format csv

# JSON (for developers)
python extractor.py example.com --format json

# Plain text (simple list)
python extractor.py example.com --format txt
```

Filter by status code:

```bash
# Only successful pages (200 OK)
python extractor.py example.com --status 200

# Include redirects
python extractor.py example.com --status "200,301,302"
```

CSV Output (opens in Excel/Google Sheets):
```csv
url,timestamp,status_code,mime_type
http://example.com/,19961231235959,200,text/html
http://example.com/about,19970115120000,200,text/html
http://example.com/contact,19970203093000,200,text/html
```

Statistics Report:
```
Extracting URLs from: example.com
════════════════════════════════════

Extraction Complete!

Summary:
  Total URLs:  12,456
  Unique URLs:  8,234
  Duplicates:   4,222
  Date Range:  1996-2026

File Types:
  HTML:   5,234 (63.5%)
  Images: 2,101 (25.5%)
  CSS:      567 (6.9%)
  JS:       332 (4.1%)

Saved to: example_com_urls.csv
════════════════════════════════════
```
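The `--filter` and `--status` options shown in the examples above amount to glob matching on the URL plus a status-code check. A minimal sketch of that filtering logic, assuming CDX rows as `(url, status)` pairs; the `keep` helper is hypothetical, not the tool's actual code:

```python
from fnmatch import fnmatch

def keep(url: str, status: str, patterns: list, statuses: set) -> bool:
    """Hypothetical filter: True if the URL matches any glob pattern
    (e.g. '*.html') and the capture's status code is wanted."""
    return any(fnmatch(url, pat) for pat in patterns) and status in statuses

rows = [
    ("http://example.com/about.html", "200"),
    ("http://example.com/logo.png", "200"),
    ("http://example.com/old-page.html", "404"),
]
kept = [u for u, s in rows if keep(u, s, ["*.html"], {"200"})]
# kept == ["http://example.com/about.html"]
```

Comma-separated patterns like `"*.jpg,*.png,*.gif"` would simply be split into a list before being passed in.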
This tool is powerful, but the free version has limits:
- ✅ Extract up to 50,000 URLs per domain
- ✅ Basic filtering and export
- ❌ No bulk domain processing
- ❌ No content recovery
- ❌ No advanced deduplication
- ❌ No automatic content fetching
- ❌ No database reconstruction
- ❌ No broken link fixing
Extracting URLs is just the first step. We can restore everything.
Complete Website Restoration
- Full content recovery (10,000+ pages)
- All assets (images, videos, documents)
- Database reconstruction
- Working contact forms
SEO-Optimized Migration
- 301 redirects setup
- Metadata preservation
- Search engine resubmission
- Sitemap regeneration
Advanced Recovery
- Dynamic content restoration
- Custom functionality
- E-commerce recovery
- Membership site restoration
Perfect for SEO Agencies:
- White-label services available
- Bulk domain processing
- Priority support
- Custom solutions
Email: support@waybackrevive.com
Chat: Available on website
```
python extractor.py DOMAIN [OPTIONS]

Options:
  --format FORMAT     Output format: csv, json, txt (default: csv)
  --output FILE       Output filename (auto-generated if not specified)
  --filter PATTERN    Filter by pattern: *.html, *.pdf, etc.
  --from YEAR         Start year (e.g., 2020)
  --to YEAR           End year (e.g., 2023)
  --status CODES      Filter by status codes: 200, 301, etc.
  --limit NUMBER      Max URLs to extract (default: 50000)
  --no-duplicates     Remove duplicate URLs
  --verbose           Show detailed progress
```

Python API:

```python
from extractor import WaybackExtractor

# Initialize
extractor = WaybackExtractor('example.com')

# Extract URLs
urls = extractor.extract(
    limit=10000,
    filter_pattern='*.html',
    from_year=2020
)

# Export
extractor.export_csv('output.csv')
```

How it works:
- Uses the Wayback Machine CDX Server API
- Respectful rate limiting
- Automatic retry on failures
- Progress tracking
- Streaming for memory efficiency
- Duplicate detection
- URL normalization
- Pattern matching
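Each extraction like the ones above maps onto requests to the public CDX Server endpoint. A sketch of building such a request URL with the standard library; the parameter names follow the documented CDX API, and no request is actually sent here:

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX Server endpoint.
BASE = "https://web.archive.org/cdx/search/cdx"

params = {
    "url": "example.com",
    "matchType": "domain",       # include all subdomains and paths
    "output": "json",
    "fl": "original,timestamp,statuscode,mimetype",
    "filter": "statuscode:200",  # server-side status filter
    "from": "2020",
    "to": "2023",
    "collapse": "urlkey",        # collapse duplicate URLs server-side
    "limit": "50000",
}
query_url = BASE + "?" + urlencode(params)
```

A real client would fetch `query_url` in pages (the API supports `page`/`pageSize`), which is where streaming and rate limiting come in.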
- ✅ No data stored on our servers
- ✅ All processing happens locally
- ✅ Open source & auditable
- ✅ No tracking or analytics
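The automatic retry on failures mentioned under how it works is typically an exponential-backoff wrapper around the fetch call. A generic sketch; the function name and delays are illustrative, not the tool's actual implementation:

```python
import time

def fetch_with_retry(fetch, attempts=4, base_delay=1.0):
    """Call fetch(), retrying on transient OS/network errors with
    exponentially growing delays (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)

# Demo: a callable that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient")
    return "ok"

result = fetch_with_retry(flaky, base_delay=0.01)  # "ok" on the 3rd try
```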
- Content Audit - Find all historical content
- Competitor Analysis - See their old pages
- Recovery Planning - Identify valuable content
- Client Reports - Show archive coverage
- Lost Content - Find deleted pages
- Historical URLs - For redirect planning
- Archive Inventory - Know what's saved
- Recovery Assessment - Plan restoration
- Data Mining - Extract URL datasets
- Archive Research - Historical analysis
- Automated Workflows - Bulk processing
- Integration - Use as library
We love contributions!
- Fork this repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing`)
- Open a Pull Request
MIT License - Free to use and modify
If this tool saved you hours of work:
- Star this repository
- Share it on social media
- Hire us for professional recovery
- Use `--limit` to start small
- Filter by file type first
- Process by year ranges
- Export to JSON for later analysis
- Focus on 200 status codes
- Filter for `*.html` pages
- Look for 404s in old archives
- Compare with current sitemap
- Export to JSON for analysis
- Use date ranges strategically
- Combine with other datasets
- Consider API rate limits
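The `--no-duplicates` pass and the URL normalization mentioned earlier can be sketched in a few lines. The specific rules here (lowercasing scheme and host, dropping query strings and fragments, stripping trailing slashes) are illustrative assumptions, not the tool's exact behavior:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Illustrative normalization for dedup: lowercase scheme/host,
    drop query string and fragment, strip a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))

urls = [
    "http://Example.com/about/",
    "http://example.com/about",
    "http://example.com/about?utm_source=x",
]
unique = sorted({normalize(u) for u in urls})
# all three collapse to a single http://example.com/about entry
```

A set comprehension like this is also memory-friendly enough for archives in the tens of thousands of URLs.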
Made with ❤️ by the WaybackRevive Team

waybackrevive.com | GitHub
Need Help? Contact us at support@waybackrevive.com