Skip to content

cjanowski/scrapy_play

Repository files navigation

MTG Scraper

A CLI tool for collecting Magic: The Gathering card prices. Features eBay API integration, advanced web scraping with CAPTCHA solving and proxy rotation, and a beautiful interactive menu interface.

Screenshot 2025-10-27 at 11 00 16 AM Screenshot 2025-10-27 at 11 00 51 AM

Demo Video

Demo_Video_1080p.mov

Features

  • Interactive Menu: Easy-to-use menu interface
  • Playwright Integration: Browser automation for eBay!
  • Adaptive Analyzer: Auto-discovers page structure & selectors
  • Flexible Search: Search by card name, set, type, or custom query
  • API Integration: eBay Finding API support (official)
  • Scrapy Framework: Production-grade spider with advanced features
  • CAPTCHA Solving: Dual-mode (2Captcha API + local ML with TensorFlow)
  • Proxy Rotation: Built-in middleware for rotating IP addresses
  • Database Storage: SQLite integration with SQLAlchemy ORM
  • Data Visualization: Beautiful formatted tables with filtering
  • Beautiful CLI: Colorful ASCII art with styled menus
  • Debug Mode: Auto-screenshots and visible browser option
  • Configurable: Easy configuration for CAPTCHA, proxies, and APIs
  • dbt Analytics: Transform and analyze scraped data with SQL models
  • Data Quality Tests: Automated testing for data integrity
  • Interactive Analytics: View price trends, statistics, and insights

dbt Analytics Integration

Transform your scraped data into actionable insights with built-in dbt (data build tool) support!

What You Get:

  • Price Statistics: Min/max/average prices, median, spreads by card
  • Price Trends: Track price changes over time
  • Top Cards: Ranked by price, volume, and "hotness" score
  • Data Quality Tests: Ensure your data is clean and accurate
  • Interactive Documentation: Auto-generated docs with data lineage

Quick Start:

# Install dbt
./setup_dbt.sh

# Or manually
pip install dbt-core dbt-sqlite

# Run dbt models (from main menu option 7)
python mtgscraper.py
→ Select option 7 (Run dbt Models)

# View analytics (from main menu option 9)
→ Select option 9 (View Analytics Results)

Available Analytics Models:

Model Description
card_price_stats Price statistics by card (min, max, avg, median, spread)
price_trends Daily price changes and trends over time
top_cards Top cards ranked by various metrics
dim_cards Card dimension table with latest info
fct_card_prices All price observations (fact table)

See DBT_README.md for detailed documentation.

Data Collection Methods

Playwright (Default)

  • Browser Automation with real Chrome
    • Adaptive structure analyzer (auto-discovers selectors)
    • Bypasses robots.txt (acts like real user)
    • Handles JavaScript automatically
    • Stealth mode - harder to detect
    • Can run in visible mode to watch it work
    • Auto-screenshots for debugging
    • Multiple search types: Card name, set, type, custom
    • Slower but more effective than Scrapy

Official API

  • eBay Browse API (RESTful)
    • OAuth 2.0 with Client ID + Client Secret
    • Free from eBay Developers Program
    • No blocking, clean JSON responses
    • Modern RESTful API with better rate limits
    • Multiple search types: Card name, set, type, custom

Scrapy Spider

  • Traditional HTTP Scraping
    • Very fast (no browser overhead)
    • Production features: CAPTCHA solving, proxies, AutoThrottle

Recommendation: Use Playwright (option 1) for actual scraping, or API (option 2) for this demo.

Future Sources

  • TCGPlayer API
  • CardKingdom integration
  • StarCityGames
  • Card Market (Europe)

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • macOS, Linux, or Windows

Setup

  1. Clone the repository:
git clone <your-repo-url>
cd scraperTool
  1. Quick Setup (macOS/Linux):
./setup.sh

Manual Setup:

# Create virtual environment
python3 -m venv venv

# Activate on macOS/Linux
source venv/bin/activate

# Activate on Windows
# venv\Scripts\activate

# Install core dependencies (fast)
pip install -r requirements-minimal.txt

# OR install full features (includes TensorFlow - large download)
pip install -r requirements.txt

# Install dbt core
pip install dbt-core dbt-sqlite

Note: TensorFlow is ~500MB. Use requirements-minimal.txt if you don't need local ML CAPTCHA solving.

  1. (Optional) If using Scrapy-Splash for JavaScript rendering:
docker run -p 8050:8050 scrapinghub/splash

Note: Always activate the virtual environment (source venv/bin/activate) before running the scraper!

Configuration (Optional Services)

1. eBay Browse API (Recommended for production):

# Register at https://developer.ebay.com/
# Get your OAuth credentials (Client ID + Client Secret)
export EBAY_CLIENT_ID="your-app-id-here"
export EBAY_CLIENT_SECRET="your-cert-id-here"

2. CAPTCHA Solver (Optional - Choose one):

Option A - 2Captcha (Paid, Reliable):

# Install: pip install 2captcha-python
# Register at https://2captcha.com/ (~$3 per 1000 solves)
export CAPTCHA_API_KEY="your-2captcha-api-key"

Option B - Local ML Solver (Free, Experimental):

# Install: pip install tensorflow opencv-python pillow (large download ~500MB)
# Enable local ML solver
export USE_LOCAL_CAPTCHA=true

Tip: Both are optional! The scraper works without CAPTCHA solving.

3. Rotating Proxies (For large-scale scraping): Get Proxy rotation with Proxy Rot here:

# Create proxies.txt with format: http://user:pass@host:port
export PROXY_LIST="proxies.txt"

Make persistent:

echo 'export EBAY_CLIENT_ID="your-client-id"' >> ~/.zshrc
echo 'export EBAY_CLIENT_SECRET="your-client-secret"' >> ~/.zshrc
echo 'export CAPTCHA_API_KEY="your-key"' >> ~/.zshrc

Use option 6 in the menu to check configuration status!

Usage

Interactive Menu (Recommended)

Simply run the script to launch the beautiful interactive menu:

python mtgscraper.py

The menu provides:

SCRAPING METHODS:

  • 1. Playwright Scraper - Browser automation (With adaptive structure analyzer)
  • 2. eBay API Search - Official API
  • 3. Scrapy Spider - Fast HTTP scraper (blocked by robots.txt)

VIEW & ANALYZE:

  • 4. View Results - Browse collected data in formatted tables
  • 5. View Card Details - See detailed information for specific cards
  • 6. Database Statistics - View stats about your collection

DBT ANALYTICS:

  • 7. Run dbt Models - Transform data into analytics tables
  • 8. Run dbt Tests - Check data quality
  • 9. View Analytics Results - See price trends and statistics
  • 10. Generate dbt Docs - Create interactive documentation

EXPORT & AUTOMATE:

  • 11. Export to CSV - Save results to CSV file
  • 12. Upload to S3 - Upload data to AWS S3
  • 13. Schedule Cron Job - Automate scraping
  • 14. Remove Cron Jobs - Remove scheduled tasks

SETTINGS:

  • 15. Configure Settings - Set up CAPTCHA, proxies, and API keys
  • 16. Clear Database - Remove all data (with confirmation)
  • 0. Exit - Quit the program

Direct Command Mode

You can also use the tool directly from command line for scripting:

# Quick scrape without menu
python mtgscraper.py --card "Black Lotus" --pages 5

# Or with short flags
python mtgscraper.py -c "Lightning Bolt" -p 3

Example Workflows

Method 1: eBay Browse API (Recommended)

  1. Get your OAuth credentials:

    # Register at developer.ebay.com
    # Get Client ID (App ID) and Client Secret (Cert ID)
    export EBAY_CLIENT_ID="your-app-id"
    export EBAY_CLIENT_SECRET="your-cert-id"
  2. Run and search:

    python mtgscraper.py
    # Select option 2 (eBay API Search)
    # Enter card name
    # OAuth token will be obtained automatically
    # View results with option 4

Method 1: Playwright Browser Scraping (Recommended!)

  1. Install Playwright:

    pip install playwright
    playwright install chromium
  2. Run and scrape:

    python mtgscraper.py
    # Select option 1 - Playwright (default!)
    # Choose search type (card name, set, type, custom)
    # Enter your search query
    # Choose headless or visible mode
    # Watch the adaptive analyzer discover page structure!
    # Get actual eBay results!

Search Examples:

  • By card: "Black Lotus", "Lightning Bolt"
  • By set: "Alpha", "Beta", "Modern Masters"
  • By type: "creature", "planeswalker", "instant"
  • Custom: "foil rare mythic"

Method 2: eBay Browse API (Production)

  1. Get OAuth credentials and run:
    export EBAY_CLIENT_ID="your-app-id"
    export EBAY_CLIENT_SECRET="your-cert-id"
    python mtgscraper.py
    # Select option 2 - eBay Browse API
    # OAuth authentication happens automatically
    # Choose search type
    # Get official data

Method 3: Scrapy Spider (Educational)

Shows ethical scraping with Scrapy, but blocked by robots.txt:

# Select option 3
# Gets 0 results (respects robots.txt)
# Demonstrates production Scrapy architecture

Quick Start (No Configuration)

python mtgscraper.py
# Select option 1 (API simulated mode available)
# Or select option 2 (basic scraping without CAPTCHA solver)

Command Line Options

Option Short Description Default
--menu - Use interactive menu True
--card -c Card name for direct scraping -
--pages -p Number of pages to scrape 3

Examples:

# Interactive menu (default)
python mtgscraper.py

# Direct scrape
python mtgscraper.py -c "Force of Will" -p 5

# Skip menu
python mtgscraper.py --no-menu -c "Lightning Bolt"

Project Structure

scraperTool/
├── mtgscraper/
│   ├── __init__.py          # Package initialization
│   ├── settings.py          # Scrapy settings (AutoThrottle, robots.txt)
│   ├── items.py             # Data models for scraped items
│   ├── pipelines.py         # Database pipeline with SQLAlchemy
│   ├── middlewares.py       # CAPTCHA solver & proxy rotation
│   └── spiders/
│       ├── __init__.py      # Spiders package initialization
│       └── ebay_spider.py   # eBay scraping spider
├── mtgscraper.py            # Main CLI entry point (interactive menu)
├── scrapy.cfg               # Scrapy project configuration
├── requirements.txt         # Python dependencies
├── setup.sh                 # Quick setup script
└── README.md                # This file

Technical Details

Technologies Used

  • Playwright: Modern browser automation framework (headless Chrome/Firefox)
  • Scrapy: High-performance web scraping framework
  • SQLAlchemy: SQL toolkit and ORM for database operations
  • Click: Command-line interface creation framework
  • Colorama: Cross-platform colored terminal text
  • Tabulate: Pretty-print tabular data
  • BeautifulSoup4: HTML parsing library
  • 2Captcha-Python: Commercial CAPTCHA solving API client
  • TensorFlow: ML framework for local CAPTCHA solving (optional)
  • OpenCV: Computer vision library for image processing (optional)
  • Scrapy-Splash: JavaScript rendering support (optional)

Database Schema

The SQLite database (mtg_cards.db) contains a single table with the following structure:

Column Type Description
id Integer Primary key
card_name String Name of the card
set_name String Set/edition name
price String Listed price
condition String Card condition
seller String Seller information
url String Link to listing
source String Source site (e.g., "eBay")
timestamp String Scrape timestamp
shipping String Shipping information
buy_it_now Boolean Buy It Now listing flag

Scrapy Configuration

  • User Agent: Modern Chrome browser user agent (looks like real user)
  • Download Delay: 2 seconds between requests (randomized)
  • AutoThrottle: Automatically adjusts speed based on server load
  • Concurrent Requests: Limited to 2 per domain (respectful)
  • Robots.txt: Enabled - respects website policies
  • Cookies: Enabled for session management
  • Retry Logic: Smart retry with backoff for temporary failures

Why These Settings Matter:

  • Mimics human browsing behavior
  • Reduces server load
  • Less likely to be blocked

These settings can be customized in mtgscraper/settings.py.

Development

Adding a New Source

To add a new scraping source:

  1. Create a new spider in mtgscraper/spiders/:
# mtgscraper/spiders/tcgplayer_spider.py
import scrapy
from mtgscraper.items import MtgCardItem

class TcgplayerMtgSpider(scrapy.Spider):
    name = 'tcgplayer'
    allowed_domains = ['tcgplayer.com']
    
    def start_requests(self):
        # Your scraping logic here
        pass
    
    def parse(self, response):
        # Your parsing logic here
        pass
  1. Update the CLI in mtgscraper.py to include the new source:
@click.option('--source', '-s', default='ebay', 
              type=click.Choice(['ebay', 'tcgplayer']), 
              help='Source to scrape from')
  1. Add source-specific configuration to mtgscraper/settings.py if needed.

Customizing Output

To modify the output format, edit the view command in mtgscraper.py. The tabulate library supports multiple table formats:

print(tabulate(rows, headers=headers, tablefmt='fancy_grid'))
# Other formats: 'grid', 'simple', 'plain', 'html', 'latex', etc.

Best Practices

"Good Bot" Behavior

  • Respect robots.txt: Always obey website rules (ROBOTSTXT_OBEY = True)
  • Use realistic User-Agents: Identify as a real browser, not as a bot
  • Control crawl rate: AutoThrottle adjusts speed based on server capacity
  • Limit concurrency: Only 2 requests per domain at a time
  • Randomized delays: Mimic human browsing patterns

🚀 Production Features (Already Implemented!)

This scraper includes production-ready features:

  • CAPTCHA Solving: 2Captcha integration (configure with API key)
  • Proxy Rotation: Built-in middleware for rotating proxies
  • AutoThrottle: Adapts crawling speed to server load
  • Smart Retries: Exponential backoff for failed requests
  • API Support: Official eBay API integration
  • Configuration Manager: Easy setup through menu option 6

Future Enhancements

  • Distributed crawling with Scrapy Cloud
  • More CAPTCHA types (hCaptcha, FunCaptcha)
  • User-Agent rotation middleware
  • Redis-based job queue
  • Monitoring dashboard

Troubleshooting

Common Issues

CAPTCHA Challenges

  • Option 1 (Recommended): Use 2Captcha API (~$3 per 1000 solves)
    • export CAPTCHA_API_KEY="your-key"
  • Option 2 (Free): Enable local ML solver (experimental)
    • export USE_LOCAL_CAPTCHA=true
    • Uses TensorFlow + OpenCV
  • The middleware automatically tries both methods with fallback

Import Error: No module named 'mtgscraper'

  • Make sure you're running the script from the project root directory
  • Ensure the virtual environment is activated: source venv/bin/activate

Database Not Found

  • Run a scrape first using option 1 in the menu
  • The database is created automatically on first scrape

No Results Found

  • Try a more generic card name (e.g., "Bolt" instead of "Lightning Bolt Alpha Edition")
  • Increase the number of pages to scrape
  • Check if eBay is blocking (see above)

Scrapy Deprecation Warnings

  • Warnings about process_start_requests() and Splash middleware are normal
  • These are from scrapy-splash package compatibility with Scrapy 2.13+
  • The code works correctly despite the warnings
  • Can be safely ignored - they're about future Scrapy versions

Twisted Reactor Issues

  • Fixed by using subprocess to run Scrapy
  • This allows multiple scrapes in one session
  • The interactive menu now works smoothly for repeated scrapes

Future Enhancements

Additional Sources

  • TCGPlayer spider
  • CardKingdom spider
  • StarCityGames spider
  • Card Market (Europe)

Features

  • Export to CSV/JSON
  • Price history tracking over time
  • Price alerts via email/SMS
  • Card comparison tool
  • Market analysis and statistics
  • Web dashboard interface (Flask/Django)

Production Scaling

  • Rotating proxy middleware
  • User-Agent rotation middleware
  • CAPTCHA solving integration
  • Distributed crawling with Scrapy Cloud
  • Redis-based task queue
  • Monitoring and alerting (Prometheus/Grafana)

About

Data Collection and Analytics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published