MTG Scraper

A CLI tool for collecting Magic: The Gathering card prices. Features eBay API integration, advanced web scraping with CAPTCHA solving and proxy rotation, and a beautiful interactive menu interface.

Demo Video

Demo_Video_1080p.mov

Features

Interactive Menu: Easy-to-use menu interface
Playwright Integration: Browser automation for eBay!
Adaptive Analyzer: Auto-discovers page structure & selectors
Flexible Search: Search by card name, set, type, or custom query
API Integration: eBay Finding API support (official)
Scrapy Framework: Production-grade spider with advanced features
CAPTCHA Solving: Dual-mode (2Captcha API + local ML with TensorFlow)
Proxy Rotation: Built-in middleware for rotating IP addresses
Database Storage: SQLite integration with SQLAlchemy ORM
Data Visualization: Beautiful formatted tables with filtering
Beautiful CLI: Colorful ASCII art with styled menus
Debug Mode: Auto-screenshots and visible browser option
Configurable: Easy configuration for CAPTCHA, proxies, and APIs
dbt Analytics: Transform and analyze scraped data with SQL models
Data Quality Tests: Automated testing for data integrity
Interactive Analytics: View price trends, statistics, and insights

dbt Analytics Integration

Transform your scraped data into actionable insights with built-in dbt (data build tool) support!

What You Get:

Price Statistics: Min/max/average prices, median, spreads by card
Price Trends: Track price changes over time
Top Cards: Ranked by price, volume, and "hotness" score
Data Quality Tests: Ensure your data is clean and accurate
Interactive Documentation: Auto-generated docs with data lineage

Quick Start:

# Install dbt
./setup_dbt.sh

# Or manually
pip install dbt-core dbt-sqlite

# Run dbt models (from main menu option 7)
python mtgscraper.py
→ Select option 7 (Run dbt Models)

# View analytics (from main menu option 9)
→ Select option 9 (View Analytics Results)

Available Analytics Models:

Model	Description
`card_price_stats`	Price statistics by card (min, max, avg, median, spread)
`price_trends`	Daily price changes and trends over time
`top_cards`	Top cards ranked by various metrics
`dim_cards`	Card dimension table with latest info
`fct_card_prices`	All price observations (fact table)

See DBT_README.md for detailed documentation.

Data Collection Methods

Playwright (Default)

Browser Automation with real Chrome
- Adaptive structure analyzer (auto-discovers selectors)
- Bypasses robots.txt (acts like real user)
- Handles JavaScript automatically
- Stealth mode - harder to detect
- Can run in visible mode to watch it work
- Auto-screenshots for debugging
- Multiple search types: Card name, set, type, custom
- Slower but more effective than Scrapy

Official API

eBay Browse API (RESTful)
- OAuth 2.0 with Client ID + Client Secret
- Free from eBay Developers Program
- No blocking, clean JSON responses
- Modern RESTful API with better rate limits
- Multiple search types: Card name, set, type, custom

Scrapy Spider

Traditional HTTP Scraping
- Very fast (no browser overhead)
- Production features: CAPTCHA solving, proxies, AutoThrottle

Recommendation: Use Playwright (option 1) for actual scraping, or API (option 2) for this demo.

Future Sources

TCGPlayer API
CardKingdom integration
StarCityGames
Card Market (Europe)

Installation

Prerequisites

Python 3.8 or higher
pip package manager
macOS, Linux, or Windows

Setup

Clone the repository:

git clone <your-repo-url>
cd scraperTool

Quick Setup (macOS/Linux):

./setup.sh

Manual Setup:

# Create virtual environment
python3 -m venv venv

# Activate on macOS/Linux
source venv/bin/activate

# Activate on Windows
# venv\Scripts\activate

# Install core dependencies (fast)
pip install -r requirements-minimal.txt

# OR install full features (includes TensorFlow - large download)
pip install -r requirements.txt

# Install dbt core
pip install dbt-core dbt-sqlite

Note: TensorFlow is ~500MB. Use requirements-minimal.txt if you don't need local ML CAPTCHA solving.

(Optional) If using Scrapy-Splash for JavaScript rendering:

docker run -p 8050:8050 scrapinghub/splash

Note: Always activate the virtual environment (source venv/bin/activate) before running the scraper!

Configuration (Optional Services)

1. eBay Browse API (Recommended for production):

# Register at https://developer.ebay.com/
# Get your OAuth credentials (Client ID + Client Secret)
export EBAY_CLIENT_ID="your-app-id-here"
export EBAY_CLIENT_SECRET="your-cert-id-here"

2. CAPTCHA Solver (Optional - Choose one):

Option A - 2Captcha (Paid, Reliable):

# Install: pip install 2captcha-python
# Register at https://2captcha.com/ (~$3 per 1000 solves)
export CAPTCHA_API_KEY="your-2captcha-api-key"

Option B - Local ML Solver (Free, Experimental):

# Install: pip install tensorflow opencv-python pillow (large download ~500MB)
# Enable local ML solver
export USE_LOCAL_CAPTCHA=true

Tip: Both are optional! The scraper works without CAPTCHA solving.

3. Rotating Proxies (For large-scale scraping): Get Proxy rotation with Proxy Rot here:

# Create proxies.txt with format: http://user:pass@host:port
export PROXY_LIST="proxies.txt"

Make persistent:

echo 'export EBAY_CLIENT_ID="your-client-id"' >> ~/.zshrc
echo 'export EBAY_CLIENT_SECRET="your-client-secret"' >> ~/.zshrc
echo 'export CAPTCHA_API_KEY="your-key"' >> ~/.zshrc

Use option 6 in the menu to check configuration status!

Usage

Interactive Menu (Recommended)

Simply run the script to launch the beautiful interactive menu:

python mtgscraper.py

The menu provides:

SCRAPING METHODS:

1. Playwright Scraper - Browser automation (With adaptive structure analyzer)
2. eBay API Search - Official API
3. Scrapy Spider - Fast HTTP scraper (blocked by robots.txt)

VIEW & ANALYZE:

4. View Results - Browse collected data in formatted tables
5. View Card Details - See detailed information for specific cards
6. Database Statistics - View stats about your collection

DBT ANALYTICS:

7. Run dbt Models - Transform data into analytics tables
8. Run dbt Tests - Check data quality
9. View Analytics Results - See price trends and statistics
10. Generate dbt Docs - Create interactive documentation

EXPORT & AUTOMATE:

11. Export to CSV - Save results to CSV file
12. Upload to S3 - Upload data to AWS S3
13. Schedule Cron Job - Automate scraping
14. Remove Cron Jobs - Remove scheduled tasks

SETTINGS:

15. Configure Settings - Set up CAPTCHA, proxies, and API keys
16. Clear Database - Remove all data (with confirmation)
0. Exit - Quit the program

Direct Command Mode

You can also use the tool directly from command line for scripting:

# Quick scrape without menu
python mtgscraper.py --card "Black Lotus" --pages 5

# Or with short flags
python mtgscraper.py -c "Lightning Bolt" -p 3

Example Workflows

Method 1: eBay Browse API (Recommended)

Get your OAuth credentials:

# Register at developer.ebay.com
# Get Client ID (App ID) and Client Secret (Cert ID)
export EBAY_CLIENT_ID="your-app-id"
export EBAY_CLIENT_SECRET="your-cert-id"

Run and search:

python mtgscraper.py
# Select option 2 (eBay API Search)
# Enter card name
# OAuth token will be obtained automatically
# View results with option 4

Method 1: Playwright Browser Scraping (Recommended!)

Install Playwright:

pip install playwright
playwright install chromium

Run and scrape:

python mtgscraper.py
# Select option 1 - Playwright (default!)
# Choose search type (card name, set, type, custom)
# Enter your search query
# Choose headless or visible mode
# Watch the adaptive analyzer discover page structure!
# Get actual eBay results!

Search Examples:

By card: "Black Lotus", "Lightning Bolt"
By set: "Alpha", "Beta", "Modern Masters"
By type: "creature", "planeswalker", "instant"
Custom: "foil rare mythic"

Method 2: eBay Browse API (Production)

Get OAuth credentials and run:

export EBAY_CLIENT_ID="your-app-id"
export EBAY_CLIENT_SECRET="your-cert-id"
python mtgscraper.py
# Select option 2 - eBay Browse API
# OAuth authentication happens automatically
# Choose search type
# Get official data

Method 3: Scrapy Spider (Educational)

Shows ethical scraping with Scrapy, but blocked by robots.txt:

# Select option 3
# Gets 0 results (respects robots.txt)
# Demonstrates production Scrapy architecture

Quick Start (No Configuration)

python mtgscraper.py
# Select option 1 (API simulated mode available)
# Or select option 2 (basic scraping without CAPTCHA solver)

Command Line Options

Option	Short	Description	Default
`--menu`	-	Use interactive menu	True
`--card`	`-c`	Card name for direct scraping	-
`--pages`	`-p`	Number of pages to scrape	3

Examples:

# Interactive menu (default)
python mtgscraper.py

# Direct scrape
python mtgscraper.py -c "Force of Will" -p 5

# Skip menu
python mtgscraper.py --no-menu -c "Lightning Bolt"

Project Structure

scraperTool/
├── mtgscraper/
│   ├── __init__.py          # Package initialization
│   ├── settings.py          # Scrapy settings (AutoThrottle, robots.txt)
│   ├── items.py             # Data models for scraped items
│   ├── pipelines.py         # Database pipeline with SQLAlchemy
│   ├── middlewares.py       # CAPTCHA solver & proxy rotation
│   └── spiders/
│       ├── __init__.py      # Spiders package initialization
│       └── ebay_spider.py   # eBay scraping spider
├── mtgscraper.py            # Main CLI entry point (interactive menu)
├── scrapy.cfg               # Scrapy project configuration
├── requirements.txt         # Python dependencies
├── setup.sh                 # Quick setup script
└── README.md                # This file

Technical Details

Technologies Used

Playwright: Modern browser automation framework (headless Chrome/Firefox)
Scrapy: High-performance web scraping framework
SQLAlchemy: SQL toolkit and ORM for database operations
Click: Command-line interface creation framework
Colorama: Cross-platform colored terminal text
Tabulate: Pretty-print tabular data
BeautifulSoup4: HTML parsing library
2Captcha-Python: Commercial CAPTCHA solving API client
TensorFlow: ML framework for local CAPTCHA solving (optional)
OpenCV: Computer vision library for image processing (optional)
Scrapy-Splash: JavaScript rendering support (optional)

Database Schema

The SQLite database (mtg_cards.db) contains a single table with the following structure:

Column	Type	Description
id	Integer	Primary key
card_name	String	Name of the card
set_name	String	Set/edition name
price	String	Listed price
condition	String	Card condition
seller	String	Seller information
url	String	Link to listing
source	String	Source site (e.g., "eBay")
timestamp	String	Scrape timestamp
shipping	String	Shipping information
buy_it_now	Boolean	Buy It Now listing flag

Scrapy Configuration

User Agent: Modern Chrome browser user agent (looks like real user)
Download Delay: 2 seconds between requests (randomized)
AutoThrottle: Automatically adjusts speed based on server load
Concurrent Requests: Limited to 2 per domain (respectful)
Robots.txt: Enabled - respects website policies
Cookies: Enabled for session management
Retry Logic: Smart retry with backoff for temporary failures

Why These Settings Matter:

Mimics human browsing behavior
Reduces server load
Less likely to be blocked

These settings can be customized in mtgscraper/settings.py.

Development

Adding a New Source

To add a new scraping source:

Create a new spider in mtgscraper/spiders/:

# mtgscraper/spiders/tcgplayer_spider.py
import scrapy
from mtgscraper.items import MtgCardItem

class TcgplayerMtgSpider(scrapy.Spider):
    name = 'tcgplayer'
    allowed_domains = ['tcgplayer.com']
    
    def start_requests(self):
        # Your scraping logic here
        pass
    
    def parse(self, response):
        # Your parsing logic here
        pass

Update the CLI in mtgscraper.py to include the new source:

@click.option('--source', '-s', default='ebay', 
              type=click.Choice(['ebay', 'tcgplayer']), 
              help='Source to scrape from')

Add source-specific configuration to mtgscraper/settings.py if needed.

Customizing Output

To modify the output format, edit the view command in mtgscraper.py. The tabulate library supports multiple table formats:

print(tabulate(rows, headers=headers, tablefmt='fancy_grid'))
# Other formats: 'grid', 'simple', 'plain', 'html', 'latex', etc.

Best Practices

"Good Bot" Behavior

Respect robots.txt: Always obey website rules (ROBOTSTXT_OBEY = True)
Use realistic User-Agents: Identify as a real browser, not as a bot
Control crawl rate: AutoThrottle adjusts speed based on server capacity
Limit concurrency: Only 2 requests per domain at a time
Randomized delays: Mimic human browsing patterns

🚀 Production Features (Already Implemented!)

This scraper includes production-ready features:

CAPTCHA Solving: 2Captcha integration (configure with API key)
Proxy Rotation: Built-in middleware for rotating proxies
AutoThrottle: Adapts crawling speed to server load
Smart Retries: Exponential backoff for failed requests
API Support: Official eBay API integration
Configuration Manager: Easy setup through menu option 6

Future Enhancements

Distributed crawling with Scrapy Cloud
More CAPTCHA types (hCaptcha, FunCaptcha)
User-Agent rotation middleware
Redis-based job queue
Monitoring dashboard

Troubleshooting

Common Issues

CAPTCHA Challenges

Option 1 (Recommended): Use 2Captcha API (~$3 per 1000 solves)
- export CAPTCHA_API_KEY="your-key"
Option 2 (Free): Enable local ML solver (experimental)
- export USE_LOCAL_CAPTCHA=true
- Uses TensorFlow + OpenCV
The middleware automatically tries both methods with fallback

Import Error: No module named 'mtgscraper'

Make sure you're running the script from the project root directory
Ensure the virtual environment is activated: source venv/bin/activate

Database Not Found

Run a scrape first using option 1 in the menu
The database is created automatically on first scrape

No Results Found

Try a more generic card name (e.g., "Bolt" instead of "Lightning Bolt Alpha Edition")
Increase the number of pages to scrape
Check if eBay is blocking (see above)

Scrapy Deprecation Warnings

Warnings about process_start_requests() and Splash middleware are normal
These are from scrapy-splash package compatibility with Scrapy 2.13+
The code works correctly despite the warnings
Can be safely ignored - they're about future Scrapy versions

Twisted Reactor Issues

Fixed by using subprocess to run Scrapy
This allows multiple scrapes in one session
The interactive menu now works smoothly for repeated scrapes

Future Enhancements

Additional Sources

TCGPlayer spider
CardKingdom spider
StarCityGames spider
Card Market (Europe)

Features

Production Scaling

Rotating proxy middleware
User-Agent rotation middleware
CAPTCHA solving integration
Distributed crawling with Scrapy Cloud
Redis-based task queue
Monitoring and alerting (Prometheus/Grafana)

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
models		models
mtgscraper		mtgscraper
.gitignore		.gitignore
.user.yml		.user.yml
DBT_README.md		DBT_README.md
README.md		README.md
dbt_project.yml		dbt_project.yml
gradient_showcase.py		gradient_showcase.py
mtgscraper.py		mtgscraper.py
packages.yml		packages.yml
profiles.yml		profiles.yml
requirements-minimal.txt		requirements-minimal.txt
requirements.txt		requirements.txt
scrapy.cfg		scrapy.cfg
setup.sh		setup.sh
setup_dbt.sh		setup_dbt.sh

cjanowski/scrapy_play

Folders and files

Latest commit

History

Repository files navigation

MTG Scraper

Demo Video

Features

dbt Analytics Integration

What You Get:

Quick Start:

Available Analytics Models:

Data Collection Methods

Playwright (Default)

Official API

Scrapy Spider

Future Sources

Installation

Prerequisites

Setup

Configuration (Optional Services)

Usage

Interactive Menu (Recommended)

Direct Command Mode

Example Workflows

Method 1: eBay Browse API (Recommended)

Method 1: Playwright Browser Scraping (Recommended!)

Method 2: eBay Browse API (Production)

Method 3: Scrapy Spider (Educational)

Quick Start (No Configuration)

Command Line Options

Project Structure

Technical Details

Technologies Used

Database Schema

Scrapy Configuration

Development

Adding a New Source

Customizing Output

Best Practices

"Good Bot" Behavior

🚀 Production Features (Already Implemented!)

Future Enhancements

Troubleshooting

Common Issues

Future Enhancements

Additional Sources

Features

Production Scaling

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages