A data pipeline for collecting, processing, and analyzing BoardGameGeek game data.
The BGG Data Pipeline is designed to efficiently collect, process, and analyze board game data from BoardGameGeek, with a robust architecture that ensures data integrity and performance.
- **Tracking Tables Architecture** ✨ NEW
  - Resolved BigQuery streaming buffer limitations with append-only tracking tables
  - Separate `fetched_responses` and `processed_responses` tables for better audit trails
  - INSERT-only operations eliminate UPDATE errors on streaming data (see the sketch below)
  - Improved refresh logic based on fetch timestamps instead of process timestamps
  - Comprehensive migration documentation in `docs/MIGRATION_TRACKING_TABLES.md`
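
The append-only pattern can be sketched with the BigQuery Python client as below; the table layout and column names (`game_id`, `fetch_timestamp`) are illustrative assumptions rather than the pipeline's exact schema:

```python
# Hypothetical sketch of the append-only tracking pattern: each fetch is
# recorded as a new row via a streaming INSERT, never an UPDATE, so writes
# cannot conflict with BigQuery's streaming buffer. Column names are assumed.
from datetime import datetime, timezone

from google.cloud import bigquery


def record_fetch(client: bigquery.Client, table_id: str, game_ids: list[int]) -> None:
    """Append one tracking row per fetched game ID (INSERT-only)."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [{"game_id": gid, "fetch_timestamp": now} for gid in game_ids]
    errors = client.insert_rows_json(table_id, rows)  # streaming insert, no UPDATE
    if errors:
        raise RuntimeError(f"Failed to insert tracking rows: {errors}")


# The latest fetch per game is resolved at query time, e.g.:
#   SELECT game_id, MAX(fetch_timestamp) AS last_fetch
#   FROM `project.dataset.fetched_responses`
#   GROUP BY game_id
# rather than by mutating rows that may still sit in the streaming buffer.
```
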
- **Code Consolidation**
  - Eliminated duplicate response storage logic across modules
  - Unified response handling through `BGGResponseFetcher`
  - Cleaner separation of concerns between fetching and processing
- **UV Package Manager Integration**
  - Replaced pip with UV for faster, more reliable package management
  - Improved dependency resolution and virtual environment handling
  - Enhanced project setup and development workflow
- **Enhanced Response Handling**
  - Intelligent handling of game IDs that return no response (see the sketch below)
  - Graceful handling of empty or problematic API responses
  - Detailed logging and status tracking for API interactions
  - Improved error handling and data integrity checks
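
One way the empty-response check might look is sketched below; the status labels and helper function are illustrative assumptions, not the pipeline's actual API:

```python
# Illustrative sketch of classifying a raw BGG API response before processing.
# Status labels and the helper name are assumptions, not the pipeline's code.
import xml.etree.ElementTree as ET


def classify_response(raw_xml: str, requested_ids: list[int]) -> dict[int, str]:
    """Return a status per requested game ID: 'ok', 'missing', or 'parse_error'."""
    statuses = {gid: "missing" for gid in requested_ids}
    if not raw_xml or not raw_xml.strip():
        return statuses  # empty body: every requested ID stays marked as missing
    try:
        root = ET.fromstring(raw_xml)
    except ET.ParseError:
        return {gid: "parse_error" for gid in requested_ids}
    for item in root.findall("item"):
        gid = int(item.get("id", 0))
        if gid in statuses:
            statuses[gid] = "ok"
    return statuses
```
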
- **Cloud Run Integration**
  - Automated pipeline runs via Cloud Run jobs
  - Streamlined deployment process with Cloud Build
  - Enhanced environment configuration handling
  - Comprehensive GitHub Actions workflows
- Retrieves the universe of board game IDs from BoardGameGeek
- Stores IDs in BigQuery's `thing_ids` table
- Runs as a Cloud Run job in both prod and dev environments
- Continuously fetches game data from the BGG API
- Stores raw XML responses in BigQuery's `raw_responses` table
- Tracks fetch operations in the `fetched_responses` tracking table
- Advanced error handling for various API response scenarios
- Handles API rate limiting and retries
- Runs as a Cloud Run job in both prod and dev environments
- Fetches games in chunks (default 20 games per API call)
- Automatically marks game IDs with no response or parsing errors
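
A minimal sketch of the chunked, rate-limited fetch described above is shown here; the delay value is an assumption, and authentication (e.g., the `BGG_API_TOKEN`) is omitted for brevity:

```python
# Sketch of chunked fetching against the public BGG XML API2 "thing" endpoint.
# The chunk size mirrors the default described above; the delay between calls
# is an assumption and may differ from the pipeline's actual pacing.
import time
from typing import Iterator

import requests

BGG_THING_URL = "https://boardgamegeek.com/xmlapi2/thing"


def chunked(ids: list[int], size: int = 20) -> Iterator[list[int]]:
    """Yield successive chunks of game IDs."""
    for i in range(0, len(ids), size):
        yield ids[i : i + size]


def fetch_raw_responses(
    game_ids: list[int], delay_seconds: float = 2.0
) -> Iterator[tuple[list[int], str]]:
    """Yield (chunk_of_ids, raw_xml) pairs, pausing between requests."""
    for chunk in chunked(game_ids, 20):
        params = {"id": ",".join(map(str, chunk)), "stats": 1}
        response = requests.get(BGG_THING_URL, params=params, timeout=60)
        response.raise_for_status()
        yield chunk, response.text
        time.sleep(delay_seconds)  # simple rate limiting between API calls
```
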
- Processes raw responses into normalized tables
- Tracks processing operations in the `processed_responses` tracking table
- Robust handling of incomplete or problematic game data
- Uses INSERT-only operations to avoid BigQuery streaming buffer conflicts
- Runs as a Cloud Run Job
- Supports multiple parallel tasks (default 5 concurrent processing jobs)
- Handles processing errors without disrupting overall data fetching
- Automatically retries failed processing attempts
- Scheduled to run every 3 hours
- Enhanced logging and status tracking
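
The parallel-task behaviour can be illustrated with the `CLOUD_RUN_TASK_INDEX` / `CLOUD_RUN_TASK_COUNT` environment variables that Cloud Run jobs provide to each task; the modulo partitioning scheme below is an assumption, not necessarily how the processor divides its work:

```python
# Illustrative partitioning of unprocessed responses across parallel Cloud Run
# job tasks. CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT are set by Cloud Run
# for each task; the modulo assignment itself is an assumed scheme.
import os


def assigned_game_ids(game_ids: list[int]) -> list[int]:
    """Return only the game IDs this task should process."""
    task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
    task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", 1))
    return [gid for gid in game_ids if gid % task_count == task_index]
```
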
- Refreshes previously loaded games based on publication year and age
- Prioritizes recently published games for more frequent updates
- Uses `fetched_responses` tracking to prevent duplicate fetches
- Automatically triggers the processor after fetching new data
- Configurable refresh intervals for different game age categories
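
An illustrative refresh policy based on publication year might look like the following; the specific age thresholds and intervals are assumptions, not the pipeline's configured values:

```python
# Assumed refresh policy: more recently published games are refreshed more
# often. Thresholds and intervals are illustrative placeholders.
from datetime import datetime, timedelta, timezone


def refresh_interval(year_published: int | None) -> timedelta:
    """Map a game's publication year to how often its data should be re-fetched."""
    if year_published is None:
        return timedelta(days=90)
    age = datetime.now(timezone.utc).year - year_published
    if age <= 2:
        return timedelta(days=7)    # new releases: weekly
    if age <= 10:
        return timedelta(days=30)   # recent games: monthly
    return timedelta(days=90)       # older games: quarterly


def needs_refresh(last_fetch: datetime, year_published: int | None) -> bool:
    """Decide whether a game is due for a re-fetch based on its last fetch time."""
    return datetime.now(timezone.utc) - last_fetch >= refresh_interval(year_published)
```
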
```mermaid
graph TD
    A[BGG XML API2] -->|Fetch Game IDs| B[ID Fetcher]
    B -->|Store IDs| C[thing_ids Table]
    C -->|Unfetched IDs| D[Response Fetcher]
    D -->|Rate-Limited Requests| A
    D -->|Store Raw Data| E[raw_responses Table]
    D -->|Track Fetch| F[fetched_responses Table]
    F -->|Unprocessed Records| G[Response Processor]
    E -->|Raw XML/JSON| G
    G -->|Validate & Transform| H[Normalized Tables]
    G -->|Track Processing| I[processed_responses Table]
    H -->|Load| J[BigQuery Warehouse]
```
- Automated BGG game data collection
- Intelligent rate-limited API client
- Append-only tracking tables for BigQuery streaming compatibility
- Comprehensive data validation
- BigQuery data warehouse integration
- Robust error handling and retry mechanisms
- Separate tracking of fetch and process operations
- Detailed audit trails for all pipeline operations
- Raw data archival to Cloud Storage
- Python 3.12+
- UV package manager (required, replaces pip)
- Google Cloud project with required APIs enabled:
- Cloud Run
- Cloud Build
- Cloud Scheduler
- BigQuery
- Container Registry
- Service account with necessary permissions:
- Cloud Run Invoker
- Cloud Build Editor
- BigQuery Data Editor
- GitHub repository secrets configured:
  - `SERVICE_ACCOUNT_KEY`: GCP service account key JSON
  - `GCP_PROJECT_ID`: Google Cloud project ID
  - `BGG_API_TOKEN`: BoardGameGeek API authentication token
- GitHub repository variables:
  - `ENVIRONMENT`: Deployment environment (e.g., 'prod', 'dev')
- Clone the repository:

```bash
git clone https://github.com/phenrickson/bgg-data-warehouse.git
cd bgg-data-warehouse
```

- Create service account and grant permissions:

```bash
# Create service account
gcloud iam service-accounts create bgg-processor \
    --display-name="BGG Processor Service Account"

# Grant required permissions
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:bgg-processor@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:bgg-processor@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/run.invoker"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:bgg-processor@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudbuild.builds.editor"
```

- Install UV package manager:

```bash
# Using winget (Windows)
winget install --id=astral-sh.uv -e

# Using Homebrew (macOS)
brew install uv

# Using the installer script (Linux/macOS)
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Install dependencies:

```bash
# Create virtual environment
uv venv

# Activate virtual environment
# On Windows:
.venv\Scripts\activate
# On Unix/macOS:
source .venv/bin/activate

# Install dependencies
uv sync
```

- Configure Google Cloud credentials:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account-key.json"
```

- Configure environment variables:

```bash
# Copy example environment file
cp .env.example .env

# Edit .env file with your configuration
# Required variables:
# - GCP_PROJECT_ID: Your Google Cloud project ID
# - ENVIRONMENT: 'dev', 'test', or 'prod'
# - BGG_API_TOKEN: Your BoardGameGeek API token
```

- Configure GitHub repository:
  - Add required secrets in repository settings:
    - `SERVICE_ACCOUNT_KEY`: Your GCP service account key JSON
    - `GCP_PROJECT_ID`: Your Google Cloud project ID
    - `BGG_API_TOKEN`: Your BoardGameGeek API token
  - Add repository variables:
    - `ENVIRONMENT`: Set to 'prod' or 'dev'
To run the pipeline components locally:

```bash
# Fetch board game IDs
uv run python -m src.pipeline.fetch_ids --environment=dev

# Fetch new board game data
uv run python -m src.pipeline.fetch_responses --environment=dev

# Process raw responses
uv run python -m src.pipeline.process_responses --environment=dev
```

The pipeline runs automatically via Cloud Run jobs:

- `bgg-fetch-ids`: Fetches game IDs from BoardGameGeek
- `bgg-fetch-responses`: Fetches new game data every 3 hours
- `bgg-process-responses`: Processes raw responses every 3 hours
To manually trigger jobs:

```bash
# Trigger fetch IDs job
gcloud run jobs execute bgg-fetch-ids-dev \
    --region us-central1 \
    --wait

# Trigger fetch responses job
gcloud run jobs execute bgg-fetch-responses-dev \
    --region us-central1 \
    --wait

# Trigger process responses job
gcloud run jobs execute bgg-process-responses-dev \
    --region us-central1 \
    --wait
```

The project uses GitHub Actions for automated deployment and job execution:

- **Deployment Workflow** (`deploy.yml`):
  - Triggered on pushes to the main branch
  - Builds and deploys Cloud Run jobs
  - Updates job configurations
- **Pipeline Workflow** (`pipeline.yml`):
  - Runs every day at 6 AM UTC via cron schedule
  - Executes fetch IDs, fetch responses, and process jobs sequentially
  - Monitors job completion status
To manually trigger workflows:
- Use GitHub Actions UI
- Select workflow
- Click "Run workflow"
- **API Errors**:
  - Fetcher retries with exponential backoff (see the sketch after this list)
  - Failed requests logged in the `request_log` table
  - Automatic marking of game IDs with no response in `fetched_responses`
- **Processing Errors**:
  - Each response can be retried up to 3 times
  - Errors tracked in the `processed_responses` table with detailed error messages
  - Failed items do not block other processing
  - Detailed logging of processing challenges
  - No streaming buffer conflicts due to INSERT-only operations
- **BigQuery Streaming Buffer**:
  - All status tracking uses INSERT-only operations
  - Separate tracking tables eliminate UPDATE/DELETE conflicts
  - See `docs/MIGRATION_TRACKING_TABLES.md` for architecture details
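
For the API errors above, a minimal sketch of exponential backoff retries is shown below; the retry count, base delay, and set of retryable status codes are assumptions rather than the fetcher's exact behaviour:

```python
# Illustrative exponential backoff for transient BGG API failures.
# Retry count, base delay, and retryable status codes are assumed values.
import time

import requests


def fetch_with_backoff(url: str, params: dict, max_retries: int = 5) -> requests.Response:
    """Retry transient failures, doubling the wait between attempts."""
    delay = 2.0
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=60)
            if response.status_code in (429, 500, 502, 503):
                raise requests.HTTPError(f"retryable status {response.status_code}")
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= 2  # exponential backoff
    raise RuntimeError("unreachable")
```
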
- Create virtual environment:

```bash
uv venv
source .venv/bin/activate  # Unix/macOS
.venv\Scripts\activate     # Windows
```

- Install development dependencies:

```bash
uv pip install -e ".[dev]"
```

- Test components:

```bash
# Test ID fetcher
uv run python -m src.pipeline.fetch_ids --environment=dev

# Test response fetcher
uv run python -m src.pipeline.fetch_responses --environment=dev

# Test processor
uv run python -m src.pipeline.process_responses --environment=dev

# Run test suite
uv run pytest
```

- **Modify the processor**:
  - Update `process_responses.py`
  - Build and test locally
  - Run tests: `uv run pytest tests/test_processor.py`
  - Deploy the new version: `gcloud builds submit`
- **Modify the fetcher**:
  - Update `fetch_responses.py`
  - Run tests: `uv run pytest tests/test_api_integration.py`
  - Deploy the new version via GitHub Actions
The BGG Data Warehouse includes a Streamlit dashboard for monitoring the data pipeline and exploring the collected data.
- Overview of key metrics (total games, ranked games, etc.)
- Game metadata statistics (categories, mechanics, designers, etc.)
- Time series visualizations of fetch and processing activities
- Game search functionality
The dashboard is automatically deployed to Google Cloud Run when changes are made to the visualization code. You can access it at:
https://bgg-dashboard-[hash].run.app
Where [hash] is a unique identifier assigned by Google Cloud Run. The exact URL will be output at the end of the GitHub Actions workflow run.
To run the dashboard locally:

```bash
# Install dependencies
uv pip install -e .

# Run the dashboard
streamlit run src/visualization/dashboard.py

# Run the game search dashboard
streamlit run src/visualization/game_search_dashboard.py

# Run the combined dashboard
streamlit run src/visualization/combined_dashboard.py
```

This will start the dashboard on http://localhost:8501.
The dashboard is automatically deployed via GitHub Actions when changes are pushed to the main branch that affect files in the src/visualization/ directory. You can also manually trigger the deployment from the GitHub Actions tab.
The deployment workflow:
- Builds a Docker image for the dashboard using `Dockerfile.dashboard`
- Pushes the image to Google Container Registry
- Deploys the image to Google Cloud Run
- Outputs the URL where the dashboard can be accessed
- Fork the repository
- Create a feature branch
- Make your changes
- Set up development environment:

```bash
uv venv
source .venv/bin/activate  # Unix/macOS
.venv\Scripts\activate     # Windows
uv pip install -e ".[dev]"
```
- Run tests and linting:

```bash
uv run pytest
uv run ruff check .
uv run black .
```
- Submit a pull request
Your pull request will trigger:
- Automated tests
- Code quality checks
- Test deployment to development environment
- Manual review process
- Automated deployment to production (when merged)
Additional documentation is available in the docs/ directory:
- Architecture Overview
- BGG API Documentation
- Dashboard Deployment Guide
- Tracking Tables Migration - Details on BigQuery streaming buffer solution
MIT License - see LICENSE file for details.