This project develops predictive models for board game characteristics using BoardGameGeek (BGG) data. A machine learning pipeline predicts each game's complexity, average rating, and number of users rated, then combines these predictions into an estimated geek rating.
- Complete ML Pipeline: From data extraction to model deployment
- Multiple Model Types: Hurdle classification, complexity estimation, rating prediction, and user engagement modeling
- Time-Based Evaluation: Rolling window validation for temporal robustness
- Production Deployment: FastAPI scoring service with model registration and versioning
- Interactive Dashboards: Streamlit-based monitoring and visualization tools
- Cloud Integration: Google Cloud Platform integration for data storage and model deployment
```
bgg-predictive-models/
├── config/              # Configuration files
├── credentials/         # Credential management
├── data/                # Data storage and predictions
├── figures/             # Visualization outputs
├── models/              # Trained models and experiments
├── scoring_service/     # Production deployment service
├── src/                 # Primary source code
│   ├── data/            # Data loading and preparation
│   ├── features/        # Feature engineering and preprocessing
│   ├── models/          # Machine learning models
│   ├── monitor/         # Experiment and prediction monitoring
│   ├── scripts/         # Utility scripts
│   ├── utils/           # Utility functions
│   └── visualizations/  # Data visualization scripts
├── tests/               # Unit and integration tests
├── train.py             # Time-based model evaluation script
├── predict.py           # Prediction generation script
└── Makefile             # Automated workflow commands
```
- Data Pipeline: Automated BGG data extraction and materialized views in BigQuery
- Feature Engineering: Comprehensive preprocessing pipeline with multiple transformer types
- Model Training: Four distinct model types with hyperparameter optimization
  - Hurdle Model: Predicts likelihood of games receiving ratings (logistic regression)
  - Complexity Model: Estimates game complexity (CatBoost/Ridge regression)
  - Rating Model: Predicts average game rating (CatBoost/Ridge regression)
  - Users Rated Model: Predicts number of users who will rate the game (LightGBM/Ridge regression)
- Geek Rating Calculation: Bayesian average calculation using predicted components
- Time-Based Evaluation: Rolling window validation across multiple years
- Experiment Tracking: Comprehensive experiment management and versioning
- Model Registration: Production model registration with validation and versioning
- Scoring Service: FastAPI-based REST API for model inference
- Interactive Dashboards: Real-time monitoring and visualization tools
- Cloud Deployment: Docker containers and Google Cloud Run deployment
- Model Performance Optimization: Continuous improvement of prediction accuracy
- Feature Engineering: Advanced feature transformations and selection
- Ensemble Methods: Combining multiple models for improved predictions
- Real-time Monitoring: Enhanced model performance tracking
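The geek rating step above is a Bayesian average that shrinks a game's predicted rating toward a prior, weighted by the predicted number of raters. A minimal sketch, assuming the defaults `predict.py` exposes as `--prior-rating 5.5` and `--prior-weight 2000` (the function name is illustrative, not the repo's actual API):

```python
def geek_rating(pred_rating: float, pred_users_rated: float,
                prior_rating: float = 5.5, prior_weight: float = 2000.0) -> float:
    """Bayesian average: shrink the predicted rating toward a prior,
    weighted by the predicted number of users who will rate the game."""
    total = pred_users_rated + prior_weight
    return (pred_rating * pred_users_rated + prior_rating * prior_weight) / total
```

A game with many predicted raters keeps most of its predicted rating (8.0 with 2000 raters yields 6.75), while a sparsely rated game is pulled toward the 5.5 prior (8.0 with 100 raters lands near 5.62).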
- Python 3.12+
- UV package manager
- Google Cloud credentials (for data access)
- Docker (for deployment)
```bash
# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/phenrickson/bgg-predictive-models.git
cd bgg-predictive-models

# Install dependencies
uv sync

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration
```

```bash
# Fetch raw data from BigQuery
make data

# Create materialized views
uv run src/data/create_view.py
```

```bash
# Train all models with default settings
make models

# Or train individual models
make hurdle       # Train hurdle classification model
make complexity   # Train complexity estimation model
make rating       # Train rating prediction model
make users_rated  # Train users rated prediction model
```

```bash
# Generate predictions for future games (2024-2029)
make predictions

# Or use the prediction script directly
uv run predict.py --start-year 2024 --end-year 2029
```

```bash
# Run time-based evaluation across multiple years
make evaluate

# View current year configuration
make years
```

```bash
# Launch experiment monitoring dashboard
make experiment_dashboard

# Launch geek rating analysis dashboard
make geek_rating_dashboard

# Launch unsupervised learning dashboard
make unsupervised_dashboard
```

```mermaid
flowchart TD
    A[Raw BGG Data] --> B[Feature Engineering]
    B --> C[Hurdle Model]
    B --> D[Complexity Model]
    D --> H[Predicted Complexity]
    B --> F[Rating Model]
    B --> G[Users Rated Model]
    H --> F
    H --> G
    C --> E{Game Likely Rated?}
    E -->|Yes| I[Apply Rating Predictions]
    E -->|Yes| J[Apply Users Rated Predictions]
    F --> I
    G --> J
    I --> K[Predicted Rating]
    J --> L[Predicted Users Rated]
    H --> M[Geek Rating Calculation]
    K --> M
    L --> M
    M --> N[Final Predictions]
```
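The hurdle gate in the flowchart can be sketched as a simple threshold filter. This is a hypothetical illustration — the `0.6` default comes from `predict.py`'s `--threshold` flag, but the function and dictionary keys are invented for this sketch:

```python
from typing import Optional

def apply_hurdle(hurdle_prob: float, pred_rating: float, pred_users_rated: float,
                 threshold: float = 0.6) -> Optional[dict]:
    """Only games the hurdle model deems likely to be rated receive
    downstream rating / users-rated predictions."""
    if hurdle_prob < threshold:
        return None  # game unlikely to receive enough ratings
    return {"rating": pred_rating, "users_rated": pred_users_rated}
```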
| Model Type | Purpose | Default Algorithm | Features |
|---|---|---|---|
| Hurdle | Classification of games likely to receive ratings | Logistic Regression | Linear preprocessor, probability output |
| Complexity | Game complexity estimation (1-5 scale) | CatBoost Regressor | Tree-based preprocessor, sample weights |
| Rating | Average rating prediction | CatBoost Regressor | Includes predicted complexity, sample weights |
| Users Rated | Number of users prediction | LightGBM Regressor | Log-transformed target, includes complexity |
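The log-transformed target in the last row typically means training on `log1p` of the rating count and inverting predictions with `expm1`, so the heavy right tail of popular games doesn't dominate the loss. A sketch of that round trip (the exact transform in this repo may differ):

```python
import math

def to_log_target(users_rated: int) -> float:
    """Compress the long-tailed count before fitting the regressor."""
    return math.log1p(users_rated)

def from_log_prediction(pred: float) -> float:
    """Invert the transform to report predictions on the original count scale."""
    return math.expm1(pred)
```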
- Categorical Encoding: Target encoding for high-cardinality features
- Numerical Transformations: Log transforms, polynomial features, binning
- Temporal Features: Year-based transformations and era encoding
- Text Processing: Game description and mechanic embeddings
- Sample Weighting: Recency-based weighting for temporal relevance
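The recency-based sample weighting above can be sketched as an exponential decay over a game's age; the half-life parameterization here is an assumption for illustration, not necessarily the repo's actual scheme:

```python
import numpy as np

def recency_weights(publish_years: np.ndarray, reference_year: int = 2025,
                    half_life: float = 5.0) -> np.ndarray:
    """Exponential-decay weights: a game published `half_life` years before
    the reference year gets half the weight of a brand-new one."""
    age = np.clip(reference_year - publish_years, 0, None)
    return 0.5 ** (age / half_life)
```

Weights like these would be passed as `sample_weight` when fitting the complexity and rating models.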
```bash
# Register models for production use
make register

# Or register individual models
make register_complexity
make register_rating
make register_users_rated
make register_hurdle
```

```bash
# Run scoring service locally
make scoring-service

# Build and test Docker containers
make docker-scoring
make docker-training

# Deploy to Google Cloud Run
gcloud builds submit --config scoring_service/cloudbuild.yaml
```

```python
import requests

# Score new games
response = requests.post(
    "http://localhost:8080/score",
    json={
        "model_type": "rating",
        "model_name": "rating-v2025",
        "start_year": 2024,
        "end_year": 2029,
    },
)
predictions = response.json()
```

Key environment variables (see .env.example):
```
# Google Cloud Configuration
GCP_PROJECT_ID=your-project-id
BGG_DATASET=bgg_data_dev
GCS_BUCKET_NAME=your-bucket-name

# Model Configuration
CURRENT_YEAR=2025
```

Default model configurations can be customized via command-line arguments or the Makefile:

```bash
# Example: Use different algorithms
make complexity COMPLEXITY_MODEL=lightgbm
make rating RATING_MODEL=catboost
make users_rated USERS_RATED_MODEL=ridge
```

```bash
# Format code
make format

# Lint code
make lint

# Run tests
uv run pytest
```

```bash
# Upload experiments to cloud storage
make upload_experiments

# Download experiments from cloud storage
make download_experiments

# Clean local experiments
make clean_experiments
```

The project uses a sophisticated time-based evaluation system:
- Training Data: All games published before 2021
- Tuning Data: Games published in 2021
- Testing Data: Games published in 2022
- Prediction Target: Games published 2024-2029
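The fixed years above describe a single window (train before 2021, tune on 2021, test on 2022); the rolling evaluation slides this window across anchor years. A hypothetical sketch of how the windows could be enumerated — the actual logic lives in `train.py`:

```python
def rolling_windows(start_year: int, end_year: int) -> list:
    """For each anchor year Y: train on games published before Y,
    tune on Y, test on Y + 1. The README's example window is Y = 2021."""
    return [
        {"train_before": y, "tune_year": y, "test_year": y + 1}
        for y in range(start_year, end_year + 1)
    ]
```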
- Experiment Dashboard: Compare model performance across experiments
- Geek Rating Dashboard: Analyze predicted vs actual geek ratings
- Unsupervised Dashboard: Explore clustering and dimensionality reduction
- Predictions Dashboard: Monitor prediction quality and distributions
- Classification: Precision, Recall, F1-score, AUC-ROC
- Regression: RMSE, MAE, R², Mean Absolute Percentage Error
- Temporal Stability: Performance consistency across time periods
- Feature Importance: Model interpretability metrics
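The regression metrics above can be computed with scikit-learn or by hand; a self-contained sketch (MAPE assumes no zeros in `y_true`):

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """RMSE, MAE, R², and MAPE over arrays of true / predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
        "mape": float(np.mean(np.abs(err / y_true))) * 100.0,
    }
```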
```bash
# Evaluate models across multiple years with custom parameters
uv run train.py \
    --start-year 2016 \
    --end-year 2022 \
    --model-args \
        complexity.model=catboost \
        complexity.use-sample-weights=true \
        rating.model=ridge \
        rating.min-ratings=5
```

```bash
# Generate predictions with custom parameters
uv run predict.py \
    --hurdle linear-hurdle \
    --complexity catboost-complexity \
    --rating ridge-rating \
    --users-rated lightgbm-users_rated \
    --start-year 2020 \
    --end-year 2030 \
    --threshold 0.6 \
    --prior-rating 5.5 \
    --prior-weight 2000
```

- Fork the repository
- Create a feature branch
- Make changes with appropriate tests
- Run code quality checks: `make format lint`
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- BoardGameGeek for providing comprehensive board game data
- The open-source machine learning community for excellent tools and libraries