PHASM is a Bayesian projection system for MLB hitters and pitchers. It combines multivariate outcome modeling, hierarchical player/position effects, and AR(1) year trends to produce probabilistic forecasts of per-PA/per-IP rates and rate stats. The system also supports total-count projections when paired with external PA/IP forecasts.
- Fits a joint multivariate Bayesian model (rstan) for H, R, RBI, HR, SB (per-PA rates) plus AVG, OBP, SLG.
- Fits a joint multivariate Bayesian model (rstan) for SP: SO, BB, H, ER, W, QS (per-IP rates).
- Fits a joint multivariate Bayesian model (rstan) for RP: SO, BB, H, ER, W, SVHLD (per-IP rates).
- Uses age/aging curve and position (hitters); the current pitcher model is SP-only.
- Player random intercepts and age slopes; position random intercepts and age/age^2 slopes.
- Year random intercepts with AR(1) evolution.
- Hitter Stan model: `models/model.stan`
- Hitter R driver: `models/fit_model.R`
- SP Stan model: `models/sp_model.stan`
- SP R driver: `models/fit_sp_model.R`
- RP Stan model: `models/rp_model.stan`
- RP R driver: `models/fit_rp_model.R`
- Hitter inputs: `data/fangraphs_batters_2018_2025.csv`
- Pitcher inputs: `data/fangraphs_pitchers_2018_2025.csv`
- Hitter outputs (after fitting):
  - `models/model_fit.rds`
  - `models/model_inputs.rds`
  - `results/projections/batters/category_projections_2026.csv`
- SP outputs (after fitting):
  - `models/sp_model_fit.rds`
  - `models/sp_model_inputs.rds`
  - `results/projections/pitchers/sp_category_projections_2026.csv`
- RP outputs (after fitting):
  - `models/rp_model_fit.rds`
  - `models/rp_model_inputs.rds`
  - `results/projections/pitchers/rp_category_projections_2026.csv`
- Age (standardized) and age^2
- SP-only pitcher model (no role random effects)
- 2026 covariates are taken from the most recent season per player (age advanced by +1).
- Count outcomes are modeled as Poisson with a log(PA) offset; projections are per-PA rates.
- Pitcher count outcomes are modeled as Poisson with a log(IP) offset; projections are per-IP rates.
- AVG/OBP use a logit transform; SLG uses log(SLG + 1e-4).
- Handedness and Statcast covariates are excluded by design.
- Seasons with PA < 100 are excluded from the dataset before fitting.
- Seasons with IP < 20 are excluded from the pitcher dataset before fitting.
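The log-offset notes above mean that the linear predictor is a log rate, and adding log(PA) or log(IP) turns it into an expected count. A minimal Python sketch of that relationship follows; the helper name `expected_count` and the numbers are illustrative assumptions, not part of the repo's R/Stan code.

```python
import math

def expected_count(eta: float, exposure: float) -> float:
    """Expected count under a Poisson model with a log(exposure) offset.

    eta is the log of the per-exposure rate (per PA or per IP);
    exposure is the PA or IP total for the season.
    """
    return math.exp(math.log(exposure) + eta)

# Illustrative values only: a 4% HR-per-PA rate over 600 PA.
eta_hr = math.log(0.04)
print(expected_count(eta_hr, 600))  # about 24 expected HR
print(math.exp(eta_hr))             # the per-PA rate, which does not depend on PA
```

Because the offset enters with a fixed coefficient of 1, doubling the exposure doubles the expected count while the projected rate stays the same.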
This repo pulls FanGraphs data via baseballr and builds the model dataset:

`Rscript data/build_fangraphs_batters_from_baseballr.R`

That script:
- Fetches 2018–2025 batting leaderboards from FanGraphs (requires internet)
- Keeps seasons with `PA >= 100`
- Keeps players with `>= 100 PA` in either 2024 or 2025
- Writes `data/fangraphs_batters_2018_2025.csv`
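The two filters above can be sketched in a few lines of Python. This only illustrates the logic and is not the repo's R script; the function name `filter_batter_seasons` and the record keys `player`, `season`, and `PA` are hypothetical.

```python
def filter_batter_seasons(rows, min_pa=100, recent_years=(2024, 2025)):
    """Apply the season-level and player-level filters to season records.

    Each record is a dict with hypothetical keys: 'player', 'season', 'PA'.
    """
    # Step 1: keep only seasons that meet the PA threshold.
    qualified = [r for r in rows if r["PA"] >= min_pa]
    # Step 2: keep players who have a qualifying season in a recent year.
    active = {r["player"] for r in qualified if r["season"] in recent_years}
    return [r for r in qualified if r["player"] in active]

rows = [
    {"player": "A", "season": 2023, "PA": 550},
    {"player": "A", "season": 2025, "PA": 610},
    {"player": "B", "season": 2022, "PA": 480},  # no recent season: dropped
    {"player": "C", "season": 2024, "PA": 60},   # under the threshold: dropped
]
kept = filter_batter_seasons(rows)
print([(r["player"], r["season"]) for r in kept])  # [('A', 2023), ('A', 2025)]
```

The pitcher build would follow the same pattern with IP and a threshold of 20.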

`Rscript models/fit_model.R`

Outputs:
- `models/model_fit.rds`
- `models/model_inputs.rds`
- `results/projections/batters/category_projections_2026.csv`

`Rscript results/scripts/build_composite_projections.R`

Outputs:
- `results/projections/batters/composite_projections_2026.csv`

`Rscript results/scripts/build_top20_composite_by_position.R`

Outputs:
- `results/projections/batters/top20_composite_by_position.md`

`Rscript results/scripts/plot_latent_fit_top100_by_category.R`

Outputs:
- `results/plots/fitted_outcome_curves/batters/latent_fit_top100_<CATEGORY>.pdf`

`Rscript data/build_fangraphs_pitchers_from_baseballr.R`

That script:
- Fetches 2018–2025 pitching leaderboards from FanGraphs (requires internet)
- Keeps seasons with `IP >= 20`
- Keeps pitchers with `>= 20 IP` in either 2024 or 2025
- Writes `data/fangraphs_pitchers_2018_2025.csv`

`Rscript models/fit_sp_model.R`

Outputs:
- `models/sp_model_fit.rds`
- `models/sp_model_inputs.rds`
- `results/projections/pitchers/sp_category_projections_2026.csv`

`Rscript models/fit_rp_model.R`

Outputs:
- `models/rp_model_fit.rds`
- `models/rp_model_inputs.rds`
- `results/projections/pitchers/rp_category_projections_2026.csv`

`Rscript results/scripts/plot_latent_fit_top100_by_sp_category.R`

Outputs:
- `results/plots/fitted_outcome_curves/pitchers/starters/sp_latent_fit_top100_<CATEGORY>.pdf`

`Rscript results/scripts/plot_latent_fit_top100_sp_derived.R`

Outputs:
- `results/plots/fitted_outcome_curves/pitchers/starters/sp_latent_fit_derived_<METRIC>.pdf`

`Rscript results/scripts/plot_latent_fit_top100_by_rp_category.R`

Outputs:
- `results/plots/fitted_outcome_curves/pitchers/relievers/rp_latent_fit_top100_<CATEGORY>.pdf`

`Rscript results/scripts/plot_latent_fit_top100_rp_derived.R`

Outputs:
- `results/plots/fitted_outcome_curves/pitchers/relievers/rp_latent_fit_derived_<METRIC>.pdf`

`Rscript results/scripts/plot_2026_intervals_by_position.R`

Outputs:
- `results/plots/interval_projections/batters/projection_intervals_2026_<POSITION>.pdf`

`Rscript results/scripts/plot_2026_sp_intervals_by_role.R`

Outputs:
- `results/plots/interval_projections/pitchers/starters/sp_intervals_2026_<ROLE>.pdf`

`Rscript results/scripts/plot_2026_rp_intervals_by_role.R`

Outputs:
- `results/plots/interval_projections/pitchers/relievers/rp_intervals_2026_<ROLE>.pdf`

`Rscript results/scripts/build_sp_composite_projections.R`

Outputs:
- `results/projections/pitchers/sp_composite_projections_2026.csv`

`Rscript results/scripts/build_top50_pitchers_composite_by_role.R`

Outputs:
- `results/projections/pitchers/top50_pitchers_composite_by_role.md`

`Rscript results/scripts/build_rp_composite_projections.R`

Outputs:
- `results/projections/pitchers/rp_composite_projections_2026.csv`

`Rscript results/scripts/build_top50_pitchers_composite_by_role.R`

Outputs:
- `results/projections/pitchers/top50_pitchers_composite_by_role.md`

- Current SP model is SP-only (2018–2025) and models SO, BB, H, ER, W, and QS as per-IP rates.
- SP outcomes are modeled as Poisson with a log(IP) offset.
- Role effects and SV/HLD are omitted in the current SP-only run.
- RP model uses 2018–2025 relievers only and models SO, BB, H, ER, W, and SVHLD as per-IP rates.
- RP outcomes are modeled as Poisson with a log(IP) offset.
- RP fit input excludes any pitcher with ATC-projected GS >= 1 (from `data/atc_ip_projections_2026.csv`).
The sections below describe the hitter model. The pitcher models (SP and RP) use the same backbone but use IP instead of PA for the offset and model the pitching outcomes listed above: SO, BB, H, ER, and W, plus QS for SP or SVHLD for RP.
- Players $i = 1..I$, positions $p = 1..P$, years $y = 1..Y$
- Outcomes $k = 1..8$, ordered: $(H, R, RBI, HR, SB, AVG, OBP, SLG)$
- Count outcomes: $k = 1..5$; continuous outcomes: $k = 6..8$
- Observations indexed by $n = 1..N$, each with player $i[n]$, position $p[n]$, year $y[n]$
- $PA_n$: plate appearances for observation $n$
- Count outcomes: $y_{n,k}$ for $k = 1..5$
- Continuous outcomes:
  - $AVG_n, OBP_n \in (0,1)$ with logit transform
  - $SLG_n > 0$ with log transform
- Transforms:
  - $a_n = \text{logit}(AVG_n)$
  - $o_n = \text{logit}(OBP_n)$
  - $s_n = \log(SLG_n + \varepsilon)$
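The transforms and their inverses, which are needed to map latent-scale projections back to AVG, OBP, and SLG, can be sketched as below. The function names are hypothetical, and `EPS` takes the `1e-4` offset quoted earlier for SLG.

```python
import math

EPS = 1e-4  # offset added to SLG before taking the log

def logit(p: float) -> float:
    """Map a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    """Inverse of logit: map a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def slg_to_latent(slg: float) -> float:
    """Log transform used for SLG."""
    return math.log(slg + EPS)

def latent_to_slg(s: float) -> float:
    """Inverse of slg_to_latent."""
    return math.exp(s) - EPS

# Round trip on illustrative values.
print(inv_logit(logit(0.275)))             # approximately 0.275
print(latent_to_slg(slg_to_latent(0.450))) # approximately 0.450
```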
- $X_n$: fixed effects row (intercept, age, age$^2$)
- $Z^{\text{pos}}_n$: position random effect predictors (intercept, age, age$^2$)
- $Z^{\text{player}}_n$: player random effect predictors (intercept, age)
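Given these design rows, a linear predictor for observation $n$ and outcome $k$ could be written as below. This is a sketch under assumed notation: $b^{\text{pos}}_{p,r,k}$ and $b^{\text{player}}_{i,r,k}$ denote the position and player effects for term $r$ on outcome $k$, and $\gamma_{k,y[n]}$ denotes the year effect for outcome $k$; only $X_n$, $Z^{\text{pos}}_n$, $Z^{\text{player}}_n$, and $\beta_k$ are taken from this document.

$$
\eta_{n,k} = X_n \beta_k + \sum_{r} Z^{\text{pos}}_{n,r}\, b^{\text{pos}}_{p[n],r,k} + \sum_{r} Z^{\text{player}}_{n,r}\, b^{\text{player}}_{i[n],r,k} + \gamma_{k,\,y[n]}
$$

The position sum runs over intercept, age, and age$^2$; the player sum runs over intercept and age.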
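- Count outcomes (per-PA rates via log offset):

$$
y_{n,k} \sim \text{Poisson}\left(PA_n \exp(\eta_{n,k})\right), \quad k = 1..5
$$

equivalently:

$$
\log \mathbb{E}[y_{n,k}] = \log(PA_n) + \eta_{n,k}
$$

- Continuous outcomes:

$$
a_n \sim \mathcal{N}(\eta_{n,6}, \sigma_6), \quad o_n \sim \mathcal{N}(\eta_{n,7}, \sigma_7), \quad s_n \sim \mathcal{N}(\eta_{n,8}, \sigma_8)
$$
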
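- Player random effects use $\{\text{intercept}, \text{age}\}$. Writing $b^{\text{player}}_{i,r}$ for the vector of player $i$'s effects on term $r$ across the eight outcomes:

$$
b^{\text{player}}_{i,r} \sim \mathcal{N}(0, \Sigma^{\text{player}}_r), \quad r \in \{\text{intercept}, \text{age}\}
$$

- Position random effects use $\{\text{intercept}, \text{age}, \text{age}^2\}$. Writing $b^{\text{pos}}_{p,r}$ for the corresponding vector for position $p$:

$$
b^{\text{pos}}_{p,r} \sim \mathcal{N}(0, \Sigma^{\text{pos}}_r), \quad r \in \{\text{intercept}, \text{age}, \text{age}^2\}
$$

- Each $\Sigma^{\text{group}}_r$ is constructed from scale vector $\sigma^{\text{group}}_r$ and correlation matrix $\Omega^{\text{group}}_r$:

$$
\Sigma^{\text{group}}_r = \text{diag}(\sigma^{\text{group}}_r)\, \Omega^{\text{group}}_r\, \text{diag}(\sigma^{\text{group}}_r)
$$
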
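Building a covariance matrix from a scale vector and a correlation matrix amounts to scaling row $a$ and column $b$ of the correlation matrix by the matching scales. A small Python sketch with made-up numbers follows; the function name `covariance_from_scales` is hypothetical, and the repo's Stan code may use a Cholesky-factor form instead.

```python
def covariance_from_scales(sigma, omega):
    """Return diag(sigma) * omega * diag(sigma) as a nested list."""
    k = len(sigma)
    return [[sigma[a] * omega[a][b] * sigma[b] for b in range(k)]
            for a in range(k)]

# Two outcomes with scales 0.5 and 2.0 and correlation 0.25.
sigma = [0.5, 2.0]
omega = [[1.0, 0.25],
         [0.25, 1.0]]
cov = covariance_from_scales(sigma, omega)
print(cov)  # [[0.25, 0.25], [0.25, 4.0]]
```

The diagonal holds the variances (scale squared) and each off-diagonal entry is the correlation times the product of the two scales.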
- For each outcome $k$:

$$
\gamma_{k,t} \sim \mathcal{N}(\rho_k \gamma_{k,t-1}, \sigma_{\text{year},k}), \quad t = 2..Y
$$

- Draw $\gamma_{k,Y+1} \sim \mathcal{N}(\rho_k\gamma_{k,Y}, \sigma_{\text{year},k})$
- Predict $\eta_{n,k}$ for 2026 using age and age$^2$ (with age incremented by +1 from the most recent season), plus the drawn 2026 year effect
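The two projection steps above can be sketched in Python as follows. The sketch makes several assumptions: it treats the second argument of the normal as a standard deviation, it uses fixed illustrative values for $\rho_k$, $\sigma_{\text{year},k}$, and $\gamma_{k,Y}$ where a real run would take one value per posterior draw, and the function names are hypothetical.

```python
import random

def draw_next_year_effect(gamma_last, rho, sigma_year, rng):
    """One draw of gamma_{k,Y+1} ~ Normal(rho * gamma_{k,Y}, sigma_year)."""
    return rng.gauss(rho * gamma_last, sigma_year)

def project_eta(base_eta, gamma_draws):
    """Add each drawn year effect to the age-based part of the predictor."""
    return [base_eta + g for g in gamma_draws]

rng = random.Random(0)  # seeded so the sketch is reproducible

# Illustrative values, not taken from any fitted model.
gamma_2025, rho, sigma_year = 0.10, 0.5, 0.05
draws = [draw_next_year_effect(gamma_2025, rho, sigma_year, rng)
         for _ in range(5000)]

# base_eta stands in for the fixed, position, and player terms evaluated at age + 1.
etas = project_eta(-3.2, draws)
print(sum(draws) / len(draws))  # close to rho * gamma_2025 = 0.05
```

Carrying the year-effect draws through to the predictor is what lets uncertainty about the 2026 season environment show up in the projection intervals.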
- Fixed effects (standardized predictors): $\beta_k \sim \mathcal{N}(0, 2.5^2)$
- Random effect scales (half-normal): $\sigma^{\text{player}}_r, \sigma^{\text{pos}}_r \sim \mathcal{N}^+(0, 1)$
- Non-centered random effects: $z^{\text{player}}_r, z^{\text{pos}}_r \sim \mathcal{N}(0, 2.5^2)$
- Correlations: $\Omega^{\text{group}}_r \sim \text{LKJ}(2)$
- Year AR(1) parameters: $\rho_k \sim \mathcal{N}(0, 0.5)$, $\sigma_{\text{year},k} \sim \mathcal{N}^+(0, 1)$
- Continuous outcome noise: $\sigma_k \sim \mathcal{N}^+(0, 1)$
- Count outcomes are forecasted as rates per PA; totals require a separate PA model.
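One way to turn per-PA rate draws into season totals is to multiply each draw by an external PA forecast and summarize the result. The Python sketch below assumes a single fixed PA value, so it ignores both playing-time uncertainty and Poisson sampling noise; the function name `summarize_totals` and the rate draws are made up for illustration.

```python
def summarize_totals(rate_draws, projected_pa, lower=0.05, upper=0.95):
    """Scale per-PA rate draws by a PA forecast and report simple quantiles."""
    totals = sorted(r * projected_pa for r in rate_draws)
    n = len(totals)

    def pick(q):
        # Nearest-rank style quantile on the sorted totals.
        return totals[min(n - 1, int(q * n))]

    return {"lower": pick(lower), "median": pick(0.5), "upper": pick(upper)}

# Made-up HR-per-PA draws paired with a 600 PA forecast.
rate_draws = [0.030, 0.034, 0.038, 0.041, 0.045]
summary = summarize_totals(rate_draws, projected_pa=600)
print(summary)  # lower near 18, median near 22.8, upper near 27
```

With a distribution of PA forecasts instead of a point value, the same multiplication could be applied draw by draw to widen the intervals accordingly.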
- Seasons with PA < 100 are excluded from the dataset before fitting.