PHASM

Probabilistic hierarchical autoregressive sabermetric model

PHASM is a Bayesian projection system for MLB hitters and pitchers. It combines multivariate outcome modeling, hierarchical player/position effects, and AR(1) year trends to produce probabilistic forecasts of per-PA/per-IP rates and rate stats. The system also supports total-count projections when paired with external PA/IP forecasts.

What this does

Fits a joint multivariate Bayesian model (rstan) for hitters: H, R, RBI, HR, SB (per-PA rates) plus AVG, OBP, SLG.
Fits a joint multivariate Bayesian model (rstan) for SP: SO, BB, H, ER, W, QS (per-IP rates).
Fits a joint multivariate Bayesian model (rstan) for RP: SO, BB, H, ER, W, SVHLD (per-IP rates).
Uses age/aging curve and position (hitters); pitchers are fit in separate SP and RP models.
Player random intercepts and age slopes; position random intercepts and age/age^2 slopes.
Year random intercepts with AR(1) evolution.

Files

Hitter Stan model: models/model.stan
Hitter R driver: models/fit_model.R
SP Stan model: models/sp_model.stan
SP R driver: models/fit_sp_model.R
RP Stan model: models/rp_model.stan
RP R driver: models/fit_rp_model.R
Hitter inputs: data/fangraphs_batters_2018_2025.csv
Pitcher inputs: data/fangraphs_pitchers_2018_2025.csv
Hitter outputs (after fitting):
- models/model_fit.rds
- models/model_inputs.rds
- results/projections/batters/category_projections_2026.csv
SP outputs (after fitting):
- models/sp_model_fit.rds
- models/sp_model_inputs.rds
- results/projections/pitchers/sp_category_projections_2026.csv
RP outputs (after fitting):
- models/rp_model_fit.rds
- models/rp_model_inputs.rds
- results/projections/pitchers/rp_category_projections_2026.csv

Covariates used

Age (standardized) and age^2
Pitchers: separate SP and RP models (no role random effects)

Notes

2026 covariates are taken from the most recent season per player (age advanced by +1).
Hitter count outcomes are modeled as Poisson with a log(PA) offset; projections are per-PA rates.
Pitcher count outcomes are modeled as Poisson with a log(IP) offset; projections are per-IP rates.
AVG/OBP use a logit transform; SLG uses log(SLG + 1e-4).
Handedness and Statcast covariates are excluded by design.
Seasons with PA < 100 are excluded from the dataset before fitting.
Seasons with IP < 20 are excluded from the pitcher dataset before fitting.
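As a rough illustration of the first note, the sketch below picks each player's most recent season and advances age by one year. It is written in Python for brevity and is not the repo's R code; the row layout, player names, and the function name covariates_2026 are all hypothetical. It follows the note literally, so a player whose last qualifying season was 2024 is also advanced by exactly one year.

```python
# Hypothetical season rows: (player, season, age). Values are made up.
seasons = [
    ("Player A", 2023, 27),
    ("Player A", 2025, 29),
    ("Player B", 2024, 31),
]


def covariates_2026(rows):
    """Keep each player's most recent season and advance age by one year."""
    latest = {}
    for player, season, age in rows:
        if player not in latest or season > latest[player][0]:
            latest[player] = (season, age)
    return {player: age + 1 for player, (season, age) in latest.items()}


ages_2026 = covariates_2026(seasons)
print(ages_2026)
```

Since the model uses standardized age, the advanced age would presumably be standardized with the same mean and scale as the training data before prediction; that detail is an assumption here.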

Workflow

1) Generate the hitter dataset

This repo pulls FanGraphs data via baseballr and builds the model dataset:

Rscript data/build_fangraphs_batters_from_baseballr.R

That script:

Fetches 2018–2025 batting leaderboards from FanGraphs (requires internet)
Keeps seasons with PA >= 100
Keeps players with >= 100 PA in either 2024 or 2025
Writes data/fangraphs_batters_2018_2025.csv
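The two filters above can be sketched as follows. This is a Python illustration rather than the actual R script; the row layout and the name filter_seasons are hypothetical, and the order of the two filters (season-level first, then player-level) is an assumption.

```python
MIN_PA = 100           # threshold stated above
RECENT = {2024, 2025}  # a player must qualify in at least one of these

# Hypothetical rows: (player, season, PA). Values are made up.
rows = [
    ("Player A", 2023, 520),
    ("Player A", 2025, 480),
    ("Player B", 2022, 300),  # never qualifies in 2024 or 2025
    ("Player C", 2024, 60),   # below the PA threshold
    ("Player C", 2025, 150),
]


def filter_seasons(rows, min_pa=MIN_PA, recent=RECENT):
    """Apply the season-level filter, then the player-level filter."""
    qualified = [r for r in rows if r[2] >= min_pa]
    active = {player for player, season, _ in qualified if season in recent}
    return [r for r in qualified if r[0] in active]


kept = filter_seasons(rows)
print(kept)
```

The same shape applies to the pitcher dataset with IP >= 20 in place of PA >= 100.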

2) Fit the hitter model (Stan)

Rscript models/fit_model.R

Outputs:

models/model_fit.rds
models/model_inputs.rds
results/projections/batters/category_projections_2026.csv

3) Composite projections (optional)

Rscript results/scripts/build_composite_projections.R

Outputs:

results/projections/batters/composite_projections_2026.csv

4) Top 20 composites by position (optional)

Rscript results/scripts/build_top20_composite_by_position.R

Outputs:

results/projections/batters/top20_composite_by_position.md

5) Hitter latent fit plots (optional)

Rscript results/scripts/plot_latent_fit_top100_by_category.R

Outputs:

results/plots/fitted_outcome_curves/batters/latent_fit_top100_<CATEGORY>.pdf

6) Generate the pitcher dataset

Rscript data/build_fangraphs_pitchers_from_baseballr.R

That script:

Fetches 2018–2025 pitching leaderboards from FanGraphs (requires internet)
Keeps seasons with IP >= 20
Keeps pitchers with >= 20 IP in either 2024 or 2025
Writes data/fangraphs_pitchers_2018_2025.csv

7) Fit the SP model (Stan)

Rscript models/fit_sp_model.R

Outputs:

models/sp_model_fit.rds
models/sp_model_inputs.rds
results/projections/pitchers/sp_category_projections_2026.csv

7b) Fit the RP model (Stan)

Rscript models/fit_rp_model.R

Outputs:

models/rp_model_fit.rds
models/rp_model_inputs.rds
results/projections/pitchers/rp_category_projections_2026.csv

8) SP latent fit plots (optional)

Rscript results/scripts/plot_latent_fit_top100_by_sp_category.R

Outputs:

results/plots/fitted_outcome_curves/pitchers/starters/sp_latent_fit_top100_<CATEGORY>.pdf

8b) SP derived latent fits (optional)

Rscript results/scripts/plot_latent_fit_top100_sp_derived.R

Outputs:

results/plots/fitted_outcome_curves/pitchers/starters/sp_latent_fit_derived_<METRIC>.pdf

8c) RP latent fit plots (optional)

Rscript results/scripts/plot_latent_fit_top100_by_rp_category.R

Outputs:

results/plots/fitted_outcome_curves/pitchers/relievers/rp_latent_fit_top100_<CATEGORY>.pdf

8d) RP derived latent fits (optional)

Rscript results/scripts/plot_latent_fit_top100_rp_derived.R

Outputs:

results/plots/fitted_outcome_curves/pitchers/relievers/rp_latent_fit_derived_<METRIC>.pdf

9) 2026 interval projections by position (optional)

Rscript results/scripts/plot_2026_intervals_by_position.R

Outputs:

results/plots/interval_projections/batters/projection_intervals_2026_<POSITION>.pdf

10) SP 2026 interval projections by role (optional)

Rscript results/scripts/plot_2026_sp_intervals_by_role.R

Outputs:

results/plots/interval_projections/pitchers/starters/sp_intervals_2026_<ROLE>.pdf

10b) RP 2026 interval projections by role (optional)

Rscript results/scripts/plot_2026_rp_intervals_by_role.R

Outputs:

results/plots/interval_projections/pitchers/relievers/rp_intervals_2026_<ROLE>.pdf

11) SP composite projections (optional)

Rscript results/scripts/build_sp_composite_projections.R

Outputs:

results/projections/pitchers/sp_composite_projections_2026.csv

12) Top 50 pitcher composites by role (optional)

Rscript results/scripts/build_top50_pitchers_composite_by_role.R

Outputs:

results/projections/pitchers/top50_pitchers_composite_by_role.md

13) RP composite projections (optional)

Rscript results/scripts/build_rp_composite_projections.R

Outputs:

results/projections/pitchers/rp_composite_projections_2026.csv

14) Top 50 pitcher composites by role (SP + RP) (optional)

Rscript results/scripts/build_top50_pitchers_composite_by_role.R

Outputs:

results/projections/pitchers/top50_pitchers_composite_by_role.md
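For convenience, the non-optional steps (1, 2, 6, 7, 7b) could be chained with a small wrapper like the one below. This is a hypothetical helper, not part of the repo; the script paths are copied from the steps above, and the name run_pipeline and the dry_run flag are assumptions. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Required steps, in the order given above; optional plotting and
# composite scripts are left out.
REQUIRED_STEPS = [
    "data/build_fangraphs_batters_from_baseballr.R",
    "models/fit_model.R",
    "data/build_fangraphs_pitchers_from_baseballr.R",
    "models/fit_sp_model.R",
    "models/fit_rp_model.R",
]


def run_pipeline(steps=REQUIRED_STEPS, dry_run=True):
    """Run each script with Rscript, stopping at the first failure."""
    commands = [["Rscript", script] for script in steps]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a script fails.
            subprocess.run(cmd, check=True)
    return commands


commands = run_pipeline()  # dry run: prints the commands only
```

Calling run_pipeline(dry_run=False) from the repo root would execute the scripts; the RP fit additionally expects data/atc_ip_projections_2026.csv to be present, per the RP model notes.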

SP model notes

Current SP model is SP-only (2018–2025) and models SO, BB, H, ER, W, and QS as per-IP rates.
SP outcomes are modeled as Poisson with a log(IP) offset.
Role effects and SV/HLD are omitted in the current SP-only run.

RP model notes

RP model uses 2018–2025 relievers only and models SO, BB, H, ER, W, and SVHLD as per-IP rates.
RP outcomes are modeled as Poisson with a log(IP) offset.
RP fit input excludes any pitcher with ATC-projected GS >= 1 (from data/atc_ip_projections_2026.csv).
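The exclusion rule in the last note might look like the sketch below (Python for illustration, not the repo's R code). The column names "Name" and "GS" are assumptions about data/atc_ip_projections_2026.csv, and the rows are made up. Pitchers missing from the ATC file are kept here, which is also an assumption.

```python
import csv
import io

# Stand-in for data/atc_ip_projections_2026.csv. The column names
# ("Name", "GS") are assumptions, and the rows are made up.
ATC_CSV = """Name,GS
Pitcher A,0
Pitcher B,12
Pitcher C,1
"""


def projected_starters(csv_text):
    """Return names with ATC-projected GS >= 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Name"] for row in reader if float(row["GS"]) >= 1}


def rp_fit_input(pitchers, csv_text):
    """Drop any pitcher projected for at least one start."""
    starters = projected_starters(csv_text)
    return [p for p in pitchers if p not in starters]


rp_pool = rp_fit_input(["Pitcher A", "Pitcher B", "Pitcher C", "Pitcher D"], ATC_CSV)
print(rp_pool)
```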

Model specification

The sections below describe the hitter model. The SP and RP pitcher models use the same backbone but use IP instead of PA for the offset and model SO, BB, H, ER, and W, plus QS (SP) or SVHLD (RP).

Notation

Players $i = 1..I$, positions $p = 1..P$, years $y = 1..Y$
Outcomes $k = 1..8$, ordered: $(H, R, RBI, HR, SB, AVG, OBP, SLG)$
Count outcomes: $k = 1..5$; continuous outcomes: $k = 6..8$
Observations indexed by $n = 1..N$, each with player $i[n]$, position $p[n]$, year $y[n]$

Data and transforms

$PA_n$: plate appearances for observation $n$
Count outcomes: $y_{n,k}$ for $k=1..5$
Continuous outcomes:
- $AVG_n, OBP_n \in (0,1)$ with logit transform
- $SLG_n > 0$ with log transform
Transforms:
- $a_n = \text{logit}(AVG_n)$
- $o_n = \text{logit}(OBP_n)$
- $s_n = \log(SLG_n + \varepsilon)$, with $\varepsilon = 10^{-4}$
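A minimal Python sketch of these transforms and their inverses, which is what maps fitted values back to AVG, OBP, and SLG. The offset 1e-4 is the value given in the notes earlier in this README; the function names and the example stat line are made up.

```python
import math

EPS = 1e-4  # offset in log(SLG + 1e-4)


def logit(p):
    """Map a rate in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: map a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def transform_season(avg, obp, slg):
    """Return the modeled scale: (logit AVG, logit OBP, log(SLG + EPS))."""
    return logit(avg), logit(obp), math.log(slg + EPS)


def back_transform(a, o, s):
    """Undo transform_season to recover AVG, OBP, SLG."""
    return inv_logit(a), inv_logit(o), math.exp(s) - EPS


a, o, s = transform_season(0.275, 0.350, 0.480)  # made-up stat line
recovered = back_transform(a, o, s)
print(recovered)
```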

Design matrices

$X_n$: fixed effects row (intercept, age, age$^2$)
$Z^{\text{pos}}_n$: position random effect predictors (intercept, age, age$^2$)
$Z^{\text{player}}_n$: player random effect predictors (intercept, age)

Linear predictors (for each outcome k)

$$ \eta_{n,k} = X_n \beta_k + \sum_{r=1}^{R_{\text{pos}}} Z^{\text{pos}}_{n,r}\,u^{\text{pos}}_{p[n],k,r} + \sum_{r=1}^{R_{\text{player}}} Z^{\text{player}}_{n,r}\,u^{\text{player}}_{i[n],k,r} + \gamma_{k, y[n]}. $$
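To make the indexing concrete, here is a small Python sketch that evaluates the linear predictor for one observation and one outcome. All coefficient values are made up, and the function names are hypothetical; in the fitted model these quantities come from posterior draws.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def linear_predictor(x, beta, z_pos, u_pos, z_player, u_player, gamma):
    """eta = X beta + Z_pos . u_pos + Z_player . u_player + gamma."""
    return dot(x, beta) + dot(z_pos, u_pos) + dot(z_player, u_player) + gamma


age = 0.5                     # standardized age (illustrative)
x = [1.0, age, age ** 2]      # fixed effects row: intercept, age, age^2
z_pos = [1.0, age, age ** 2]  # position predictors (R_pos = 3)
z_player = [1.0, age]         # player predictors (R_player = 2)

eta = linear_predictor(
    x, [-3.5, 0.05, -0.10],   # made-up beta_k
    z_pos, [0.10, 0.00, 0.02],  # made-up position effects for p[n]
    z_player, [0.30, -0.04],    # made-up player effects for i[n]
    0.02,                       # made-up year effect gamma_{k, y[n]}
)
print(eta)
```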

Likelihood

Count outcomes (per-PA rates via log offset):

$$ y_{n,k} \sim \text{Poisson}\bigl(\exp(\eta_{n,k}) \cdot PA_n\bigr), \quad k=1..5 $$

equivalently, using the log-mean parameterization (Stan's poisson_log):

$$ y_{n,k} \sim \text{PoissonLog}(\eta_{n,k} + \log(PA_n)). $$
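The equivalence of the two forms can be checked numerically. The Python sketch below evaluates the Poisson log density both ways for made-up values of the rate, PA, and count; the function names are hypothetical.

```python
import math


def poisson_logpmf(y, mean):
    """log P(Y = y) for Y ~ Poisson(mean)."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def poisson_log_logpmf(y, log_mean):
    """Same density, parameterized by the log of the mean."""
    return y * log_mean - math.exp(log_mean) - math.lgamma(y + 1)


eta = math.log(0.045)  # log rate per PA (illustrative value)
pa = 600               # plate appearances
y = 30                 # observed count

lp_rate = poisson_logpmf(y, math.exp(eta) * pa)
lp_offset = poisson_log_logpmf(y, eta + math.log(pa))
print(lp_rate, lp_offset)
```

Working on the log scale avoids exponentiating the linear predictor before multiplying by PA, which tends to be the more numerically stable choice.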

Continuous outcomes:

$$ a_n \sim \mathcal{N}(\eta_{n,6}, \sigma_6), \quad o_n \sim \mathcal{N}(\eta_{n,7}, \sigma_7), \quad s_n \sim \mathcal{N}(\eta_{n,8}, \sigma_8). $$

Random effects

Player random effects use $\{\text{intercept}, \text{age}\}$:

$$ u^{\text{player}}_{i,*,r} \sim \mathcal{MVN}(0, \Sigma^{\text{player}}_r). $$

Position random effects use $\{\text{intercept}, \text{age}, \text{age}^2\}$:

$$ u^{\text{pos}}_{p,*,r} \sim \mathcal{MVN}(0, \Sigma^{\text{pos}}_r). $$

Each $\Sigma^{\text{group}}_r$ is constructed from scale vector $\sigma^{\text{group}}_r$ and correlation matrix $\Omega^{\text{group}}_r$:

$$ \Sigma^{\text{group}}_r = \text{diag}(\sigma^{\text{group}}_r)\,\Omega^{\text{group}}_r\,\text{diag}(\sigma^{\text{group}}_r). $$
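As a small worked example of this construction, the Python sketch below builds a covariance matrix from a scale vector and a correlation matrix. It uses three outcomes instead of eight to keep the output readable, and all numbers are made up.

```python
def build_covariance(sigma, omega):
    """Return diag(sigma) @ omega @ diag(sigma) as a nested list."""
    k = len(sigma)
    return [[sigma[i] * omega[i][j] * sigma[j] for j in range(k)] for i in range(k)]


sigma = [0.5, 0.2, 0.1]  # made-up scales, one per outcome
omega = [                # made-up correlation matrix
    [1.0, 0.3, -0.2],
    [0.3, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]

cov = build_covariance(sigma, omega)
for row in cov:
    print(row)
```

The diagonal of the result holds the variances (the squared scales), and each off-diagonal entry is the correlation multiplied by the two corresponding scales.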

Year effects (AR(1))

For each outcome $k$:

$$ \gamma_{k,1} \sim \mathcal{N}\Bigl(0, \frac{\sigma_{\text{year},k}}{\sqrt{1-\rho_k^2}}\Bigr), \quad \gamma_{k,y} \sim \mathcal{N}(\rho_k \gamma_{k,y-1}, \sigma_{\text{year},k}), \quad y=2..Y. $$
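A short Python simulation of this process for a single outcome may help: the first year is drawn from the stationary distribution and each later year from the AR(1) transition. The second argument of the normal is treated as a standard deviation, matching the formula above; the values of rho and sigma_year are made up, and the function name is hypothetical.

```python
import math
import random


def simulate_year_effects(n_years, rho, sigma_year, rng):
    """Draw gamma_1..gamma_Y from a stationary AR(1) process."""
    stationary_sd = sigma_year / math.sqrt(1.0 - rho ** 2)
    gamma = [rng.gauss(0.0, stationary_sd)]
    for _ in range(1, n_years):
        gamma.append(rng.gauss(rho * gamma[-1], sigma_year))
    return gamma


rng = random.Random(0)  # fixed seed for reproducibility
# Eight seasons, matching 2018 through 2025.
gamma = simulate_year_effects(n_years=8, rho=0.6, sigma_year=0.05, rng=rng)
print(gamma)
```

The stationary standard deviation is only defined for |rho| < 1, so the sketch assumes rho stays inside that range.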

2026 projection

Draw $\gamma_{k,Y+1} \sim \mathcal{N}(\rho_k\gamma_{k,Y}, \sigma_{\text{year},k})$
Predict $\eta_{n,k}$ for 2026 using age and age$^2$ (with age incremented by +1 from the most recent season), plus the drawn 2026 year effect
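The two steps above can be sketched in Python for one player and one count outcome. This is a simplification: it holds the parameters fixed at single made-up values, whereas a full posterior projection would repeat the calculation once per posterior draw with that draw's parameters. The function name and arguments are hypothetical.

```python
import math
import random
import statistics


def project_rate(eta_without_year, gamma_last, rho, sigma_year, rng, n_draws=1000):
    """Draw the next year effect and return per-PA rate draws."""
    rates = []
    for _ in range(n_draws):
        gamma_next = rng.gauss(rho * gamma_last, sigma_year)
        rates.append(math.exp(eta_without_year + gamma_next))
    return rates


rng = random.Random(0)
rates = project_rate(
    eta_without_year=math.log(0.045),  # made-up 2026 predictor minus year effect
    gamma_last=0.02,                   # made-up 2025 year effect
    rho=0.6,
    sigma_year=0.05,
    rng=rng,
)
print(statistics.median(rates))
```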

Priors (aligned with Stan prior recommendations)

Fixed effects (standardized predictors): $\beta_k \sim \mathcal{N}(0, 2.5^2)$
Random effect scales (half-normal): $\sigma^{\text{player}}_r, \sigma^{\text{pos}}_r \sim \mathcal{N}^+(0, 1)$
Non-centered random effects: $z^{\text{player}}_r, z^{\text{pos}}_r \sim \mathcal{N}(0, 2.5^2)$
Correlations: $\Omega^{\text{group}}_r \sim \text{LKJ}(2)$
Year AR(1) parameters: $\rho_k \sim \mathcal{N}(0, 0.5)$, $\sigma_{\text{year},k} \sim \mathcal{N}^+(0, 1)$
Continuous outcome noise: $\sigma_k \sim \mathcal{N}^+(0, 1)$

Notes

Count outcomes are forecasted as rates per PA; totals require a separate PA model.
Seasons with PA < 100 are excluded from the dataset before fitting.
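To illustrate the first note, the Python sketch below scales per-PA rate draws by an external PA forecast to get expected totals. The rate draws and the PA value are made up, and the helper names are hypothetical. It reports expected counts only and ignores the extra Poisson sampling variation around them.

```python
def expected_totals(rate_draws, projected_pa):
    """Scale per-PA rate draws by an external PA forecast."""
    return [rate * projected_pa for rate in rate_draws]


def interval(draws, lower=0.05, upper=0.95):
    """Return a simple empirical interval from sorted draws."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


rate_draws = [0.040, 0.043, 0.045, 0.047, 0.050]  # made-up HR/PA draws
totals = expected_totals(rate_draws, projected_pa=600)  # made-up PA forecast
print(totals)
print(interval(totals))
```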

timwhite0/phasm