Beat the Streak v2

A PA-level MLB hit prediction model that beats the current state of the art on backtested data.

MLB's Beat the Streak is a free contest with a $5.6 million prize: pick one player each day who you think will get at least one hit. String together 57 correct picks in a row — beating Joe DiMaggio's 56-game hitting streak — and you win. Since the contest launched in 2001, nobody has won.

This project investigates whether it's possible to build a model that makes this achievable, and what that model looks like.

Results

Walk-forward backtested, provably leak-free, validated across 6 MLB seasons:

Metric	BTS v2	SOTA (Garnett 2026)
P@1 (daily best pick)	87.0%	~85%
P@100 (top 100 picks)	91.0%	85%
P@500 (depth of signal)	87.2%	77%

The P@500 advantage is the most robust finding — the model beats SOTA on this metric in every season tested (2020-2025).

Multi-season validation

Season  Training data    P@1     P@100   P@500
2020    2019             89.6%   83.0%   83.0%
2021    2019-20          89.0%   87.0%   86.4%
2022    2019-21          90.5%   92.0%   89.0%
2023    2019-22          87.9%   95.0%   88.4%
2024    2019-23          81.6%   87.0%   85.8%
2025    2019-24          87.0%   91.0%   87.2%

What this means for BTS

At 87% P@1, streak simulation over 50,000 seasons gives a median longest streak of 26 games and a 57+ game streak in ~0.6% of seasons. Still very unlikely, but meaningfully better than the mathematical baseline.

The Story

v1: Counting to 57 (2021)

The original project achieved ~71.4% daily accuracy using Decision Trees on game-level Statcast data. The conclusion then: "the mathematical structure of the problem, not the quality of the model, is what makes Beat the Streak essentially unwinnable."

That conclusion was correct — at 71.4%. The question this project asks is: how much can we improve that number, and what does it take?

What the field built (2020-2026)

In the intervening years, several researchers attacked this problem:

Alceo & Henriques (2020): MLP achieving 85% Precision@100. The academic benchmark.
Garnett (2026): MLP + LightGBM ensemble matching 85% P@100 but degrading more slowly at depth (77% P@500). Identified lineup position as the most important feature (3.8x more than #2).

All existing approaches model at the batter-game level — one prediction per batter per day. They aggregate pitch-level data into game-level features, losing PA-level granularity.

v2: The PA-level insight

The game-level question decomposes naturally:

P(>=1 hit in game) = 1 - PROD(P(no hit in each PA))

If we model hit probability at the plate appearance level and aggregate, we get:

More training data. ~180K batter-game records per season become ~180K PAs per season, with ~4 PAs per batter-game giving us richer signal per record.
Lineup position becomes natural. The PA model doesn't need it as a feature — a leadoff hitter's advantage comes from having 4.5 PAs instead of 3.6, which the aggregation handles automatically.
Pitcher matchup at the right level. Platoon splits, pitcher quality, and pitch arsenal features are PA-level phenomena, not game-level.

The leakage story

The most important part of this project was finding and fixing three separate data leakage bugs:

Leakage #1: Static features using future data. platoon_hr and batter_gb_hit_rate were computed over ALL data, including the test period. A batter's September 2025 performance leaked into their April 2025 predictions. Fix: expanding calculations with date-level shift(1).

Leakage #2: K-Means clustering instability. Pitcher archetypes were clustered on the full dataset. When re-clustered on training data only, 90.8% of pitchers changed clusters — making batter_vs_arch_hr (our top feature by importance!) entirely dependent on future data. Fix: removed clustering features entirely.

Leakage #3: Doubleheader contamination. shift(1) on game-level features meant game 2 of a doubleheader could see game 1's results. Fix: aggregate to date level before shifting.

The impact: P@1 dropped from 94% (leaky) to 87% (clean). Every leakage bug looked like a legitimate improvement until rigorously tested. This is consistent with Garnett's warning that "data leakage via iterative test evaluation is the most common invisible failure mode."

Verification: A "nuclear test" manually recomputes every feature from scratch for random test PAs using only pre-date data, comparing against the pipeline's output. 260/260 checks passed.

Architecture

Two-stage PA-level model:

MLB Stats API -> raw JSON -> PA-level Parquet -> 13 temporal features -> LightGBM -> P(hit|PA) -> P(>=1 hit|game)

Features (13)

Feature	Type	What it captures
batter_hr_{7,30,60,120}g	Rolling	Short-term form through long-term ability
batter_whiff_60g	Rolling	Contact quality
batter_count_tendency_30g	Rolling	Plate discipline
batter_gb_hit_rate	Expanding	Speed (infield hit proxy)
platoon_hr	Expanding	Batter x pitcher handedness matchup
pitcher_hr_30g	Rolling	Pitcher quality
pitcher_entropy_30g	Rolling	Arsenal diversity (harder to hit)
weather_temp	Context	Temperature affects ball flight
park_factor	Expanding	Venue effect on hit rates
days_rest	Context	Rust after time off

Every feature is provably temporal — only data from dates strictly before the prediction date is used.

What didn't work

Lineup position as a PA feature: Double-counts with the aggregation step
Home/away: No PA-level effect
Pitcher archetypes (K-Means): Cluster assignments unstable across train/test splits
MLP ensemble: Trees handle our interaction features better
Umpire zone tendency: Real effect (~3.6% CSR variance) but zero predictive power
Exit velocity trends, wind, career PAs, day of week: All noise
Calibration (Platt/isotonic): Hurts P@K by reducing training data for base model
Training data before 2019: Game changed enough that old data hurts (-1.1%)

Data

Source: MLB Stats API v1.1 (game feed endpoint)
Scope: 9 seasons (2017-2025), 1.5M plate appearances
Training window: 2019 onward (optimal)
Filters: Regular season only. 7-inning COVID doubleheaders (2021) excluded.

Usage

# Install
uv sync

# Pull game data
bts data pull --start 2019-03-20 --end 2025-10-01

# Build PA Parquet
bts data build --seasons 2019,2020,2021,2022,2023,2024,2025

# Evaluate (Python)
from bts.features.compute import compute_all_features
from bts.evaluate.backtest import walk_forward_evaluate

df = compute_all_features(pd.concat([
    pd.read_parquet(f"data/processed/pa_{y}.parquet")
    for y in range(2019, 2026)
]))
metrics, preds = walk_forward_evaluate(df, test_season=2025)

Key Learnings

PA-level modeling beats game-level. More data, natural lineup position handling, pitcher matchup at the right granularity.
Leakage is invisible until you test for it. Three bugs, each discovered only through systematic verification. The "nuclear test" should be standard practice.
Feature selection changes completely when leakage is present. Our top feature under leakage (batter_vs_arch_hr) was entirely fake. Features that "hurt" under leakage (lineup_position) helped with clean data, and vice versa.
More data helps, to a point. Each additional training season improves the model, but data older than ~6 years hurts because the game changes.
Simpler models with clean features beat complex models with noisy ones. 13 features > 18. LightGBM alone > LightGBM + MLP. Default hyperparameters are fine.
P@1 has wide confidence intervals. +/-5% on a 184-day season. Multi-season validation is essential.

References

Garnett (2026), "Chasing $5.6 Million with Machine Learning" — current SOTA
Alceo & Henriques (2020), "Beat the Streak: Prediction of MLB Base Hits Using ML" — academic benchmark
Grinsztajn et al. (NeurIPS 2022), "Why do tree-based models still outperform deep learning on tabular data?"
McElfresh et al. (NeurIPS 2023), "When Do Neural Nets Outperform Boosted Trees on Tabular Data?"
Nathan, "Physics of Baseball" — ball flight physics, air density effects

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
docs		docs
scripts		scripts
src/bts		src/bts
tests		tests
.gitignore		.gitignore
.python-version		.python-version
ARCHITECTURE.md		ARCHITECTURE.md
CLAUDE.md		CLAUDE.md
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Beat the Streak v2

Results

Multi-season validation

What this means for BTS

The Story

v1: Counting to 57 (2021)

What the field built (2020-2026)

v2: The PA-level insight

The leakage story

Architecture

Features (13)

What didn't work

Data

Usage

Key Learnings

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Beat the Streak v2

Results

Multi-season validation

What this means for BTS

The Story

v1: Counting to 57 (2021)

What the field built (2020-2026)

v2: The PA-level insight

The leakage story

Architecture

Features (13)

What didn't work

Data

Usage

Key Learnings

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages