The Best Box Score

Methodology

Every metric on The Best Box Score is computed from raw play-by-play and pitch-level data. No made-up weights, no borrowed analysis, no black boxes. Every number traces back to counted historical occurrences.

Philosophy

We build original analysis from public data. Our win probability model, ump audits, leverage calculations, and game narratives are our own work — not reproductions of what other sites publish.

Three principles guide every metric we ship:

  1. Fully data-driven. Every probability and weight traces to counted occurrences in historical data. No expert judgment, no assumed distributions.
  2. Show the receipts. Any number on the site can be drilled into: the underlying sample size, the source dataset, and the exact calculation.
  3. Transparent uncertainty. Where we're estimating, we say so. Confidence intervals come from the data itself, not assumptions.

Win Probability Model

Our win probability model uses pre-computed lookup tables derived from over 120 years of MLB game outcomes. For any combination of inning, batting team, outs, base runners, and score differential, we know the historical probability that the home team wins.

Win Expectancy Tables

The foundation is Greg Stoll's win expectancy tables, computed from Retrosheet play-by-play data spanning 1903–2024. This covers 16,094 unique game states across 15+ million historical games.

Example receipt: Bottom of the 9th, 2 outs, bases loaded, tie game — the home team won 4,521 of 8,934 such games historically (50.6%). That's not a model prediction. It's a counted fact.
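
As a minimal sketch of what a receipt like this could contain, the snippet below turns a pair of counts into a win rate plus an interval. The Wilson score interval and the function name counted_probability are assumptions made for illustration; the site's actual interval method is not specified here.

```python
import math

def counted_probability(wins: int, games: int, z: float = 1.96):
    """Return (point estimate, lower, upper) for a counted win rate.

    The Wilson score interval is an assumption of this sketch, not
    necessarily the method used on the site.
    """
    if games <= 0:
        raise ValueError("need at least one observed game")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return p, center - half, center + half

# Counts taken from the example receipt above.
p, lo, hi = counted_probability(4521, 8934)
print(f"{p:.1%} of {8934} games, interval {lo:.1%} to {hi:.1%}")
```

The point estimate is just wins divided by games; the interval shrinks as the sample size grows, which is why the sample size belongs in the receipt.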

Game State Definition

Each game state is defined by five dimensions:

Inning:       1 through 9+ (extra innings normalized to 9)
Batting team: Home (bottom) or Visitor (top)
Outs:         0, 1, or 2
Runners:      8 configurations (empty through bases loaded)
Score diff:   Home score minus away score (clamped to ±10)
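
The five dimensions can be sketched as a small lookup key. The names GameState and make_state, and the bitmask encoding of runners, are hypothetical choices for this illustration; only the normalization rules (extra innings folded into the 9th, score differential clamped to ±10) come from the definition above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GameState:
    inning: int         # 1..9, extra innings folded into 9
    home_batting: bool  # True = bottom half, False = top half
    outs: int           # 0, 1, or 2
    runners: int        # 0..7 bitmask: 1 = first, 2 = second, 4 = third
    score_diff: int     # home minus away, clamped to [-10, 10]

def make_state(inning, home_batting, outs, on_first, on_second, on_third,
               home_score, away_score) -> GameState:
    """Normalize raw game information into a hashable lookup key."""
    if inning < 1 or outs not in (0, 1, 2):
        raise ValueError("invalid inning or outs")
    runners = (1 if on_first else 0) | (2 if on_second else 0) | (4 if on_third else 0)
    diff = max(-10, min(10, home_score - away_score))
    return GameState(min(inning, 9), home_batting, outs, runners, diff)

# Bottom of the 11th, 2 outs, bases loaded, tie game maps to a 9th-inning key.
state = make_state(11, True, 2, True, True, True, 3, 3)
```

Because the dataclass is frozen, instances are hashable and can be used directly as dictionary keys in a win expectancy table.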

WPA (Win Probability Added)

The WPA for each play is simply the change in win probability from one state to the next. A bases-clearing double in the 9th of a tie game produces a large WPA swing. A routine groundout in a blowout produces almost none.

WPA = Win Probability (after play) − Win Probability (before play)
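
A short sketch of that subtraction applied across a whole game: given the home win probability before the first play and after each play, the per-play WPA values are the successive differences. The function name wpa_series and the probabilities are hypothetical.

```python
def wpa_series(win_probs):
    """Per-play WPA from a sequence of win probabilities.

    win_probs[0] is the probability before the first play and
    win_probs[i] is the probability after play i.
    """
    return [after - before for before, after in zip(win_probs, win_probs[1:])]

# Hypothetical home win probabilities, for illustration only.
probs = [0.50, 0.55, 0.48, 0.90]
swings = wpa_series(probs)
```

The swings telescope: their sum equals the final probability minus the starting probability, which is a useful sanity check on any play-by-play feed.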

The interactive chart on each game page plots win probability after every play, showing the full narrative arc of the game.

Counterfactual Analysis

Our most original contribution is a unified counterfactual framework for measuring the impact of events that change the count — most notably, missed umpire calls. Instead of asking “what did happen?” we ask “what would the expected outcome have been with the correct call?”

The Core Formula

Impact = E[WPA | counterfactual count] − E[WPA | actual count]

Where the expected WPA at any count is:

E[WPA | count] = Σ P(outcome | count) × WPA(outcome | game state)

This formula handles every case with no special logic. Terminal calls (walk or strikeout) are simply the degenerate case where all probability mass concentrates on one outcome.
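
The two formulas above can be sketched in a few lines. All probabilities and WPA values below are toy numbers with a collapsed outcome set, chosen only to make the arithmetic visible; they are not the site's Statcast figures, and the function names are hypothetical.

```python
def expected_wpa(outcome_probs, outcome_wpa):
    """E[WPA | count] = sum of P(outcome | count) * WPA(outcome | game state)."""
    if abs(sum(outcome_probs.values()) - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * outcome_wpa[o] for o, p in outcome_probs.items())

def call_impact(counterfactual_probs, actual_probs, outcome_wpa):
    """Impact = E[WPA | counterfactual count] - E[WPA | actual count]."""
    return (expected_wpa(counterfactual_probs, outcome_wpa)
            - expected_wpa(actual_probs, outcome_wpa))

# Toy WPA per outcome in one fixed game state (illustrative values).
wpa = {"walk": 0.04, "strikeout": -0.03, "in_play": 0.01}

# Non-terminal call: two full distributions over outcomes.
actual = {"walk": 0.2, "strikeout": 0.3, "in_play": 0.5}
counterfactual = {"walk": 0.5, "strikeout": 0.1, "in_play": 0.4}
non_terminal = call_impact(counterfactual, actual, wpa)

# Terminal call: all probability mass sits on a single outcome.
terminal = call_impact({"walk": 1.0}, {"strikeout": 1.0}, wpa)
```

The terminal case goes through the same function with one-outcome distributions, so no branching is needed.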

How It Works

Consider a ball called a strike on a 2-0 count. The actual count becomes 2-1; the counterfactual (correct) count would have been 3-0.

Actual count: 2-1
  P(walk | 2-1)  = 12.3%    ← from Statcast 2021-2024
  P(K | 2-1)     = 18.1%
  P(single | 2-1)= 16.2%
  ...

Counterfactual count: 3-0
  P(walk | 3-0)  = 41.2%    ← from Statcast 2021-2024
  P(K | 3-0)     =  5.1%
  P(single | 3-0)= 18.3%
  ...

Each outcome's WPA depends on the game state (inning,
score, runners, outs). We compute the probability-weighted
average across all outcomes for each count, then take the
difference.

Impact = E[WPA | 3-0] - E[WPA | 2-1]

Terminal vs. Non-Terminal Calls

For a missed call on a 3-2 count, the math is deterministic — ball 4 means a walk, strike 3 means a strikeout. We know exactly what should have happened:

Ball called strike on 3-2 count:
  Actual: strikeout    → P(strikeout) = 1.0
  Correct: walk        → P(walk) = 1.0
  Impact = WPA(walk) - WPA(strikeout)   # no uncertainty

For non-terminal calls (e.g., a missed call on a 1-1 count), there are many possible outcomes from both the actual and counterfactual counts. We compute the expected value across all of them. The same formula handles both cases seamlessly.

Outcome Application

When computing the WPA of each possible outcome, we model how the game state changes using real data:

  • Walks/HBP: Deterministic — batter to first, forced runners advance, bases-loaded walk scores a run.
  • Strikeouts: Deterministic. Add one out; runners hold. On the third out, the half-inning ends.
  • Hits: Probabilistic — runner advancement probabilities from Statcast (e.g., runner on 2nd scores on a single 63% of the time).
  • Outs (non-K): Weighted by out type (groundout, flyout, lineout, popup) from Statcast, then by advancement probability. Double play rates computed separately by runner/out configuration.
  • Home runs: Deterministic — all runners score plus batter.
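
The deterministic cases in the list above are simple enough to sketch directly. The base representation (a tuple of three booleans) and the function names are assumptions for illustration; the probabilistic cases (hits and non-strikeout outs) are not covered here.

```python
def apply_walk(bases):
    """Walk or HBP: batter to first, only forced runners advance.

    bases is (first, second, third) as booleans. Returns (new_bases, runs).
    """
    first, second, third = bases
    if not first:
        return (True, second, third), 0   # nobody is forced
    if not second:
        return (True, True, third), 0     # runner on first forced to second
    if not third:
        return (True, True, True), 0      # runners on first and second forced
    return (True, True, True), 1          # bases loaded: runner on third scores

def apply_strikeout(outs, bases):
    """Strikeout: add one out, runners hold. Returns (outs, bases, half_over)."""
    outs += 1
    if outs >= 3:
        return 0, (False, False, False), True
    return outs, bases, False

def apply_home_run(bases):
    """Home run: every runner scores, plus the batter."""
    return (False, False, False), sum(bases) + 1
```

For example, a walk with runners on first and third loads the bases without scoring, because the runner on third is not forced.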

Ump Audit

We built our own umpire analysis from scratch using Statcast pitch data and the counterfactual framework described above. We report impact in win probability, not expected runs.

Why Win Probability Instead of Runs?

Most umpire grading sites (including @UmpScorecards) report favor in expected runs using RE24 tables. Our approach differs:

              Expected Runs (others)                            Win Probability (us)
Output        “+0.42 runs favored NYY”                          “+2.3% win probability toward NYY”
Context       Depends on runners and outs, not inning or score  Accounts for inning, score, runners, outs
Intuition     What does “0.3 runs” feel like?                   “Cost them 3% chance to win” is immediate
Uncertainty   Single number                                     Range with confidence interval

A missed call in the 9th inning of a tie game matters more than the same call in the 3rd of a blowout. Our model captures that; run-based models cannot.

Strike Zone Definition

We use the MLB rulebook strike zone: 17 inches wide (the width of home plate), with the top and bottom defined per batter using Statcast's sz_top and sz_bot values for each pitch. The zone is a rectangle. That's what the rulebook says.

Borderline Margin

Statcast's Hawk-Eye system tracks the center of the baseball to approximately ±0.25 inches. We apply a 0.5-inch borderline margin (twice the tracking precision) on all edges of the zone. Pitches whose center lands within this margin of the zone edge are classified as “borderline” and not counted as missed calls in either direction.

For context, a regulation baseball has a diameter of approximately 2.9 inches (1.45-inch radius). A pitch whose center is 1.45 inches outside the zone edge could still have part of the ball clipping the zone. Our 0.5-inch margin is deliberately narrow: it accounts only for measurement uncertainty, not for ball width. The margin was validated against 250 games from the 2024 season: at 0.5 inches, roughly 4.6% of called pitches fall in the borderline band.

Why Our Accuracy Differs From Other Sites

Our umpire accuracy numbers run approximately 2 percentage points lower than those on sites like @UmpScorecards. This is a deliberate methodological choice, not an error.

Other sites train a probabilistic zone model on how pitches are actually called league-wide, then grade individual umpires against that model. The result is a rounder, softer zone — because umpires historically don't call the corners, the model “learns” that corner pitches aren't strikes. A pitch on the low-outside corner that gets called a ball isn't flagged as a miss, because no umpire calls that pitch.

The problem with this approach is circularity: you are grading umpires against umpire behavior. If every umpire ignores the low-outside corner, that pitch effectively becomes a ball in the model, and no one is ever held accountable for missing it. The rulebook gets rewritten by consensus.

We use the actual rectangle. If a pitch is in the rulebook zone and gets called a ball, that's a missed call — even if most umpires miss it too. This makes our numbers stricter, but it means we're measuring against the rules of the game, not against the average of how the rules are enforced.

Missed Call Detection

A pitch is flagged as a missed call when:

  • It falls clearly inside the zone (beyond the borderline margin) but was called a ball, or
  • It falls clearly outside the zone (beyond the borderline margin) but was called a strike

Each missed call is assigned a WPA impact using the counterfactual framework. The total favor is expressed from the home team's perspective (+/−), and leverage labels (HIGH, VERY HIGH) indicate how much was at stake when the call was made.
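
A minimal sketch of the zone test and missed-call rule described above. It assumes Statcast's plate_x, plate_z, sz_top, and sz_bot are all in feet, that called pitches carry the labels "ball" and "called_strike", and that the margin is applied per axis (so corners are treated as a slightly larger or smaller rectangle). The function names are hypothetical.

```python
HALF_PLATE_FT = 17.0 / 2 / 12   # half of the 17-inch plate width, in feet
MARGIN_FT = 0.5 / 12            # 0.5-inch borderline margin, in feet

def classify_location(plate_x, plate_z, sz_top, sz_bot, margin=MARGIN_FT):
    """Return 'inside', 'outside', or 'borderline' for the ball's center."""
    # Depth inside the zone along each axis; negative means outside.
    dx = HALF_PLATE_FT - abs(plate_x)
    dz = min(plate_z - sz_bot, sz_top - plate_z)
    depth = min(dx, dz)
    if depth > margin:
        return "inside"
    if depth < -margin:
        return "outside"
    return "borderline"

def is_missed_call(call, plate_x, plate_z, sz_top, sz_bot):
    """True when a clearly-inside pitch is called a ball, or a
    clearly-outside pitch is called a strike. Borderline pitches never count."""
    where = classify_location(plate_x, plate_z, sz_top, sz_bot)
    return ((where == "inside" and call == "ball")
            or (where == "outside" and call == "called_strike"))
```

A pitch whose center is a quarter inch off the plate edge falls in the borderline band under these assumptions, so neither call on it would be flagged.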

Leverage Index

Leverage Index (LI) measures the importance or pressure of a game situation. It answers the question: how much does this moment matter?

LI = E[|WPA|] in this state ÷ average E[|WPA|] across all states

Where the expected absolute WPA is the probability-weighted average of how much win probability could swing on the next play:

E[|WPA|] = Σ P(outcome) × |WPA(outcome)|

We compute this from real outcome probabilities and real win expectancy swings for every game state, using the 0-0 count as the neutral baseline (leverage is about the situation, not the current count).
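
The ratio can be sketched as follows. The state labels and swing values are toy numbers, and the baseline here is a plain unweighted mean unless weights are supplied; whether the site weights states by how often they occur is not stated, so treat that choice as an assumption.

```python
def expected_abs_wpa(outcome_probs, outcome_wpa):
    """E[|WPA|] = sum of P(outcome) * |WPA(outcome)|."""
    return sum(p * abs(outcome_wpa[o]) for o, p in outcome_probs.items())

def leverage_index(state, swing_by_state, weights=None):
    """LI = E[|WPA|] in this state divided by the average across all states.

    swing_by_state maps each state to its E[|WPA|]. weights, if given,
    maps each state to a relative frequency used for the average.
    """
    if weights is None:
        baseline = sum(swing_by_state.values()) / len(swing_by_state)
    else:
        total = sum(weights[s] for s in swing_by_state)
        baseline = sum(swing_by_state[s] * weights[s] for s in swing_by_state) / total
    return swing_by_state[state] / baseline

# Toy expected swings for three hypothetical states.
swings = {"tie_9th": 0.09, "mid_game": 0.02, "blowout_3rd": 0.01}
li_tie = leverage_index("tie_9th", swings)
li_blowout = leverage_index("blowout_3rd", swings)
```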

Label       LI Range   Meaning
VERY HIGH   ≥ 3.0      Every pitch is a pressure cooker
HIGH        ≥ 2.0      Close game, meaningful at-bat
Average     ~1.0       Typical importance (by definition)
LOW         < 0.5      Blowout or low-stakes early innings

Leverage labels appear on missed calls in the ump audit to communicate which calls mattered most — a missed call at LI 5.0 is far more consequential than one at LI 0.3.
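
One way the label thresholds could be applied is sketched below. The table does not say how values between 0.5 and 2.0 are labeled, so returning "Average" for that whole band is an assumption of this sketch, as is the function name.

```python
def leverage_label(li: float) -> str:
    """Map a Leverage Index value to a display label."""
    if li >= 3.0:
        return "VERY HIGH"
    if li >= 2.0:
        return "HIGH"
    if li < 0.5:
        return "LOW"
    return "Average"  # assumed label for the 0.5 to 2.0 band
```

Checking the highest threshold first keeps the overlapping "≥" ranges unambiguous: an LI of 5.0 is VERY HIGH, not HIGH.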

Statcast Integration

Pitch-level data comes from Baseball Savant CSV exports. For each pitch, we have velocity, spin rate, movement, location, pitch type, and the batter's strike zone boundaries. For plate appearances, we have exit velocity, launch angle, expected batting average (xBA), and expected weighted on-base average (xwOBA).
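
As a rough sketch of how pitch rows might be turned into counted outcome probabilities by ball-strike count, the snippet below groups pitches into plate appearances and tallies the final outcome against every count the plate appearance passed through. It assumes the Statcast columns game_pk, at_bat_number, balls, strikes, and events, and it treats any non-empty events value as the plate appearance's outcome, which is a simplification. The function name is hypothetical.

```python
from collections import Counter, defaultdict

def outcome_probs_by_count(pitch_rows):
    """Return {(balls, strikes): {event: probability}} from pitch-level rows."""
    counts_seen = defaultdict(set)  # plate appearance -> counts it passed through
    final_event = {}                # plate appearance -> event that ended it
    for row in pitch_rows:
        pa = (row["game_pk"], row["at_bat_number"])
        counts_seen[pa].add((int(row["balls"]), int(row["strikes"])))
        if row.get("events"):
            final_event[pa] = row["events"]

    tallies = defaultdict(Counter)
    for pa, counts in counts_seen.items():
        if pa in final_event:       # skip plate appearances with no outcome
            for count in counts:
                tallies[count][final_event[pa]] += 1

    return {count: {ev: n / sum(c.values()) for ev, n in c.items()}
            for count, c in tallies.items()}

# Three hypothetical pitch rows covering two plate appearances.
rows = [
    {"game_pk": 1, "at_bat_number": 1, "balls": 0, "strikes": 0, "events": ""},
    {"game_pk": 1, "at_bat_number": 1, "balls": 1, "strikes": 0, "events": "single"},
    {"game_pk": 1, "at_bat_number": 2, "balls": 0, "strikes": 0, "events": "field_out"},
]
probs = outcome_probs_by_count(rows)
```

Rows read with csv.DictReader arrive as strings, which is why balls and strikes are passed through int().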

This data is available for all MLB and most AAA games. Availability decreases at lower minor league levels — the game page gracefully degrades when Statcast data is unavailable, showing traditional stats only.

Data Pipeline

The ump audit and leverage calculations rely on pre-computed lookup tables rather than runtime computation. This is what makes real-time analysis possible — when a game is live, we look up impact values instantly rather than running simulations.

Pipeline Architecture

Raw data from two sources flows through a multi-stage aggregation pipeline to produce the final lookup tables:

Greg Stoll (1903-2024)        Statcast (2021-2024)
  Retrosheet play-by-play        ~10 million pitches
  15M+ historical games          ~2.5M plate appearances
         │                              │
         ▼                              ▼
  Win Expectancy Table         Outcome Probabilities
  (16,094 game states)           (12 count states)
         │                              │
         │                    ┌─────────┼──────────┐
         │                    ▼         ▼          ▼
         │              Advancement  Out Type    Double Play
         │              Probs       Distrib.    Probs
         │                    │         │          │
         └────────┬───────────┴─────────┘          │
                  ▼                                │
         Expected WPA Table                        │
         (16K states × 12 counts)                  │
                  │                                │
         ┌───────┴────────┐                        │
         ▼                ▼                        ▼
  Call Impact Table   Leverage Index         (used in
  (missed call WPA    (situation              Expected WPA
   by count & state)   importance)            calculation)

What Each Table Contains

  • Win Expectancy — Historical win probability for each of 16,094 game states, derived from 120+ years of MLB outcomes.
  • Outcome Probabilities — P(walk), P(strikeout), P(single), P(double), P(triple), P(home run), P(out), P(HBP) for each of the 12 ball-strike counts. Derived from counting actual outcomes in Statcast.
  • Runner Advancement — P(runner destination | hit type, runner origin). For example, a runner on 2nd scores on a single 63% of the time.
  • Expected WPA — The probability-weighted expected WPA for each count in each game state. This is the key table used in counterfactual analysis.
  • Call Impact — The final product: pre-computed WPA impact for each type of missed call, at each count, in each game state. Enables instant lookups during live games.
  • Leverage Index — Situation importance for all 16,094 game states, computed from real outcome probabilities and win expectancy swings.
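
To illustrate how a Call Impact table could be derived from an Expected WPA table, here is a sketch for a single game state. The dictionary layout, the function names, and the toy values are assumptions; the logic follows the counterfactual formula, with walks and strikeouts as the terminal cases.

```python
def ball_called_strike_impact(balls, strikes, expected_wpa, wpa_walk, wpa_strikeout):
    """Impact of a true ball being called a strike at (balls, strikes).

    expected_wpa maps (balls, strikes) to E[WPA | count] for one game state.
    wpa_walk and wpa_strikeout are the terminal outcomes' WPA in that state.
    """
    # Correct call adds a ball (a walk if that makes four).
    counterfactual = wpa_walk if balls == 3 else expected_wpa[(balls + 1, strikes)]
    # Actual call adds a strike (a strikeout if that makes three).
    actual = wpa_strikeout if strikes == 2 else expected_wpa[(balls, strikes + 1)]
    return counterfactual - actual

def build_call_impact(expected_wpa, wpa_walk, wpa_strikeout):
    """Pre-compute both miss types for every count in one game state."""
    table = {}
    for (b, s) in expected_wpa:
        miss = ball_called_strike_impact(b, s, expected_wpa, wpa_walk, wpa_strikeout)
        table[(b, s, "ball_called_strike")] = miss
        # The opposite miss swaps actual and counterfactual, so the sign flips.
        table[(b, s, "strike_called_ball")] = -miss
    return table

# Toy expected WPA for the 12 counts (illustrative values only).
toy_expected = {(b, s): 0.01 * b - 0.01 * s for b in range(4) for s in range(3)}
impact = build_call_impact(toy_expected, wpa_walk=0.05, wpa_strikeout=-0.04)
```

With a table like this built ahead of time for every game state, grading a live pitch reduces to a dictionary lookup.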

Data Sources

All data is ethically sourced, publicly available, and fully traceable.

SourceWhat We Use It ForAccess
MLB Stats APISchedules, live feed, boxscores, play-by-play, player infoPublic API
Baseball SavantStatcast pitch-tracking, exit velocity, xBA, xwOBAPublic CSV
RetrosheetHistorical play-by-play (via Greg Stoll's tables)Free with attribution
Greg StollWin expectancy tables (1903–2024)Open source

Data is fetched live with a 30-second cache. Occasional gaps or delays may occur during active games.
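
A 30-second cache of this kind can be sketched as a small time-to-live wrapper around a fetch function. The class name, the injectable clock, and the get_or_fetch method are hypothetical; the site's actual caching layer is not described here.

```python
import time

class TTLCache:
    """Cache values for a fixed number of seconds, then refetch."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expiry time, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch() if it has expired."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = fetch()
        self._store[key] = (now + self._ttl, value)
        return value
```

Passing the clock in as a parameter makes the expiry behavior easy to check without waiting 30 real seconds.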

What we don't use: We do not scrape FanGraphs, Baseball-Reference, UmpScorecards, or any other analytics site. Every metric on this site is computed from the raw sources listed above.

Limitations

We are honest about what our models can and cannot do.

  • Counterfactual unknowability. When we estimate what “would have happened” with a different count, we use historical averages. The specific batter/pitcher matchup may differ from average. We quantify this uncertainty but cannot eliminate it.
  • Pitch sequence changes. A different count leads to different pitch selection. If the count were 3-0 instead of 2-1, the pitcher throws a different pitch. We cannot model that — nobody can.
  • Average behavior, not matchup-specific. Our outcome probabilities are league-wide averages. Juan Soto at 3-0 is different from a rookie at 3-0. We use the average.
  • Measurement precision. Hawk-Eye tracks pitch location to approximately ±0.25 inches. Our 0.5-inch borderline margin accounts for this, but does not model the full ball diameter (2.9 inches). Some pitches we flag as missed calls may have had part of the ball clipping the zone edge.
  • Historical data applicability. Outcome probabilities come from 2021–2024. Win expectancy tables span 1903–2024. Future baseball may differ from historical patterns.

References