March 6, 2026 · 12 min read

We Built Our Own Umpire Scorecards From Scratch. Here's How They Work.

Every pitch tracked. Every missed call quantified in win probability. Here's how we built an umpire analysis engine from raw Statcast data — and why we measure impact differently than anyone else.

Tags: methodology, umpires, launch

Every ball and strike call an umpire makes changes a game. Sometimes by a little. Sometimes by a lot. And until Statcast came along, we had no reliable way to measure which was which.

We built our own umpire analysis from scratch — not a recreation of what others have done, but a different model with different priorities. We measure impact in win probability, not just expected runs. We grade against the rulebook strike zone, not a model trained on how umpires historically call the game. And we quantify the importance of the moment, not just the call itself.

This is how we built it, why we made the choices we did, and what the numbers actually mean.

The Raw Material: Statcast Pitch Tracking

Everything starts with Baseball Savant, the public data arm of Statcast. For every pitch thrown in an MLB game, Savant publishes a row of data — the pitch type, velocity, spin, movement, and most importantly for our purposes: where the ball crossed the plate.

Specifically, we use four fields:

plate_x — horizontal location as the ball crosses home plate, in feet from center
plate_z — vertical location, in feet above the ground
sz_top and sz_bot — the individual batter's strike zone boundaries, measured by Hawk-Eye's optical tracking system on each pitch

Hawk-Eye tracks pitch location to roughly plus-or-minus 0.25 inches. That precision matters, and we'll come back to it.

We also use the description field, which tells us how the pitch was called: called_strike, ball, and so on. Our analysis uses only called pitches — swings, foul balls, and hit batters are excluded entirely. We're grading the umpire's judgment, not the batter's decisions.
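Here's a minimal sketch of that filter, assuming each pitch arrives as a dictionary keyed by Statcast column names. The `CALLED_DESCRIPTIONS` set and the `called_pitches` helper are illustrative names, the set lists only the two description values named above, and dropping rows with missing location data is our assumption for the sketch.

```python
# Description values treated as umpire calls. Only the two values named in
# the text are listed here; any further values would be an assumption.
CALLED_DESCRIPTIONS = {"called_strike", "ball"}

REQUIRED_FIELDS = ("plate_x", "plate_z", "sz_top", "sz_bot")


def called_pitches(rows):
    """Keep pitches the umpire had to judge and that carry usable location data."""
    kept = []
    for row in rows:
        if row.get("description") not in CALLED_DESCRIPTIONS:
            continue  # swings, fouls, hit batters, etc. are not umpire calls
        if any(row.get(field) is None for field in REQUIRED_FIELDS):
            continue  # assumption: a pitch without location or zone data is skipped
        kept.append(row)
    return kept


sample = [
    {"description": "called_strike", "plate_x": 0.1, "plate_z": 2.5, "sz_top": 3.4, "sz_bot": 1.6},
    {"description": "swinging_strike", "plate_x": 0.9, "plate_z": 1.2, "sz_top": 3.4, "sz_bot": 1.6},
    {"description": "ball", "plate_x": -1.3, "plate_z": 2.0, "sz_top": 3.4, "sz_bot": 1.6},
    {"description": "ball", "plate_x": None, "plate_z": 2.0, "sz_top": 3.4, "sz_bot": 1.6},
]
print(len(called_pitches(sample)))  # the swing and the row with no plate_x are dropped
```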

Drawing the Zone

The rulebook strike zone is a rectangle: 17 inches wide (the width of home plate), with vertical boundaries defined by each batter's stance. In practice, Statcast's sz_top and sz_bot values turn those words into measurements for every individual pitch.

Our zone is that rectangle. It's not a probabilistic model. It's not trained on historical call data. It's the dimensions the rulebook specifies.

This is a deliberate choice that separates us from most umpire grading systems. Some approaches train a probabilistic zone model on how pitches are actually called league-wide — a pitch near the corner gets assigned some probability of being a strike based on how often umpires historically call it one. The result is a rounder, softer zone that reflects consensus behavior rather than the rulebook.

The problem with that approach: you end up grading umpires against umpire behavior. If the entire league ignores the low-outside corner, that pitch effectively becomes a ball in the model, and no individual umpire is held accountable for missing it. The rulebook gets rewritten by consensus.

We use the rectangle. If a pitch lands in the rulebook zone and gets called a ball, that's a missed call — even if most umpires miss it too.

The Borderline Margin

Hawk-Eye measures pitch location to about plus-or-minus 0.25 inches. A pitch whose center sits exactly on the edge of the zone could be a quarter inch inside or outside — the measurement can't tell us. Calling that a "missed call" would be unfair to the umpire. The uncertainty in the tracking exceeds the distance from the line.

We apply a 0.5-inch borderline margin on all four edges of the zone — twice the tracking precision. Pitches within this band are classified as borderline and excluded from the accuracy calculation entirely. They don't count as correct calls or missed calls.

We validated this margin against a 250-game sample from the 2024 season. At 0.5 inches, roughly 4.6% of called pitches land in the borderline band. That felt right: wide enough to account for measurement uncertainty, narrow enough that we're not excusing clear misses.

One thing we're explicitly not doing: adding the ball's radius (about 1.45 inches) to account for whether the ball was "clipping the zone." The Statcast data gives us the center of the ball, and that's what we evaluate. A pitch whose center is two inches outside the plate was outside.
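The rectangle and the margin can be sketched together in a few lines. This is an illustration under stated assumptions, not our production code: the function and constant names are made up for the example, and the band is taken to extend 0.5 inches to either side of each edge.

```python
PLATE_HALF_WIDTH_FT = (17.0 / 2.0) / 12.0  # half of the 17-inch plate, in feet
BORDERLINE_MARGIN_FT = 0.5 / 12.0          # 0.5-inch margin, in feet


def classify_location(plate_x, plate_z, sz_top, sz_bot, margin=BORDERLINE_MARGIN_FT):
    """Return 'inside', 'outside', or 'borderline' for the center of the ball."""
    half = PLATE_HALF_WIDTH_FT
    # Clearly inside: within the rectangle shrunk by the margin on all four edges.
    if abs(plate_x) < half - margin and sz_bot + margin < plate_z < sz_top - margin:
        return "inside"
    # Clearly outside: beyond the rectangle grown by the margin on any edge.
    if abs(plate_x) > half + margin or plate_z < sz_bot - margin or plate_z > sz_top + margin:
        return "outside"
    # Everything else sits within the margin of an edge.
    return "borderline"


print(classify_location(0.0, 2.5, sz_top=3.4, sz_bot=1.6))    # middle of the zone
print(classify_location(0.72, 2.5, sz_top=3.4, sz_bot=1.6))   # just off the plate edge
print(classify_location(0.875, 2.5, sz_top=3.4, sz_bot=1.6))  # center two inches off the plate
```

The three sample pitches come back as inside, borderline, and outside. Note that the ball's radius never appears: only the center is tested.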

Counting Accuracy

Once we've classified each called pitch as clearly inside, clearly outside, or borderline, accuracy is straightforward:

Accuracy = correct calls / total non-borderline called pitches

A correct call is one where the umpire's decision matched the zone classification. A pitch clearly inside the zone called a strike: correct. A pitch clearly outside called a ball: correct. Either of those reversed: missed call.

We further classify each missed call as one of two types:

Ball called strike (BCS) — the pitch was outside the zone but called a strike
Strike called ball (SCB) — the pitch was inside the zone but called a ball

This directional breakdown matters. A BCS expands the zone (favoring the pitcher). An SCB shrinks it (favoring the batter). The same umpire can have very different tendencies in each direction, and we track both.
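As a sketch, the tally looks something like this. It assumes each called pitch has already been reduced to a location label and the umpire's call; `grade_calls` and the label strings are illustrative names.

```python
from collections import Counter


def grade_calls(pitches):
    """Tally calls and compute accuracy.

    pitches: iterable of (location, call) pairs, where location is
    'inside', 'outside', or 'borderline' and call is 'called_strike' or 'ball'.
    """
    tally = Counter()
    for location, call in pitches:
        if location == "borderline":
            tally["borderline"] += 1  # excluded from the accuracy denominator
        elif location == "inside":
            tally["correct" if call == "called_strike" else "SCB"] += 1
        else:  # clearly outside the zone
            tally["correct" if call == "ball" else "BCS"] += 1
    graded = tally["correct"] + tally["BCS"] + tally["SCB"]
    accuracy = tally["correct"] / graded if graded else None
    return accuracy, tally


pitches = (
    [("inside", "called_strike")] * 5
    + [("outside", "ball")] * 3
    + [("outside", "called_strike")]  # one BCS
    + [("inside", "ball")]            # one SCB
    + [("borderline", "ball")] * 2    # ignored by the accuracy figure
)
accuracy, tally = grade_calls(pitches)
print(accuracy, tally["BCS"], tally["SCB"])
```

With 8 correct calls out of 10 graded pitches, the sample scores 0.8, and the two borderline pitches change nothing.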

Measuring Impact: Why Win Probability

Most umpire analysis measures the impact of missed calls in expected runs — using a framework called RE24 (Run Expectancy across 24 base-out states). A missed call costs a team some fractional number of expected runs based on the count and situation.

We measure impact in win probability. Here's why the difference matters:

A missed strikeout call in the bottom of the 9th inning of a tie game is worth far more than the same missed call in the 3rd inning of a 6-run blowout. Run expectancy doesn't distinguish between those situations. Win probability does.

The specific metric we report is WPA impact — the change in the home team's win probability caused by each missed call. Positive means the call favored the home team; negative means it favored the visitors.

How We Calculate the Counterfactual

The technical challenge: figuring out what win probability would have been with the correct call. The correct call changes the count, and a different count leads to different probabilities of every possible outcome.

Here's the framework:

For any missed call, we know the actual count (what the ump called), the counterfactual count (what the correct call would have produced), and the full game state — inning, score, runners, outs.

We compute the expected WPA at each count:

E[WPA | count] = sum of P(outcome | count) × WPA(outcome | game state)

The outcome probabilities — how often each count leads to a walk, strikeout, single, double, etc. — come from counting actual plate appearance outcomes across roughly 3.6 million Statcast pitches from 2021 through 2025.

The WPA for each possible outcome comes from our win expectancy tables, which cover 16,094 unique game states (combinations of inning, half, outs, runners, and score differential).

The impact of the missed call is the difference:

Impact = E[WPA | actual count] − E[WPA | correct count]
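A toy version of the calculation, with made-up numbers. Everything here is hypothetical: the four-outcome breakdown, the probabilities, and the per-outcome WPA values are placeholders, not entries from our tables. The impact is signed so that positive favors the home team, matching the WPA impact definition above; flip the subtraction if you prefer the opposite convention.

```python
def expected_wpa(outcome_probs, outcome_wpa):
    """E[WPA | count]: sum over outcomes of P(outcome | count) * WPA(outcome | state)."""
    return sum(p * outcome_wpa[outcome] for outcome, p in outcome_probs.items())


def missed_call_impact(probs_actual, probs_correct, outcome_wpa):
    """Home win probability gained (+) or lost (-) because of the missed call."""
    return expected_wpa(probs_actual, outcome_wpa) - expected_wpa(probs_correct, outcome_wpa)


# Hypothetical home-team WPA for each outcome in one game state (home team batting).
outcome_wpa = {"walk": 0.04, "strikeout": -0.03, "hit": 0.08, "other_out": -0.03}

# Hypothetical outcome probabilities. A ball called a strike on 2-1 leaves the
# count at 2-2 (actual) when it should have been 3-1 (correct).
probs_2_2 = {"walk": 0.08, "strikeout": 0.42, "hit": 0.18, "other_out": 0.32}
probs_3_1 = {"walk": 0.40, "strikeout": 0.12, "hit": 0.20, "other_out": 0.28}

impact = missed_call_impact(probs_2_2, probs_3_1, outcome_wpa)
print(round(impact, 4))
```

In this made-up spot the call costs the home team roughly 2.5 points of win probability: the expected WPA is about +0.020 at 3-1 and about −0.005 at 2-2.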

Terminal Calls

Some missed calls end the at-bat outright. A ball called strike on a 0-2 count produces a strikeout. A strike called ball on a 3-0 count produces a walk. These are terminal calls, and they're handled as a special case.

When a missed call causes a walk or strikeout that shouldn't have happened, the impact is the full WPA difference between those outcomes in that specific game state. No probability averaging needed — the outcome is determined.

Terminal calls are almost always the most consequential missed calls in a game. A strikeout-that-should-have-been-a-walk in a high-leverage spot can swing win probability by 15 to 25 percentage points.
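Working out the two branches of a missed call is mechanical. Here's a hedged sketch, with illustrative function names, that takes the count before the pitch and returns what the call produced alongside what the correct call would have produced.

```python
def resolve_missed_call(balls, strikes, miss_type):
    """Return (actual, correct) results for a missed call on a pre-pitch count.

    miss_type is 'BCS' (ball called strike) or 'SCB' (strike called ball).
    Each result is a (balls, strikes) tuple, or 'walk' / 'strikeout' when
    the pitch ends the plate appearance.
    """
    def after_ball(b, s):
        return "walk" if b == 3 else (b + 1, s)

    def after_strike(b, s):
        return "strikeout" if s == 2 else (b, s + 1)

    if miss_type == "BCS":
        return after_strike(balls, strikes), after_ball(balls, strikes)
    if miss_type == "SCB":
        return after_ball(balls, strikes), after_strike(balls, strikes)
    raise ValueError(f"unknown miss type: {miss_type!r}")


def is_terminal(actual):
    """True when the call as made ended the plate appearance outright."""
    return actual in ("walk", "strikeout")


for count, miss in [((0, 2), "BCS"), ((3, 0), "SCB"), ((2, 1), "BCS"), ((3, 2), "BCS")]:
    actual, correct = resolve_missed_call(*count, miss)
    print(count, miss, actual, correct, is_terminal(actual))
```

The last case is the strikeout-that-should-have-been-a-walk: on 3-2, both branches end the plate appearance, so no averaging over outcomes is needed on either side.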

Leverage Index: When It Matters

Not every moment is equally important. A missed call in the 9th inning of a tie game matters more than the same call in the 3rd inning of a blowout. We quantify this with Leverage Index (LI).

LI measures how much win probability could swing on the next play, relative to the average across all game states:

LI = expected |WPA| in this state / average expected |WPA| across all states

An LI of 1.0 is average. An LI of 3.0 means whatever happens next will swing win probability three times more than a typical play. An LI of 0.2 means the game is essentially decided.

We label each missed call:

VERY HIGH — LI of 3.0 or above
HIGH — LI of 2.0 to 3.0
MEDIUM — LI of 1.0 to 2.0
LOW — LI below 1.0

When we report that an umpire had 4 high-leverage missed calls, that means 4 calls went wrong in spots where the game was hanging in the balance. That's different from 4 missed calls in garbage time, and our numbers reflect the distinction.
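The ratio and the labels translate directly into code. In this sketch the function names are illustrative, and treating each band's lower bound as inclusive is our assumption about how ties at exactly 1.0, 2.0, and 3.0 resolve.

```python
def leverage_index(expected_abs_wpa, average_expected_abs_wpa):
    """LI = expected |WPA| in this state / average expected |WPA| across all states."""
    return expected_abs_wpa / average_expected_abs_wpa


def leverage_label(li):
    """Map a Leverage Index to its label (lower bound of each band inclusive)."""
    if li >= 3.0:
        return "VERY HIGH"
    if li >= 2.0:
        return "HIGH"
    if li >= 1.0:
        return "MEDIUM"
    return "LOW"


# Hypothetical inputs: this state swings 0.05 of win probability on average,
# against 0.03 for a typical state.
li = leverage_index(0.05, 0.03)
print(round(li, 2), leverage_label(li))
```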

Handedness Splits

Some umpires call a systematically different zone for left-handed batters versus right-handed batters. We track accuracy separately for each and report both on every umpire's profile page.

These splits reveal tendencies that aggregate accuracy hides. An umpire might call 93% accurately overall but break down to 95% versus right-handed batters and 90% versus left-handed batters. That gap is wide enough to affect pitch selection and at-bat outcomes, and it's consistent enough in some umpires to be a real tendency, not noise.

The LHB/RHB accuracy gap is one of the cleanest signals in our dataset. Across five seasons, the league-wide pattern is clear: left-handed batters face a slightly wider called zone on the outside edge. Individual umpires vary significantly in how pronounced this gap is.
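A split like this is just the accuracy calculation grouped by batter side. The sketch below assumes each graded pitch carries the batter's side as 'L' or 'R' (Statcast exposes this as the `stand` column, which isn't among the fields listed earlier); `accuracy_by_stand` is an illustrative name.

```python
from collections import defaultdict


def accuracy_by_stand(graded):
    """graded: iterable of (stand, is_correct) pairs for non-borderline called pitches."""
    totals = defaultdict(lambda: [0, 0])  # stand -> [correct calls, graded pitches]
    for stand, is_correct in graded:
        totals[stand][0] += int(is_correct)
        totals[stand][1] += 1
    return {stand: correct / n for stand, (correct, n) in totals.items()}


# Hypothetical sample: 95 of 100 correct versus righties, 90 of 100 versus lefties.
graded = (
    [("R", True)] * 95 + [("R", False)] * 5
    + [("L", True)] * 90 + [("L", False)] * 10
)
print(accuracy_by_stand(graded))
```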

Era-Relative Grading

We grade umpires on a curve — specifically, relative to the season they worked in. Letter grades (A+ through F) are assigned based on how many standard deviations each umpire's accuracy falls above or below the season mean.

This matters because the league-wide accuracy baseline shifts year to year. The pitch clock era (2023 onward) changed the rhythm of at-bats, and accuracy distributions shifted with it. A 91.5% in 2021 means something different than 91.5% in 2024. Grading within a season makes the comparison fair.

The scale:

A+: 1.5 standard deviations above the mean or higher
A: +1.0σ to +1.5σ
B+: +0.5σ to +1.0σ
B: roughly average (−0.25σ to +0.5σ)
C+: −0.75σ to −0.25σ
C: −1.25σ to −0.75σ
D: −1.75σ to −1.25σ
F: below −1.75σ

By construction, the B range sits around the mean and is the most crowded band in any given season: if accuracy is roughly normally distributed, about 29% of umpires land in B alone and about 44% in B and B+ combined. That's not grade inflation. It's what a mean-centered system looks like. The grades at the tails are where the story is: the A+ umpires who are genuinely elite, and the F umpires who are consistently below their peers.
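The scale above reduces to a z-score and a lookup. This is a sketch under stated assumptions rather than our exact implementation: it uses an unweighted mean and the population standard deviation across the umpires passed in, treats each band's lower bound as inclusive, and the names are illustrative.

```python
from statistics import mean, pstdev

# (lower bound in standard deviations, grade), checked from the top down.
GRADE_FLOORS = [
    (1.5, "A+"), (1.0, "A"), (0.5, "B+"), (-0.25, "B"),
    (-0.75, "C+"), (-1.25, "C"), (-1.75, "D"),
]


def letter_grade(z):
    """Map a z-score to a letter grade; anything below -1.75 is an F."""
    for floor, grade in GRADE_FLOORS:
        if z >= floor:
            return grade
    return "F"


def grade_season(accuracy_by_umpire):
    """Grade every umpire relative to the mean and spread of one season."""
    values = list(accuracy_by_umpire.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {name: letter_grade(0.0) for name in accuracy_by_umpire}
    return {name: letter_grade((acc - mu) / sigma) for name, acc in accuracy_by_umpire.items()}


# Hypothetical three-umpire season.
print(grade_season({"Umpire 1": 0.90, "Umpire 2": 0.92, "Umpire 3": 0.94}))
```

In the toy season the z-scores are about −1.22, 0, and +1.22, which map to C, B, and A.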

The Pre-Computed Lookup Tables

Running this calculation live for every pitch would be slow. So we pre-compute the heavy lifting.

The pipeline works in stages. First, we compute outcome probabilities for each of the 12 ball-strike counts from Statcast data. Then, for each of 16,094 game states, we compute the expected WPA at each count. The final tables — about 23MB of JSON — cover every combination of call type, count, and game state you'll encounter in a real game.

During a live game, calculating the impact of a missed call is a pair of lookups and a subtraction. The entire scorecard for a game computes in milliseconds.

We rebuild these tables once a year in the offseason, incorporating the latest season's data. The current tables span 2021-2025, roughly 3.6 million pitches.
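To make the pre-compute idea concrete, here's a small sketch. The key format, the state label, the four-outcome breakdown, and all of the numbers are hypothetical and don't reflect the layout of our real tables. The point is the shape: do the probability-weighted sums once, store them, and reduce the live calculation to two lookups and a subtraction.

```python
import json


def build_table(outcome_probs_by_count, outcome_wpa_by_state):
    """Pre-compute E[WPA | count] for every (game state, count) pair."""
    table = {}
    for state_key, outcome_wpa in outcome_wpa_by_state.items():
        for count, probs in outcome_probs_by_count.items():
            table[f"{state_key}:{count}"] = sum(p * outcome_wpa[o] for o, p in probs.items())
    return table


def lookup_impact(table, state_key, actual_count, correct_count):
    """Two lookups and a subtraction; positive favors the home team."""
    return table[f"{state_key}:{actual_count}"] - table[f"{state_key}:{correct_count}"]


# Hypothetical inputs for a single game state and two counts.
outcome_probs_by_count = {
    "2-2": {"walk": 0.08, "strikeout": 0.42, "hit": 0.18, "other_out": 0.32},
    "3-1": {"walk": 0.40, "strikeout": 0.12, "hit": 0.20, "other_out": 0.28},
}
outcome_wpa_by_state = {
    "bot9_2out_tie": {"walk": 0.04, "strikeout": -0.03, "hit": 0.08, "other_out": -0.03},
}

# Built once in the offseason, written to JSON, and loaded at game time.
table = json.loads(json.dumps(build_table(outcome_probs_by_count, outcome_wpa_by_state)))
print(round(lookup_impact(table, "bot9_2out_tie", "2-2", "3-1"), 4))
```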

What We Don't Claim

We're rigorous about what these numbers can and can't tell you.

We can't know what actually would have happened. Our impact estimates use historical averages across millions of plate appearances. The specific batter, pitcher, and pitch sequence would have been different with the correct count. The counterfactual is unknowable — we estimate it.

Pitch selection would have changed. If the count were 3-1 instead of 2-2, the pitcher throws a different pitch. Our model uses league-average outcome probabilities for each count, not matchup-specific ones. The actual impact in a specific at-bat may differ from our estimate.

Tracking isn't perfect. Hawk-Eye is excellent, but not infallible. Some pitches we classify as missed calls may have had part of the ball touching the zone edge. Our borderline margin accounts for this conservatively, but it doesn't eliminate uncertainty entirely.

We quantify where we can, and we're honest about where we can't.

The Competitive Landscape

A word about what makes this different from existing umpire analysis.

UmpScorecards built a large following grading umpires on social media. Their methodology was never fully published, and the account has gone dormant. We publish our methodology and our data sources. Every number on our site traces back to a public Statcast CSV or MLB Stats API endpoint.

Baseball Savant provides the raw pitch data we build on. They are a data warehouse; we are an analysis layer on top of their data. Savant tells you where the pitch was. We tell you what the call cost.

FanGraphs publishes RE24 tables and run environment data that inform our constants. They don't produce game-level umpire scorecards. We operate in the space they leave empty.

Our differentiator is the combination: WPA-based impact (not just run value), leverage-weighted analysis, handedness splits, era-relative grading, five years of longitudinal data, and full transparency on every step.

See It in Action

Every game page on The Best Box Score includes an umpire scorecard tab when Statcast data is available. You'll see:

A strike zone SVG with each missed call plotted — filled dots for balls called strikes, hollow dots for strikes called balls
The accuracy grade and percentage
Net WPA favor (home vs. away)
Net run favor
Every missed call listed with its WPA impact, leverage label, and game context
Handedness splits (accuracy vs. LHB and RHB)

The umpire leaderboard shows full season rankings for 2021 through 2025 — accuracy, grades, net favor, and zone tendency for every qualified umpire. Click any name for a full profile with game-by-game logs and their worst calls of the season.

And if you want to go deeper on any specific formula or constant, the methodology page has the complete technical writeup.

The calls are public. The data is public. The analysis is ours. And now it's yours.