
Calculating Linear Weights Without Play-by-Play Data

Jan 5, 2026 · 7 min read · by myCPBL

Linear Weights form the mathematical backbone of modern baseball statistics. Simply put, they answer the question: "How many runs is each batting event worth?" A home run is worth about 1.4 runs, while an out costs about -0.27 runs. These weights are essential for calculating wOBA, wRC+, and ultimately WAR.
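To make the weighted-sum idea concrete, here is a minimal sketch in Python. The event weights and season totals below are illustrative values for demonstration, not official myCPBL figures:

```python
# Illustrative linear weights (runs per event); actual values vary by era.
WEIGHTS = {"1B": 0.47, "2B": 0.77, "3B": 1.04, "HR": 1.40, "BB": 0.31, "OUT": -0.27}

def batting_runs(events):
    """Weighted sum: each event count times its run value."""
    return sum(WEIGHTS[e] * n for e, n in events.items())

# A hypothetical batter's season totals
season = {"1B": 100, "2B": 25, "3B": 3, "HR": 20, "BB": 50, "OUT": 400}
print(round(batting_runs(season), 2))
```

Statistics like wOBA rescale these same weights so the result reads on an on-base-percentage scale, but the underlying arithmetic is exactly this weighted sum.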

But here's the problem: how do you calculate Linear Weights when you don't have play-by-play data? For the CPBL, detailed event logs only exist from 2018 onwards. What about the previous 28 seasons?


The Regression Approach (And Why It Fails)

The most intuitive approach is regression. Take team-level statistics—singles, doubles, home runs, walks, outs—and regress them against runs scored. The coefficients become your Linear Weights.

This method is elegant and requires only box score data, which exists for every season. But it has a fatal flaw: it severely underestimates the cost of an out.
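The mechanics are a straightforward least-squares fit. The sketch below uses synthetic team totals (all column scales and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60  # synthetic "team-seasons"; all numbers below are made up

# Columns: singles, doubles, home runs, walks, outs
X = np.column_stack([
    rng.normal(950, 60, n),
    rng.normal(230, 25, n),
    rng.normal(120, 30, n),
    rng.normal(480, 50, n),
    rng.normal(3900, 80, n),
])
w_true = np.array([0.47, 0.77, 1.40, 0.31, -0.27])  # generative weights
y = X @ w_true + rng.normal(0, 15, n)               # runs scored + noise

# Demeaning the columns is equivalent to fitting with an intercept
Xc, yc = X - X.mean(axis=0), y - y.mean()
coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
print(coef.round(3))
```

On clean synthetic data the fit recovers the generative weights, which is the point: the failure on real data is not in the algebra but in the inputs. Real team-season out totals barely vary (every nine-inning game supplies about 27 outs per side), so the regression has almost no signal from which to learn the out coefficient.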

Look at the comparison above. Regression estimates the Out value at around -0.04 runs, while the ground truth (calculated from actual play-by-play data) shows it's closer to -0.27 runs. That's nearly a 7x difference.

Why does regression fail so badly at estimating outs? The answer lies in a fundamental constraint of baseball: every half-inning must have exactly three outs. This creates an opportunity cost that regression cannot capture.

Outs don't just fail to produce runs—they consume one of your three precious opportunities per inning. Regression treats outs as "events that produced no runs," but they're actually "events that used up a limited resource."


The RE-Based Approach (Ground Truth)

The gold standard for calculating Linear Weights uses the Run Expectancy (RE) matrix. For each base-out state (e.g., runner on first with one out), we calculate the expected runs until the end of the inning. The run value of each event is the change in run expectancy plus any runs scored.

This method captures the true value of outs because it explicitly accounts for the state changes. A single with no outs might increase run expectancy by 0.4 runs, while a single with two outs might only add 0.2 runs—and an out always decreases run expectancy significantly.
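The bookkeeping can be sketched as follows. The RE values here are made up for a handful of states; a real matrix covers all 24 base-out states and is estimated from play-by-play data:

```python
# Made-up run-expectancy values for a few base-out states; the real
# 24-state matrix is estimated from play-by-play data.
RE = {
    ("empty", 0): 0.49, ("empty", 1): 0.26, ("empty", 2): 0.10,
    ("1st", 0): 0.86,  ("1st", 1): 0.51,  ("1st", 2): 0.22,
}

def run_value(start, end, runs_scored, inning_ended=False):
    """Run value = RE(end) - RE(start) + runs scored; RE is 0 after out 3."""
    re_end = 0.0 if inning_ended else RE[end]
    return re_end - RE[start] + runs_scored

# A leadoff single: empty/0 outs -> runner on 1st/0 outs, no run scores
print(round(run_value(("empty", 0), ("1st", 0), 0), 2))
# An out with a runner on 1st and 2 outs ends the inning
print(round(run_value(("1st", 2), None, 0, inning_ended=True), 2))
```

Averaging these per-play run values over every occurrence of an event type in a season yields that season's linear weight for the event.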

The problem? You need play-by-play data to build an RE matrix. For the CPBL, that means we can only calculate "true" Linear Weights for 2018-2025.


Our Solution: Monte Carlo Simulation + Calibration

Our approach combines simulation with empirical calibration. Here's how it works:

  1. Build a Markov Chain model of baseball transitions using league-average probabilities (singles, doubles, advancement rates, etc.)
  2. Run Monte Carlo simulations of thousands of plate appearances for each event type, tracking the run expectancy changes
  3. Calculate calibration factors by comparing simulation results to RE-based ground truth (2018-2025)
  4. Apply calibration to adjust simulation results, accounting for any systematic bias
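Steps 1 and 2 can be sketched as a toy half-inning simulator. The event probabilities and baserunner-advancement rules below are drastic simplifications of the real model, which also tracks advancement rates:

```python
import random

# Made-up league-average event probabilities per plate appearance
P = {"OUT": 0.68, "BB": 0.08, "1B": 0.15, "2B": 0.05, "3B": 0.01, "HR": 0.03}

def simulate_inning(rng):
    """Play one half-inning; return runs scored. bases = [1st, 2nd, 3rd]."""
    runs, outs, bases = 0, 0, [0, 0, 0]
    while outs < 3:
        event = rng.choices(list(P), weights=list(P.values()))[0]
        if event == "OUT":
            outs += 1
        elif event in ("BB", "1B"):   # simplification: everyone moves up one base
            runs += bases[2]
            bases = [1, bases[0], bases[1]]
        elif event == "2B":           # simplification: runners advance two bases
            runs += bases[1] + bases[2]
            bases = [0, 1, bases[0]]
        elif event == "3B":
            runs += sum(bases)
            bases = [0, 0, 1]
        else:                          # HR clears the bases
            runs += sum(bases) + 1
            bases = [0, 0, 0]
    return runs

rng = random.Random(2026)  # deterministic seed, as in the methodology
mean_runs = sum(simulate_inning(rng) for _ in range(10_000)) / 10_000
print(round(mean_runs, 2))
```

The real model runs this kind of chain from every base-out state, which is what lets it recover run-expectancy changes (and hence linear weights) without any observed play-by-play.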

The key insight is that while absolute values may drift, the relative relationships between events remain stable. Our calibration factors capture this systematic offset and correct for it.
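The post doesn't specify whether the calibration is additive or multiplicative; here is an additive sketch with made-up home-run values for the eight calibration years:

```python
import statistics

# Made-up simulated vs. RE-based ("ground truth") HR values, 2018-2025
sim_vals = [1.31, 1.29, 1.34, 1.30, 1.28, 1.33, 1.32, 1.30]
re_vals  = [1.42, 1.39, 1.45, 1.41, 1.38, 1.44, 1.43, 1.40]

# Per-year additive offsets: their mean corrects the systematic bias,
# their variance feeds into the confidence intervals later
offsets = [r - s for r, s in zip(re_vals, sim_vals)]
cal_mean = statistics.mean(offsets)
cal_var = statistics.variance(offsets)

def calibrate(sim_value):
    """Shift a raw simulated run value by the mean systematic offset."""
    return sim_value + cal_mean

print(round(calibrate(1.30), 3))
```

Because the offsets are nearly constant year to year in this toy example, the correction is stable, which mirrors the claim that relative relationships between events hold even when absolute values drift.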


Validation: 97.5% Coverage

How do we know our method works? We validate it against the RE-based ground truth for 2018-2025. For each event type and year, we check whether the RE-based value falls within our simulation's 95% confidence interval.

97.5% CI coverage (78/80 values within CI)

Of 80 event-year combinations (10 events x 8 years), 78 fall within our 95% confidence intervals. This 97.5% coverage gives us confidence that our simulation methodology produces valid estimates—even for years without play-by-play data.
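The coverage check itself is simple counting. Reproducing the published 78/80 tally would require the full table, so the triples below are toy values that just illustrate the test:

```python
def ci_coverage(estimates):
    """Fraction of (truth, lo, hi) triples where truth falls inside the CI."""
    hits = sum(1 for truth, lo, hi in estimates if lo <= truth <= hi)
    return hits / len(estimates)

# Toy triples: (RE-based value, CI lower, CI upper); the third one misses
checks = [(-0.27, -0.30, -0.24), (1.41, 1.35, 1.46), (0.31, 0.33, 0.36)]
print(ci_coverage(checks))
```

For a well-calibrated 95% interval, coverage near 95% is exactly what we should see; 78/80 = 97.5% is consistent with that target.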


Historical Trends (1990-2025)

With our validated methodology, we can now estimate Linear Weights for the entire history of the CPBL. Click on "Historical Trend" in the visualization above to explore how run values have changed over 36 seasons.

Some patterns emerge:

  • Home run values fluctuate with offensive environments—higher in pitcher-friendly eras (they're more valuable when scarce)
  • Out values have become more costly over time as run scoring has declined
  • Walk values remain stable around 0.3 runs across all eras

Technical Implementation

For reproducibility, our simulation uses seed-based random number generation. Each year's simulations use a deterministic seed, ensuring consistent results across runs while still capturing uncertainty through multiple iterations.
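The exact seeding scheme isn't published; one hypothetical pattern that delivers the described behavior is to derive each year's seed deterministically:

```python
import random

def year_seed(year, base=20260105):
    """Hypothetical scheme: a fixed base offset by the season year."""
    return base + year

# Same seed -> identical random stream, so reruns reproduce exactly
rng_a = random.Random(year_seed(1995))
rng_b = random.Random(year_seed(1995))
assert rng_a.random() == rng_b.random()
```

Uncertainty still enters through the 20 independent runs per year, each of which can use its own derived seed.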

The confidence intervals incorporate two sources of uncertainty:

  1. Simulation variance: Natural variation across 20 independent simulation runs
  2. Calibration uncertainty: Year-to-year variation in calibration factors (from 2018-2025 data)

The combined variance formula is: Total Variance = Sim Variance + Calibration Variance
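Turning that formula into an interval is one line of arithmetic; the variance components below are made-up numbers for an Out-value estimate:

```python
import math

def confidence_interval(mean, sim_var, cal_var, z=1.96):
    """95% CI: half-width combines both variance sources, per the formula above."""
    half = z * math.sqrt(sim_var + cal_var)
    return mean - half, mean + half

# Hypothetical Out-value estimate with made-up variance components
lo, hi = confidence_interval(-0.27, sim_var=0.0002, cal_var=0.0003)
print(round(lo, 3), round(hi, 3))
```

Adding the variances (rather than, say, the standard deviations) treats the two error sources as independent, which is the standard assumption when combining them.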


Conclusion

Linear Weights are fundamental to advanced baseball statistics, but calculating them without play-by-play data is challenging. Our Monte Carlo simulation approach, calibrated against ground truth data, provides a principled method for estimating historical run values.

The 97.5% CI coverage validates our methodology. And unlike regression, our approach correctly captures the true cost of an out—the most important (and most underestimated) event in baseball.

These Linear Weights are now integrated into myCPBL's wOBA, wRC+, and WAR calculations, providing historically grounded statistics for all 36 seasons of the CPBL.


Methodology Note

Monte Carlo simulation with Markov Chain transitions. 20 runs per year, 10,000 PA samples per event type. Calibration factors derived from 2018-2025 RE-based ground truth. 95% CI = mean +/- 1.96 * sqrt(sim_variance + calibration_variance).