Why Model Calibration Matters More Than Accuracy in Football Predictions
A model that's 60% accurate sounds impressive — until you realize it says "70% confident" on predictions that only come true 55% of the time. That gap between stated confidence and actual reliability is the calibration problem, and it's the difference between a useful model and a dangerous one.
Accuracy vs Calibration
Most people evaluate prediction models by accuracy: "What percentage of predictions were correct?" This is intuitive but deeply misleading for probabilistic predictions.
Consider two models predicting 100 football matches:
| Model | Accuracy | Behavior |
|---|---|---|
| Model A | 54% | Predicts the home team every time |
| Model B | 54% | Outputs calibrated probabilities for each match |
Both have the same accuracy, but Model B is far more useful. When Model B says "75% home win," it means that among all matches where it predicted ~75%, roughly 75% actually resulted in home wins. Model A gives you no such information — it's just always picking home.
Accuracy tells you how often the model is right. Calibration tells you how much you can trust the probabilities it outputs. For anyone making decisions based on those probabilities — especially financial decisions — calibration is what matters.
What Is Calibration?
A model is perfectly calibrated if, for every probability it outputs, the actual frequency of the event matches that probability. Formally:
P(event occurs | predicted probability = p) = p
for all values of p between 0 and 1.
In practice, this means:
- When the model says "60% probability," the event should occur ~60% of the time
- When the model says "80% probability," the event should occur ~80% of the time
- When the model says "30% probability," the event should occur ~30% of the time
This is visualized using a reliability diagram (calibration curve): you plot predicted probability on the x-axis against observed frequency on the y-axis. A perfectly calibrated model produces a diagonal line from (0,0) to (1,1).
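As a sketch, scikit-learn's `calibration_curve` computes the points of such a diagram. The data below is synthetic, generated so that outcomes genuinely occur at the stated probabilities, which is what a well-calibrated model's output looks like:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Synthetic example: 1,000 predicted probabilities, with outcomes
# drawn so the true frequency matches the stated probability.
probs = rng.uniform(0.05, 0.95, size=1000)
outcomes = (rng.uniform(size=1000) < probs).astype(int)

# Observed frequency per bin (y-axis) vs mean predicted probability (x-axis)
frac_pos, mean_pred = calibration_curve(outcomes, probs, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted ~{p:.2f} -> observed {f:.2f}")
```

Plotting `frac_pos` against `mean_pred` should hug the diagonal; systematic deviations are the miscalibration patterns discussed below.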
Measuring Calibration
Brier Score
The most common metric for evaluating probabilistic predictions is the Brier score, introduced by Glenn Brier in 1950:
BS = (1/N) × Σ_i (p_i − o_i)²
Where p_i is the predicted probability for prediction i and o_i is the actual outcome (1 if the event occurred, 0 otherwise). Lower is better. Range: 0 (perfect) to 1 (worst).
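The formula is a one-liner in practice. A minimal sketch with three hypothetical match predictions:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# Three hypothetical home-win predictions and their results (1 = home win)
print(brier_score([0.7, 0.4, 0.9], [1, 0, 1]))  # (0.09 + 0.16 + 0.01) / 3 ≈ 0.0867
```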
The Brier score can be decomposed into three components:
| Component | What It Measures | Goal |
|---|---|---|
| Calibration (reliability) | How close predicted probabilities are to observed frequencies | Minimize |
| Resolution | How far outcome frequencies within each forecast bin deviate from the base rate | Maximize |
| Uncertainty | Inherent unpredictability of the events (not controllable) | — |
A model can have good calibration but poor resolution (always predicting ~33% for each outcome in a 3-way market), or good resolution but poor calibration (making extreme predictions that don't match reality). The best models have both good calibration and good resolution.
Calibration Error
Expected Calibration Error (ECE) provides a more direct measure. It bins predictions by confidence level and computes the weighted average difference between predicted and observed frequencies:
ECE = Σ_b (n_b / N) × |avg(p_b) − avg(o_b)|
Where b indexes the bins, n_b is the number of predictions in bin b, and avg(p_b) and avg(o_b) are the mean predicted probability and the observed outcome rate in that bin.
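A minimal implementation of this binned estimate (the bin count and the tiny example data are illustrative):

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin predictions by confidence; return the bin-size-weighted
    average gap between mean predicted and observed frequency."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins, except the last which includes 1.0
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is n_b / N
    return float(ece)

# Two predictions in the 90-100% bin, two in the 0-10% bin
print(expected_calibration_error([0.95, 0.95, 0.05, 0.05], [1, 1, 0, 0]))  # ≈ 0.05
```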
Why Calibration Matters for Betting
This is where calibration becomes a financial issue. If a model says a team has a 60% chance of winning, and the bookmaker offers odds implying 55%, that looks like a value bet — a 5% edge. But what if the model is overconfident and the true probability is actually 53%? Now you're betting into negative expected value.
Model says: 60% → Implied odds: 1.67
Bookmaker offers: 1.82 (implied 55%) → Looks like +5% value
True probability: 53% → Actual edge: −2% (losing bet long-term)
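The same arithmetic in expected-value terms: with decimal odds, the expected profit per unit staked is p × odds − 1, so the 2-point probability gap above works out to roughly a 3.5% expected loss per unit at odds of 1.82:

```python
def expected_value(prob, decimal_odds):
    """Expected profit per unit staked at the given decimal odds."""
    return prob * decimal_odds - 1.0

odds = 1.82                         # bookmaker price, implied probability ~0.55
print(expected_value(0.60, odds))   # model's view: positive EV (~ +0.092)
print(expected_value(0.53, odds))   # true probability: negative EV (~ -0.035)
```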
An overconfident model systematically identifies "value" that doesn't exist. Over hundreds of bets, this destroys your bankroll. A well-calibrated model, even if slightly less accurate, gives you reliable probability estimates that you can actually use for decision-making.
This is why ExPrysm focuses on calibration as a primary metric. A model that says "65%" and means it is infinitely more useful than one that says "75%" but is only right 60% of the time.
How ExPrysm Calibrates Models
ExPrysm uses several approaches to ensure calibrated probability outputs:
CatBoost Native Probabilities
CatBoost, the gradient boosting framework used by ExPrysm, produces well-calibrated probabilities natively — better than most other tree-based models. This is because CatBoost uses ordered boosting and symmetric trees that reduce overfitting, which is a primary cause of miscalibration.
The match result model uses class_weights=[1.0, 1.3, 1.0] to slightly upweight draws during training. This addresses the known issue that draws are the hardest outcome to predict and are often underrepresented in model confidence.
Isotonic Regression
For post-hoc calibration, isotonic regression is a non-parametric method that learns a monotonic mapping from raw model scores to calibrated probabilities. It works by fitting a step function that minimizes the squared error between predicted and observed frequencies, subject to the constraint that the function is non-decreasing.
The advantage over parametric methods is that isotonic regression makes no assumptions about the shape of the calibration curve — it can correct any pattern of miscalibration.
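A sketch using scikit-learn's `IsotonicRegression` on synthetic overconfident scores. The data-generating process here is an assumption for illustration, not ExPrysm's pipeline: raw scores overstate how far the true probability sits from 0.5.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
# Synthetic overconfident scores: the true probability is pulled toward 0.5
raw = rng.uniform(0.05, 0.95, size=2000)
true_p = 0.5 + 0.6 * (raw - 0.5)            # model overstates distance from 0.5
outcomes = (rng.uniform(size=2000) < true_p).astype(int)

# Learn a non-decreasing mapping from raw score to calibrated probability
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw, outcomes)

print(iso.predict([0.9]))  # calibrated estimate for a raw score of 0.9
```

Because the fitted function is constrained to be non-decreasing, ordering of predictions is preserved while the probability values themselves are corrected.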
Platt Scaling
Platt scaling fits a logistic regression on the model's raw outputs to produce calibrated probabilities. It's simpler than isotonic regression and works well when the miscalibration follows a sigmoid pattern. It's particularly useful for binary outcomes like BTTS or Over/Under markets.
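A comparable sketch of Platt scaling with scikit-learn's `LogisticRegression`, again on synthetic data. Fitting on the log-odds of the raw score is a common choice for probability outputs, not necessarily ExPrysm's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_logit(p):
    """Convert probabilities to log-odds, shaped for scikit-learn."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p)).reshape(-1, 1)

rng = np.random.default_rng(2)
# Synthetic raw scores from a hypothetical overconfident binary model
raw = rng.uniform(0.05, 0.95, size=2000)
true_p = 0.5 + 0.5 * (raw - 0.5)            # true probability is less extreme
outcomes = (rng.uniform(size=2000) < true_p).astype(int)

# Platt scaling: a logistic regression from raw log-odds to outcome
platt = LogisticRegression()
platt.fit(to_logit(raw), outcomes)

calibrated = platt.predict_proba(to_logit([0.9]))[:, 1]
print(calibrated[0])  # calibrated estimate for a raw score of 0.9
```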
Reading a Calibration Curve
A calibration curve (reliability diagram) is the most intuitive way to assess model quality. Here's how to read one:
| Pattern | Meaning | Implication |
|---|---|---|
| Points on diagonal | Perfect calibration | Predicted probabilities match reality |
| Points above diagonal | Underconfident | Model says 50% but events happen 60% — conservative |
| Points below diagonal | Overconfident | Model says 70% but events happen 55% — dangerous |
| S-shaped curve | Mixed | Underconfident at extremes, overconfident in middle (or vice versa) |
For betting purposes, overconfidence is the most dangerous pattern. An overconfident model makes you think you have an edge when you don't. Underconfidence is less harmful — you might miss some value bets, but you won't systematically lose money.
Bin 30-40%: Model predicted ~35%, actual outcome rate = 33% ✓
Bin 50-60%: Model predicted ~55%, actual outcome rate = 57% ✓
Bin 70-80%: Model predicted ~75%, actual outcome rate = 73% ✓
Each bin's observed frequency is within a few percentage points of the predicted average — that's a well-calibrated model.
ExPrysm's Calibration Results
ExPrysm publishes calibration curves for all major markets on the Performance page. These curves are generated from real prediction data across 7,800+ matches and are updated regularly.
Key points about ExPrysm's calibration:
- Publicly available: Unlike most prediction services, ExPrysm's calibration data is visible to all users. You can verify the model's reliability yourself.
- Market-level granularity: Separate calibration curves are provided for match result (1X2), BTTS, Over/Under, and other markets. Each market has different calibration characteristics.
- Continuous monitoring: Calibration is tracked over time to detect drift. If the model becomes miscalibrated due to changing football dynamics, it's caught early.
- No cherry-picking: All predictions are included in calibration analysis — not just the ones the model got right. This is critical for honest evaluation.
View ExPrysm's live calibration curves and Brier scores on the Performance page. All data is from real predictions, not backtests.
Conclusion
Accuracy is the metric everyone asks about. Calibration is the metric that actually matters. A well-calibrated model gives you probabilities you can trust and act on. An uncalibrated model — no matter how "accurate" — can lead you to systematically bad decisions.
ExPrysm prioritizes calibration through CatBoost's native probability estimation, post-hoc calibration techniques, and transparent public reporting of calibration curves. When the model says 65%, it means 65% — and that's the foundation everything else is built on.
Want to understand how confidence scores translate to betting decisions? Read our How to Choose Football Bets guide.