Tennis Model v3: Fixing a Data Leak and Refreshing with 2025–2026 Data
Good modelling is as much about what you remove as what you add. This update to our tennis model is a case in point: we found a subtle but meaningful data leak, fixed it, retrained on two fresh seasons, and pruned features that were adding noise rather than signal. The result is tennis v3 — not a dramatic accuracy leap, but a more honest model that we trust more at the edges. It is live now for both ATP and WTA markets.
The leak: full-season stats used mid-season
Our model uses per-player serve and return statistics — first-serve win percentage, return win percentage — as features. These are genuine skill signals: a player's ability to hold serve or break their opponent's serve is stable and predictive. The problem was in how we stored and accessed those numbers.
We were storing serve stats as full, final-season aggregates. When the model trained on, say, a match played in June, it read the player's serve percentage for the entire season, including matches played in July, August, and beyond. The model was quietly peeking at the future. In the literature this is called a look-ahead leak, and it is one of the more insidious bugs in sports modelling because nothing breaks visibly: the model simply looks slightly better on historical data than it really is, and the error only surfaces when you audit exactly when each feature value became known.
The fix: prior-season stats only
The fix is straightforward in principle: replace the current-season aggregates with the prior completed season's serve stats. Before a match in June 2025, the model now sees only what the player achieved across the full 2024 season, data that was settled before the 2025 season even started. Every value is strictly point-in-time, with no future information.
This is also a sensible prior. Serve quality is one of the most stable year-over-year skills in tennis: a player who won 68% of first-serve points in 2024 is very likely to be close to that number in 2025. Using the prior season is both honest and accurate enough to preserve the signal. If anything, in-season aggregates early in a season are noisier (small samples) than a full prior-season figure, so the fix is not even a meaningful trade-off — it is strictly better data practice.
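The prior-season lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the `SEASON_STATS` store, the function names, the player name, and the numbers are all hypothetical, and it assumes a season maps to a calendar year.

```python
from datetime import date

# Hypothetical store of completed-season aggregates, keyed by (player, season).
# The values are made up for illustration.
SEASON_STATS = {
    ("Player A", 2024): {"first_serve_win_pct": 0.68, "return_win_pct": 0.39},
    ("Player A", 2025): {"first_serve_win_pct": 0.72, "return_win_pct": 0.41},
}


def point_in_time_stats(player, match_date, stats=SEASON_STATS):
    """Return the stats from the season before the match, or None if absent.

    Only the prior completed season is read, so nothing from the match's own
    season (which includes later matches) can reach the features.
    Assumption: one season per calendar year.
    """
    prior_season = match_date.year - 1
    return stats.get((player, prior_season))


def leaky_stats(player, match_date, stats=SEASON_STATS):
    """The buggy pattern: reads the full aggregate for the match's own season."""
    return stats.get((player, match_date.year))


june_match = date(2025, 6, 15)
features = point_in_time_stats("Player A", june_match)
print(features["first_serve_win_pct"])  # 0.68, the 2024 figure
```

Calling `leaky_stats` for the same June match would return the 2025 aggregate instead, which includes matches that had not yet been played on the match date.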
Honest result: integrity first, small accuracy gain
Because Elo, the overall strength rating, is the dominant signal in tennis prediction, and serve stats are a supplementary, Elo-correlated feature, closing the leak does not produce a dramatic accuracy jump. What it does produce is accuracy that we can actually trust. The head-to-head Brier score improved from 0.1746 to 0.1726 (ATP) and from 0.1523 to 0.1508 (WTA). Lower is better: both improvements are real but modest, roughly 1% in relative terms on each tour. We are not going to dress a data integrity fix as a step-change in performance.
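For readers unfamiliar with the metric, here is a minimal sketch of a Brier score calculation. It assumes the common binary form (mean squared difference between the predicted win probability and the 0/1 result); the function name and the sample numbers are illustrative only.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted win probabilities and 0/1 results."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Illustrative predictions for three matches (1 = the predicted player won).
predicted = [0.80, 0.55, 0.30]
results = [1, 0, 0]
score = brier_score(predicted, results)
print(round(score, 4))  # 0.1442

# Reference point: always predicting 50% scores exactly 0.25.
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```

Against that 0.25 coin-flip baseline, scores in the 0.15 to 0.17 range indicate a model that is well ahead of guessing, and changes in the third decimal place are small.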
We also tested a more ambitious idea: using serve stats to adjust our totals (games) predictions, not just head-to-head match winner. A player's serve dominance should in theory affect total games played. On the leaky historical data the adjustment looked promising. When we re-ran the test on the corrected, point-in-time data, the apparent benefit vanished — the signal had been an artefact of the leak. We cut the idea rather than ship something that only looked good on compromised data. That is the right call, even if it means one fewer feature.
Fresh data and per-tour feature pruning
Alongside the leak fix, v3 also incorporates two seasons of data that were missing from the previous training set: the complete 2025 season and the 2026 season to date. Both ATP and WTA Elo ratings and serve stats have been recalculated from scratch on the updated dataset, so every historical rating now reflects the full, updated match record.
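Recalculating ratings from scratch amounts to replaying every match in date order. The sketch below uses the textbook Elo update, which is an assumption: the K-factor of 32, the starting rating of 1500, and the 400-point scale are standard defaults rather than our production settings, and the player names are placeholders.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Textbook Elo win probability for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def replay_elo(matches, k=32.0, start=1500.0):
    """Replay (iso_date, winner, loser) tuples in date order.

    Returns the final ratings plus a pre-match snapshot per match, so each
    training row only sees ratings that were known before that match.
    """
    ratings = defaultdict(lambda: start)
    snapshots = []
    for day, winner, loser in sorted(matches, key=lambda m: m[0]):
        r_win, r_lose = ratings[winner], ratings[loser]
        snapshots.append((day, winner, loser, r_win, r_lose))
        delta = k * (1.0 - expected_score(r_win, r_lose))
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return dict(ratings), snapshots


# Deliberately out of order: the replay sorts by date before updating.
matches = [("2025-01-10", "A", "B"), ("2025-01-05", "B", "C")]
ratings, snapshots = replay_elo(matches)
print(snapshots[0][0])  # 2025-01-05
print(ratings["C"])     # 1484.0
```

Storing the pre-match snapshot rather than the final rating applies the same point-in-time principle to Elo that the leak fix applies to serve stats.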
We also ran a systematic feature pruning exercise, separately for ATP and WTA. Not every feature that helps one tour helps the other — tennis is a meaningfully different sport for men and women in terms of surface effects, serve dominance, and the weight of different match contexts. Features that were statistically noise on one tour were dropped for that tour only:
- ATP: pruned down from 30 to 18 features — twelve feature groups removed, saving the model from fitting noise that was specific to the ATP training set.
- WTA: pruned from 30 to 28 — a lighter touch, removing only a literal duplicate feature. Broader pruning regressed WTA accuracy, so we kept the full set.
- Both tours: features validated as backbone signals include days-since-last-match, recent match load, return win rate, first-serve win rate, and surface-specific form.
This kind of per-tour discipline matters: a feature that adds noise in ATP training is not merely useless, it actively degrades the model's ability to find the real signals. A smaller set of clean features beats a larger set of noisy ones at the same sample size.
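The simplest pruning case, a literal duplicate feature, is also the easiest to check mechanically. The following is a hedged sketch of one way to find columns with identical values; the function and the feature names are hypothetical and do not describe how our pipeline detected the duplicate.

```python
def find_duplicate_features(columns):
    """Find features whose values are identical to an earlier feature.

    columns: dict mapping feature name -> list of values (one per match).
    Returns a list of (kept_name, duplicate_name) pairs.
    """
    seen = {}
    duplicates = []
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return duplicates


# Hypothetical feature columns for three matches.
columns = {
    "first_serve_win_pct": [0.68, 0.71, 0.64],
    "days_since_last_match": [3, 14, 7],
    "fs_win_pct_copy": [0.68, 0.71, 0.64],
}
print(find_duplicate_features(columns))
# [('first_serve_win_pct', 'fs_win_pct_copy')]
```

Noise features that are not exact duplicates need a different test, such as comparing held-out Brier scores with and without the feature, run separately per tour.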
What this means for value bets
The goal of any model update is not to impress on a benchmark — it is to make our probability estimates more trustworthy so that the closing line value we capture on live bets reflects real edge rather than artefacts. A model with a subtle leak may find apparent value that a corrected model does not. We would rather lose a handful of marginal bets than systematically bet on a mirage.
In practice, the market impact is small — v3 probabilities are close to v2 on most matches. But at the edges, particularly for matches involving players whose serve quality diverged sharply between seasons, the corrections add up. You can track live model performance on the model page.
Tennis v3 is live for both ATP and WTA head-to-head markets. WTA totals remain available with the shrink-to-Pinnacle adjustment we shipped earlier this year, which addresses a separate calibration issue in the game-totals model. Past performance is not a guarantee of future results; bet responsibly, 18+ only.