I would not declare a winner until each model has at least 30 to 50 settled pre-match predictions.
For now:
- Track every match.
- Exclude post-match reviews from accuracy.
- Compare cheap vs flagship models by cost per correct winner.
- Watch draw prediction rate.
- Add a baseline from betting markets or Elo.
- Update after each matchday.
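The checklist above can be sketched as a small scoring function. This is a minimal illustration, not the scoreboard I actually run: the `Prediction` fields, the `scoreboard` helper, and the `min_settled` threshold are all hypothetical names chosen for the example, and it assumes you log one API cost figure per prediction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    # All field names here are assumptions made for this sketch.
    model: str
    predicted: str            # "home", "draw", or "away"
    actual: Optional[str]     # None until the match is settled
    cost_usd: float           # API cost of this single prediction
    pre_match: bool = True    # False for post-match reviews


def scoreboard(predictions, min_settled=30):
    """Per-model metrics over settled pre-match predictions only."""
    rows = {}
    for p in predictions:
        # Post-match reviews and unsettled matches do not count toward accuracy.
        if not p.pre_match or p.actual is None:
            continue
        row = rows.setdefault(
            p.model, {"n": 0, "correct": 0, "draws": 0, "cost": 0.0}
        )
        row["n"] += 1
        row["correct"] += int(p.predicted == p.actual)
        row["draws"] += int(p.predicted == "draw")
        row["cost"] += p.cost_usd

    result = {}
    for model, r in rows.items():
        result[model] = {
            "settled": r["n"],
            "accuracy": r["correct"] / r["n"],
            # Share of predictions that called a draw, right or wrong.
            "draw_rate": r["draws"] / r["n"],
            # Undefined (None) until the model gets at least one winner right.
            "cost_per_correct": r["cost"] / r["correct"] if r["correct"] else None,
            # Flag models that are still below the sample-size floor.
            "enough_data": r["n"] >= min_settled,
        }
    return result
```

Rerunning this after each matchday keeps the numbers current, and the `enough_data` flag is a reminder not to read a ranking into a handful of matches. A market or Elo baseline can be fed through the same function as just another "model" with zero cost.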
If you want the full data-cited writeup and live links, I wrote the original breakdown here: AI World Cup Predictions 2026: 12 Models, Early Leaderboard.
Disclosure: I work on the research side at TokenMix, which is why I can wire this kind of multi-model scoreboard quickly.
Bottom line
The early World Cup AI leaderboard does not tell us which model is best yet.
It does tell us something useful: cheap models can match flagship consensus on obvious favorites, and all models can share the same bad prior on a draw.
That is a model-evaluation lesson, not betting advice.
If you were scoring this, would you reward exact score heavily, or focus on calibrated probabilities instead?