Premier League Betting Trial Shows AI Still Struggles Over Time

A new study from General Reasoning found that all eight frontier AI models it tested lost money in a simulated 2023/24 Premier League betting season. The results point to a persistent weakness in prolonged, sequential decision-making.

General Reasoning released KellyBench on April 9, framing it as a long-horizon benchmark built around the 2023/24 English Premier League season in which AI agents are asked to grow a betting bankroll over time.
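
The name is presumably a nod to the Kelly criterion, the classic rule for sizing bets to maximize long-run bankroll growth. A minimal sketch of that formula (illustrative only; the function and example numbers are ours, not the paper's):

```python
# Classic Kelly criterion for a single binary bet (illustration, not from the paper).

def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake on a bet that wins with probability p
    at the given decimal odds. A negative result means: do not bet."""
    b = decimal_odds - 1.0   # net payout per unit staked
    q = 1.0 - p              # probability of losing
    return (b * p - q) / b

# Example: a 55% chance at decimal odds of 2.0 -> stake ~10% of the bankroll.
print(kelly_fraction(0.55, 2.0))  # ~0.10
```

Bet more than this fraction and long-run growth suffers; bet far more and bankruptcy becomes near certain, a failure mode that shows up in the results below.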

How the Test Was Run

Each model received historical match and player data, public bookmaker odds, and a sandbox in which to build and adjust its own betting approach as the season progressed. Runs covered roughly 100 to 150 matchdays, and the agents had to keep making decisions as new results came in.

The benchmark used a normalized starting bankroll of £100,000 per run, and each model was tested across three seeds. Models had to place at least one bet per matchday, though they could still conserve capital by keeping stakes very small. The challenge was tough from the outset: the paper notes that bookmaker odds carried a margin of about 5.3%, a house edge agents had to overcome before any profit was possible.
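
The margin, also called the overround, is the amount by which the implied probabilities of a bookmaker's quoted odds sum to more than 1. A quick illustration (the prices below are hypothetical, not taken from the paper):

```python
# Bookmaker margin (overround) from decimal odds (illustration, not from the paper).

def overround(odds: list[float]) -> float:
    """Sum of implied probabilities minus 1. A fair book sums to exactly 1;
    anything above that is the bookmaker's built-in margin."""
    return sum(1.0 / o for o in odds) - 1.0

# Hypothetical home/draw/away prices for one match:
match_odds = [2.10, 3.34, 3.60]
print(f"{overround(match_odds):.1%}")  # 5.3%
```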

Losses Were Broad Across the Field

No model finished the benchmark with a positive average return. Claude Opus 4.6 posted the best result, still losing 11% on average across its three runs, and GPT-5.4 followed at minus 13.6%. Those two were the only models to avoid ruin across all three seeds.

The rest of the field fared worse. Gemini 3.1 Pro, Gemini Flash 3.1 LP, GLM-5, Kimi K2.5, Grok 4.20, and Arcee Trinity each hit ruin or forfeiture in at least one seed: some runs ended in bankruptcy, while others were counted as forfeits after the agent failed to continue.
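
Ruin of this kind is the predictable fate of over-staking: even a bettor with a genuine predictive edge will go bankrupt over a long season if each stake is too large a fraction of the bankroll. An illustrative simulation (not from the paper; the parameters are invented):

```python
# Over-staking turns a positive edge into near-certain ruin (illustration only).
import random

def simulate(stake_frac: float, p: float = 0.53, odds: float = 2.0,
             bets: int = 400, bankroll: float = 100_000.0) -> float:
    """Stake a fixed fraction of the bankroll on each of `bets` wagers,
    winning each with probability p at the given decimal odds."""
    for _ in range(bets):
        stake = bankroll * stake_frac
        bankroll += stake * (odds - 1.0) if random.random() < p else -stake
        if bankroll < 1.0:  # effectively bankrupt
            return 0.0
    return bankroll

random.seed(0)
print(simulate(0.05))  # modest stakes: the edge usually compounds
print(simulate(0.60))  # reckless stakes: almost certain ruin
```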

Only two of the 24 seeds finished in profit, and both came from models that went to ruin in other runs, which points to weak consistency rather than repeatable skill.

Why This Matters Beyond Sports Betting

According to the researchers, the problem lies in the gap between analysis and execution. KellyBench scored not only outcomes but also a process-based metric called sophistication, built with input from quantitative betting experts and graded on a 44-point rubric. No model scored more than a third of the available points, and the paper describes their strategies as unsophisticated relative to human baselines.

This study does not prove that humans can easily beat Premier League markets, nor does it resolve every concern surrounding autonomous AI. It does show, however, that strong performance on specialized tests does not guarantee stable behavior in unpredictable, changing conditions.
