Premier League Betting Trial Shows AI Still Struggles Over Time

A new study from General Reasoning found that all eight frontier AI models it tested lost money in a simulated 2023/24 Premier League betting season. The results point to a persistent weakness in prolonged, sequential decision-making.

General Reasoning released KellyBench on April 9, framing it as a long-horizon benchmark built around the 2023/24 English Premier League season in which AI agents are asked to grow a betting bankroll over time.
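
The name is presumably a nod to the Kelly criterion, the classic rule for sizing bets to maximize long-run bankroll growth. A minimal sketch of that formula (illustrative only; the function and example numbers are ours, not the paper's):

```python
# Classic Kelly criterion for a single binary bet (illustration, not from the paper).

def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake on a bet that wins with probability p
    at the given decimal odds. A negative result means: do not bet."""
    b = decimal_odds - 1.0   # net payout per unit staked
    q = 1.0 - p              # probability of losing
    return (b * p - q) / b

# Example: a 55% chance at decimal odds of 2.0 -> stake ~10% of the bankroll.
print(kelly_fraction(0.55, 2.0))  # ~0.10
```

Bet more than this fraction and long-run growth suffers; bet far more and bankruptcy becomes near certain, a failure mode that shows up in the results below.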

How the Test Was Run

Each model received historical match and player data, public bookmaker odds, and a sandbox in which to build and adjust its own betting approach as the season progressed. Runs covered roughly 100 to 150 matchdays, and the agents had to keep making decisions as new results came in.

The benchmark used a normalized starting bankroll of £100,000 per run, and each model was tested across three seeds. Models had to place at least one bet per matchday, though they could still conserve capital by keeping stakes very small. The challenge was tough from the outset: the paper notes that bookmaker odds carried a margin of about 5.3%, a house edge agents had to overcome before any profit was possible.
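
The margin, also called the overround, is the amount by which the implied probabilities of a bookmaker's quoted odds sum to more than 1. A quick illustration (the prices below are hypothetical, not taken from the paper):

```python
# Bookmaker margin (overround) from decimal odds (illustration, not from the paper).

def overround(odds: list[float]) -> float:
    """Sum of implied probabilities minus 1. A fair book sums to exactly 1;
    anything above that is the bookmaker's built-in margin."""
    return sum(1.0 / o for o in odds) - 1.0

# Hypothetical home/draw/away prices for one match:
match_odds = [2.10, 3.34, 3.60]
print(f"{overround(match_odds):.1%}")  # 5.3%
```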

Losses Were Broad Across the Field

No model finished the benchmark with a positive average return. Claude Opus 4.6 posted the best result, still losing 11% on average across its three runs, and GPT-5.4 followed at minus 13.6%. Those two were the only models to avoid ruin across all three seeds.

The rest of the field fared worse. Gemini 3.1 Pro, Gemini Flash 3.1 LP, GLM-5, Kimi K2.5, Grok 4.20, and Arcee Trinity each hit ruin or forfeiture in at least one seed: some runs ended in bankruptcy, while others were counted as forfeits after the agent failed to continue.
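
Ruin of this kind is the predictable fate of over-staking: even a bettor with a genuine predictive edge will go bankrupt over a long season if each stake is too large a fraction of the bankroll. An illustrative simulation (not from the paper; the parameters are invented):

```python
# Over-staking turns a positive edge into near-certain ruin (illustration only).
import random

def simulate(stake_frac: float, p: float = 0.53, odds: float = 2.0,
             bets: int = 400, bankroll: float = 100_000.0) -> float:
    """Stake a fixed fraction of the bankroll on each of `bets` wagers,
    winning each with probability p at the given decimal odds."""
    for _ in range(bets):
        stake = bankroll * stake_frac
        bankroll += stake * (odds - 1.0) if random.random() < p else -stake
        if bankroll < 1.0:  # effectively bankrupt
            return 0.0
    return bankroll

random.seed(0)
print(simulate(0.05))  # modest stakes: the edge usually compounds
print(simulate(0.60))  # reckless stakes: almost certain ruin
```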

Only two of the 24 seeds finished in profit, and both came from models that went to ruin in other runs, which points to weak consistency rather than repeatable skill.

Why This Matters Beyond Sports Betting

According to the researchers, the problem lies in the gap between analysis and execution. KellyBench scored not only outcomes but also a process-based metric called sophistication, built with input from quantitative betting experts and graded on a 44-point rubric. No model scored more than a third of the available points, and the paper describes their strategies as unsophisticated relative to human baselines.

This study does not prove that humans can easily beat Premier League markets, nor does it resolve every concern surrounding autonomous AI. It does show, however, that strong performance on specialized tests does not guarantee stable behavior in unpredictable, changing conditions.
