Overview
- In a new paper and blog post published on September 5, OpenAI argues that evaluation incentives, not just limitations of training, drive confidently false answers in large language models.
- Proposed reforms would penalize confident errors and give partial credit for appropriate expressions of uncertainty, with a call to update widely used leaderboards that currently reward guessing (a scoring sketch follows this list).
- The research includes examples of a widely used chatbot inventing multiple dissertation titles and birthdays for Adam Tauman Kalai, one of the paper’s authors, to illustrate the problem.
- On the SimpleQA test, the older o4-mini showed higher nominal accuracy than GPT-5-thinking-mini but a 75% error rate, while GPT-5-thinking-mini recorded 22% accuracy with far fewer confident mistakes because it abstained much more often (the arithmetic is unpacked after this list).
- Nature’s review of new evaluations reports that, with web browsing enabled, GPT-5 hallucinated 0.8% of claims (1.4% offline) versus o3’s 5.1% (7.9% offline); offline, GPT-5 produced incorrect citations 39% of the time, about half GPT-4o’s rate, though OpenAI notes hallucinations persist and some questions are inherently unanswerable.
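To make the proposed scoring reform concrete, here is a minimal sketch of a threshold-based rubric along the lines OpenAI describes: correct answers earn 1 point, abstentions earn 0, and wrong answers are penalized t/(1-t) points for a stated confidence threshold t. The function name, signature, and default threshold are illustrative assumptions, not OpenAI’s implementation.

```python
def score_answer(answer, correct_answer, t=0.75):
    """Score one benchmark question under a confidence-threshold rubric.

    Illustrative rubric, in the spirit of the reforms described above:
    +1 for a correct answer, 0 for abstaining (answer is None), and
    -t/(1-t) for a wrong answer, so that answering only pays off when
    the model's confidence genuinely exceeds the threshold t.
    """
    if answer is None:            # the model said "I don't know"
        return 0.0
    if answer == correct_answer:  # exact-match grading, for illustration only
        return 1.0
    return -t / (1.0 - t)         # confident error is penalized


# With t = 0.75 a wrong answer costs -3 points, so blindly guessing on a
# four-option question has expected score 0.25*1 + 0.75*(-3) = -2, strictly
# worse than the 0 earned by abstaining. Binary pass/fail grading, by
# contrast, makes guessing the dominant strategy.
print(score_answer(None, "Paris"))     # 0.0  (abstained)
print(score_answer("Paris", "Paris"))  # 1.0  (correct)
print(score_answer("Lyon", "Paris"))   # -3.0 (confident error)
```

Under such a rubric, answering is only worthwhile when the model’s probability of being correct exceeds t, which is precisely the incentive the paper argues current binary-scored leaderboards fail to create.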
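The SimpleQA comparison above is easier to parse once you note that every question ends in exactly one of three outcomes, so the three rates are bound by a simple identity:

$$\text{accuracy} + \text{error rate} + \text{abstention rate} = 100\%$$

On the reported figures, o4-mini’s 75% error rate plus accuracy above 22% leaves under 3% of questions abstained, while GPT-5-thinking-mini’s much lower error rate implies that most of its non-correct responses were abstentions rather than confident wrong answers.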