Particle.news


OpenAI Urges Overhaul of AI Benchmarks to Curb Hallucinations by Rewarding Uncertainty

OpenAI says accuracy-only leaderboards push models to guess rather than admit they don’t know.

Overview

  • In a new paper and blog post published on September 5, OpenAI argues that evaluation incentives—not just training limits—drive confidently false answers in large language models.
  • Proposed reforms would penalize confident errors and give partial credit for appropriate uncertainty, with a call to update widely used scoreboards that currently reward guessing.
  • The research includes examples of a widely used chatbot inventing multiple dissertation titles and birthdays for author Adam Tauman Kalai to illustrate the problem.
  • On the SimpleQA test, the older o4-mini posted higher nominal accuracy yet a 75% error rate, while GPT-5-thinking-mini recorded only 22% accuracy but made far fewer confident mistakes because it abstained more often.
  • Nature’s review of new evaluations reports that with web browsing enabled GPT-5 hallucinated 0.8% of claims (1.4% offline), versus 5.1% (7.9% offline) for o3. Offline, GPT-5 produced incorrect citations 39% of the time, roughly half GPT-4o’s rate, though OpenAI notes hallucinations persist and some questions are inherently unanswerable.
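The scoring reform described above can be sketched in a few lines. This is an illustrative toy, not OpenAI's published rubric: the exact weights (wrong answer = -1, abstention = 0) are assumptions chosen to show how an accuracy-only leaderboard rewards guessing while a penalty-aware score rewards admitting uncertainty.

```python
def accuracy_only(answers):
    """Current leaderboard style: only correct answers count; guessing is free."""
    return sum(1 for a in answers if a == "correct") / len(answers)

def penalty_aware(answers, wrong_penalty=1.0):
    """Reformed style: confident errors subtract points, abstentions score zero."""
    score = 0.0
    for a in answers:
        if a == "correct":
            score += 1.0
        elif a == "wrong":
            score -= wrong_penalty
        # "abstain" contributes nothing either way
    return score / len(answers)

# Hypothetical models over 100 questions (numbers loosely echo the article):
# a model that always guesses vs. one that abstains when unsure.
guesser = ["correct"] * 30 + ["wrong"] * 70
cautious = ["correct"] * 22 + ["wrong"] * 3 + ["abstain"] * 75

print(accuracy_only(guesser), accuracy_only(cautious))    # 0.3 0.22
print(penalty_aware(guesser), penalty_aware(cautious))    # -0.4 0.19
```

Under accuracy alone the guesser looks better (0.30 vs 0.22); once confident errors carry a penalty, the cautious model wins (0.19 vs -0.40), which is the incentive flip the paper argues for.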