Overview
- In a new paper and blog post published on September 5, OpenAI argues that evaluation incentives, not just limitations of training, drive confidently false answers in large language models.
- Proposed reforms would penalize confident errors and give partial credit for appropriate expressions of uncertainty, with a call to update widely used leaderboards that currently reward guessing (a scoring sketch follows this list).
- The research includes examples of a widely used chatbot inventing multiple dissertation titles and birthdays for Adam Tauman Kalai, one of the paper’s authors, to illustrate the problem.
- On the SimpleQA test, the older o4-mini showed higher nominal accuracy than GPT-5-thinking-mini but a 75% error rate, while GPT-5-thinking-mini recorded 22% accuracy with far fewer confident mistakes because it abstained much more often (the arithmetic is unpacked after this list).
- Nature’s review of new evaluations reports that, with web browsing enabled, GPT-5 hallucinated 0.8% of claims (1.4% offline) versus o3’s 5.1% (7.9% offline); offline, GPT-5 produced incorrect citations 39% of the time, about half GPT-4o’s rate, though OpenAI notes hallucinations persist and some questions are inherently unanswerable.
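To make the proposed scoring reform concrete, here is a minimal sketch of a threshold-based rubric along the lines OpenAI describes: correct answers earn 1 point, abstentions earn 0, and wrong answers are penalized t/(1-t) points for a stated confidence threshold t. The function name, signature, and default threshold are illustrative assumptions, not OpenAI’s implementation.

```python
def score_answer(answer, correct_answer, t=0.75):
    """Score one benchmark question under a confidence-threshold rubric.

    Illustrative rubric, in the spirit of the reforms described above:
    +1 for a correct answer, 0 for abstaining (answer is None), and
    -t/(1-t) for a wrong answer, so that answering only pays off when
    the model's confidence genuinely exceeds the threshold t.
    """
    if answer is None:            # the model said "I don't know"
        return 0.0
    if answer == correct_answer:  # exact-match grading, for illustration only
        return 1.0
    return -t / (1.0 - t)         # confident error is penalized


# With t = 0.75 a wrong answer costs -3 points, so blindly guessing on a
# four-option question has expected score 0.25*1 + 0.75*(-3) = -2, strictly
# worse than the 0 earned by abstaining. Binary pass/fail grading, by
# contrast, makes guessing the dominant strategy.
print(score_answer(None, "Paris"))     # 0.0  (abstained)
print(score_answer("Paris", "Paris"))  # 1.0  (correct)
print(score_answer("Lyon", "Paris"))   # -3.0 (confident error)
```

Under such a rubric, answering is only worthwhile when the model’s probability of being correct exceeds t, which is precisely the incentive the paper argues current binary-scored leaderboards fail to create.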
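The SimpleQA comparison above is easier to parse once you note that every question ends in exactly one of three outcomes, so the three rates are bound by a simple identity:

$$\text{accuracy} + \text{error rate} + \text{abstention rate} = 100\%$$

On the reported figures, o4-mini’s 75% error rate plus accuracy above 22% leaves under 3% of questions abstained, while GPT-5-thinking-mini’s much lower error rate implies that most of its non-correct responses were abstentions rather than confident wrong answers.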