Overview
- New research from OpenAI argues that hallucinations stem from next‑token prediction pretraining and from accuracy‑only evaluations that reward guessing over expressing uncertainty.
- OpenAI proposes updating widely used leaderboards to penalize confident incorrect answers and to award neutral credit for appropriate “I don’t know” responses (see the scoring sketch after this list).
- Company data indicate GPT-5 abstains more and makes fewer confident errors than some prior models, with one test showing 52% abstention and 26% error versus o4-mini’s 1% abstention and 75% error.
- Performance varies by setting: browsing access reduced long‑form hallucinations and cut fake citations, yet one offline evaluation still found roughly 39% of citations were incorrect.
- Independent assessments are mixed, and experts caution that complete elimination is unlikely, with some benchmarks slightly favoring Google’s Gemini 2.0 and users continuing to report factual mistakes.
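
As a rough illustration of the proposed leaderboard change, the sketch below scores a single benchmark item under a confidence‑aware rubric. The specific point values, the `wrong_penalty` parameter, and the abstention phrases are assumptions for illustration, not OpenAI’s published rubric.

```python
def score_answer(answer: str, correct: bool, wrong_penalty: float = 1.0) -> float:
    """Score one benchmark item under an assumed confidence-aware rubric:
    +1 for a correct answer, 0 for an explicit abstention, and a negative
    penalty for a confident wrong answer (values are illustrative only)."""
    abstentions = {"i don't know", "unknown", "cannot answer"}
    if answer.strip().lower() in abstentions:
        return 0.0  # neutral credit for appropriately abstaining
    return 1.0 if correct else -wrong_penalty  # penalize confident errors


# Example: a confident wrong answer now scores worse than abstaining.
print(score_answer("Paris", correct=True))        # 1.0
print(score_answer("I don't know", correct=False))  # 0.0
print(score_answer("Lyon", correct=False))        # -1.0
```

Under an accuracy‑only metric, the wrong answer and the abstention would both score 0, so guessing is never worse than admitting uncertainty; the penalty term is what changes that incentive.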