Particle.news

OpenAI Unveils GDPval Benchmark Showing GPT-5 and Claude Near Expert Quality

The evaluation measures 1,320 realistic occupation tasks across top U.S. industries to gauge economic value in real work.

Overview

  • OpenAI’s GDPval tests models on 1,320 tasks spanning 44 occupations drawn from the nine industries that contribute most to U.S. GDP.
  • In blind reviews, experienced professionals rated GPT-5-high's output as better than or on par with expert work 40.6% of the time, while Claude Opus 4.1 reached about 49%.
  • OpenAI reports Claude excelled at aesthetics and formatting, whereas GPT-5 showed strengths in accuracy and domain-specific knowledge.
  • The study compared multiple frontier models, including GPT-4o, o4-mini, o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4.
  • OpenAI claims models can complete tasks roughly 100x faster and cheaper in inference-only terms, but cautions that GDPval covers one-off, file-based tasks. It has released an experimental autograder and plans to expand the benchmark to interactive, context-rich work.