Overview
- OpenAI’s GDPval tests models on 1,320 tasks spanning 44 occupations drawn from the nine industries that contribute most to U.S. GDP.
- In blind reviews by experienced professionals, GPT-5-high's output was ranked better than or on par with expert work 40.6% of the time, while Claude Opus 4.1 reached about 49%.
- OpenAI reports Claude excelled at aesthetics and formatting, whereas GPT-5 showed strengths in accuracy and domain-specific knowledge.
- The study compared multiple frontier models, including GPT-4o, o4-mini, o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4.
- OpenAI claims models can complete these tasks roughly 100x faster and cheaper than human experts, counting only inference time and cost.
- OpenAI cautions that GDPval covers only one-off, file-based tasks; it has released an experimental autograder and plans to expand the benchmark to interactive, context-rich work.