Particle.news

OpenAI’s GDPval Benchmark Finds Claude and GPT-5 Near Expert Quality on Real-World Tasks

OpenAI frames GDPval as an early step toward broader, context-aware evaluations.

Overview

  • GDPval tests 1,320 realistic, file-based assignments spanning 44 occupations in the nine industries that contribute most to U.S. GDP, with tasks authored by professionals averaging 14 years of experience.
  • Blind human graders evaluated outputs from multiple models and human experts, with an AI autograder released as an experimental research tool to supplement—not replace—human scoring.
  • TechCrunch reports that GPT-5-high matched or beat experts on 40.6% of tasks and Claude Opus 4.1 on 49%, with OpenAI noting Claude’s edge in aesthetics and GPT-5’s strength in accuracy.
  • OpenAI says models completed tasks roughly 100× faster and 100× cheaper than experts, though those figures cover only inference time and billing, excluding the costs of oversight, iteration, and integration.
  • Performance has advanced sharply since GPT-4o, and OpenAI plans to release subsets of GDPval tasks and expand the benchmark to interactive, context-rich workflows, while cautioning against reading the results as evidence of imminent, broad job replacement.