Overview
- GDPval evaluates 220 realistic tasks developed with industry professionals and uses blind reviews by expert graders to compare AI outputs against human-created deliverables.
- Anthropic’s Claude Opus 4.1 posted the highest win-or-tie rate at 47.6%, GPT-5-high scored 38.8%, and GPT-4o trailed at 12.4% in head-to-head grading against human deliverables (the metric is sketched after this list).
- Performance varied by role, with strong results on software-development and clerical tasks but weak results on tasks for industrial engineers, pharmacists, video editors, and audio/video technicians.
- Task-level highlights included Claude’s 81% win-or-tie rate for counter and rental clerks and 76% for shipping and inventory clerks, versus 17% for industrial engineers and film/video editors and just 2% for audio/video technicians.
- OpenAI says models can complete these tasks roughly 100 times faster and more cheaply than experts, while cautioning that the benchmark is small, limited to computer-based work, and relies on human-prepared background materials. Outside coverage urges careful interpretation, citing hallucinations and mixed real-world ROI.
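
The headline metric is a simple proportion: the share of blind head-to-head gradings in which the model's deliverable beat or tied the human expert's. A minimal sketch of that calculation, assuming each grading is reduced to a win/tie/loss label (the function name and data shape here are illustrative, not GDPval's actual record format):

```python
from collections import Counter

def win_or_tie_rate(judgments):
    """Fraction of blind pairwise gradings where the model's
    deliverable won or tied against the human expert's.

    `judgments` is a list of "win", "tie", or "loss" strings,
    one per graded task (a hypothetical data shape; GDPval's
    real grading records are richer).
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    return (counts["win"] + counts["tie"]) / total if total else 0.0

# Illustrative only: invented labels, not GDPval data.
sample = ["win"] * 30 + ["tie"] * 18 + ["loss"] * 52
print(f"win-or-tie rate: {win_or_tie_rate(sample):.1%}")  # 48.0%
```

Counting ties alongside wins is why the metric reads as "win or tie": a model whose output graders judge indistinguishable from the expert's still scores on that task.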