Overview
- GDPval evaluates 220 realistic tasks developed with industry professionals and uses blind reviews by expert graders to compare AI outputs against human-created deliverables.
- Anthropic’s Claude Opus 4.1 posted the highest win-or-tie rate at 47.6%, GPT-5-high scored 38.8%, and GPT-4o trailed at 12.4% in head-to-head grading against human deliverables (the metric is sketched after this list).
- Performance varied by role, with strong results on software-development and clerical tasks but weak results on tasks for industrial engineers, pharmacists, video editors, and audio/video technicians.
- Task-level highlights included Claude’s 81% win-or-tie rate for counter and rental clerks and 76% for shipping and inventory clerks, versus 17% for industrial engineers and film/video editors and just 2% for audio/video technicians.
- OpenAI says models can complete these tasks roughly 100 times faster and more cheaply than experts, while cautioning that the benchmark is small, limited to computer-based work, and relies on human-prepared background materials. Outside coverage urges careful interpretation, citing hallucinations and mixed real-world ROI.
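
The headline metric is a simple proportion: the share of blind head-to-head gradings in which the model's deliverable beat or tied the human expert's. A minimal sketch of that calculation, assuming each grading is reduced to a win/tie/loss label (the function name and data shape here are illustrative, not GDPval's actual record format):

```python
from collections import Counter

def win_or_tie_rate(judgments):
    """Fraction of blind pairwise gradings where the model's
    deliverable won or tied against the human expert's.

    `judgments` is a list of "win", "tie", or "loss" strings,
    one per graded task (a hypothetical data shape; GDPval's
    real grading records are richer).
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    return (counts["win"] + counts["tie"]) / total if total else 0.0

# Illustrative only: invented labels, not GDPval data.
sample = ["win"] * 30 + ["tie"] * 18 + ["loss"] * 52
print(f"win-or-tie rate: {win_or_tie_rate(sample):.1%}")  # 48.0%
```

Counting ties alongside wins is why the metric reads as "win or tie": a model whose output graders judge indistinguishable from the expert's still scores on that task.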