Particle.news


OpenAI Releases GDPval Work Benchmark as Claude Leads Early Results

OpenAI frames the 44-occupation test as evidence to guide workplace use, noting its narrow scope and reliance on human-provided materials.

Overview

  • GDPval evaluates 220 realistic tasks developed with industry professionals, using blind reviews by expert graders to compare AI outputs against human-produced examples.
  • Anthropic’s Claude Opus 4.1 posted the highest win or tie rate at 47.6%, GPT-5-high scored 38.8%, and GPT-4o trailed at 12.4% in head-to-head grading against human deliverables.
  • Performance varied by role: models posted strong results on software development and clerical tasks but weak results for industrial engineers, pharmacists, video editors, and audio/video technicians.
  • Task-level highlights included Claude’s 81% win or tie rate for counter and rental clerks and 76% for shipping and inventory clerks, versus 17% for industrial engineers and film/video editors and 2% for audio/video technicians.
  • OpenAI says models can complete these tasks roughly 100 times faster and cheaper than experts, while cautioning that the benchmark is small, focused on computer-based work, and dependent on background materials prepared by humans. Outside coverage urges careful interpretation, citing hallucinations and mixed real-world ROI.