Particle.news

OpenAI and Paradigm Launch EVMbench to Test AI on Real-World Smart-Contract Flaws

OpenAI positions the test as a step toward defensive standards for high‑stakes smart‑contract security.

Overview

  • EVMbench evaluates AI agents in three distinct modes: detecting bugs, exploiting them under controlled conditions, and patching them safely without breaking contract functionality.
  • The benchmark draws on 120 disclosed vulnerabilities collected from roughly 40 audits and public competitions such as Code4rena, with additional cases from Stripe’s Tempo reviews.
  • In initial exploit-mode tests, GPT-5.3-Codex scored 72.2% compared with 31.9% for GPT-5, while both models performed notably worse on detection and patching.
  • Exploitation attempts run in a sandboxed environment with deterministic transaction replay, and only previously disclosed issues are included to avoid live-network risk.
  • OpenAI underscores that the benchmark cannot fully capture real-world complexity and frames the release as support for defensive auditing, given that smart contracts secure over $100 billion in crypto assets.