Overview
- EVMbench evaluates AI agents across three distinct modes: detecting vulnerabilities, exploiting them under controlled conditions, and patching them safely without breaking contract functionality.
- The benchmark draws on 120 disclosed vulnerabilities collected from roughly 40 audits and public competitions such as Code4rena, with additional cases from Stripe’s Tempo reviews.
- In initial exploit-mode tests, GPT-5.3-Codex scored 72.2% compared with 31.9% for GPT-5, while performance on detection and patching remained notably weaker.
- Exploitation attempts run in a sandboxed environment with deterministic transaction replay, and only previously disclosed issues are included to avoid live-network risk.
- OpenAI underscores that the test cannot fully capture real-world complexity and frames the release as support for defensive auditing as smart contracts secure over $100 billion in crypto assets.