OpenAI has partnered with crypto investment firm Paradigm to launch EVMbench, a new benchmark that tests how well AI agents can secure Ethereum‑style smart contracts. The system evaluates models on real vulnerabilities, aiming to turn AI into a serious tool for auditing DeFi code that now guards more than $100 billion in assets.
How EVMbench Tests AI on Real-World Smart Contract Bugs
EVMbench uses 120 high‑severity vulnerabilities collected from 40 professional audits and security reviews, including cases from Paradigm’s Tempo blockchain. Rather than toy puzzles, each task mirrors a real finding from audit contests like Code4rena or from internal reviews of production contracts.
The benchmark runs inside a sandboxed EVM environment, so agents interact with live bytecode without touching mainnet funds. It uses three modes: Detect, where the model audits code and is scored on how many known bugs it finds; Patch, where it proposes fixes without breaking contract logic; and Exploit, where it chains attacks to drain funds in a controlled local setup.
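To make the three modes concrete, here is a minimal Python sketch of what an EVMbench‑style task and scorer could look like. The names, fields, and scoring rules (Task, score_detect, and so on) are illustrative assumptions based on the article's description, not OpenAI's actual schema.

```python
# Hypothetical sketch of an EVMbench-style task harness. Every name and
# scoring rule here is an assumption drawn from the article's description,
# not OpenAI's published schema.
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DETECT = "detect"    # audit code, find known bugs
    PATCH = "patch"      # propose a fix without breaking logic
    EXPLOIT = "exploit"  # chain attacks to drain funds in a local sandbox


@dataclass
class Task:
    contract_source: str     # Solidity source under audit
    known_bug_ids: set[str]  # ground-truth vulnerabilities from the audit
    mode: Mode


def score_detect(reported: set[str], task: Task) -> float:
    """Recall: fraction of the known bugs the agent actually reported."""
    if not task.known_bug_ids:
        return 1.0
    return len(reported & task.known_bug_ids) / len(task.known_bug_ids)


def score_exploit(funds_drained: bool, sandbox_only: bool) -> bool:
    """Exploit mode passes only if funds are drained inside the sandbox."""
    return funds_drained and sandbox_only


# Example: an agent that reports 2 of 3 seeded bugs scores ~0.67 recall.
task = Task("contract Vault { }", {"reentrancy", "overflow", "oracle"}, Mode.DETECT)
print(score_detect({"reentrancy", "overflow"}, task))  # 0.666...
```

A recall‑style metric like this would explain why models can ace Exploit mode (one working attack chain suffices) while still scoring poorly on Detect, which demands coverage of every seeded bug.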
Based on preliminary findings, OpenAI’s GPT-5.3-Codex achieved a success rate of approximately 72% in exploit mode, compared to approximately 32% for a previous GPT-5.0 baseline. However, no model reliably identified every issue or produced secure fixes across the board, exposing a gap between offensive capability and defensive coverage.
Why OpenAI and Paradigm Built the Benchmark Now
OpenAI and Paradigm say they built EVMbench in response to recent DeFi attacks such as the Moonwell and CrossCurve incidents, which contributed to more than $86 million in hack losses in January 2026. Research from Anthropic and others indicating that AI can lower the cost and effort of planning attacks raises the stakes for better defensive tooling.
By standardizing how teams evaluate AI agents on smart contract security, EVMbench gives developers and auditors a common yardstick instead of scattered private tests. OpenAI has also committed significant API credits to support defensive uses of its models, especially for open‑source projects and critical infrastructure.
For DeFi teams, EVMbench offers a way to check whether AI assistants can actually find and fix the same bugs human auditors care about. Projects can run models through the benchmark before trusting them with production audits or integrating them into continuous monitoring pipelines.
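As a rough illustration of that gating step, a team could wrap benchmark results in a simple trust check before wiring a model into production audits or a monitoring pipeline. The metric names and thresholds below are hypothetical choices a team might make, not part of EVMbench itself.

```python
# Hypothetical CI gate: only enable an AI auditor for production reviews
# if its benchmark pass rates clear team-set thresholds. Metric names and
# threshold values are illustrative, not defined by EVMbench.
def should_trust_model(results: dict[str, float],
                       min_detect: float = 0.80,
                       min_patch: float = 0.70) -> bool:
    return (results.get("detect", 0.0) >= min_detect
            and results.get("patch", 0.0) >= min_patch)


# Example: strong at exploits but weak at patching -> keep humans in the loop.
print(should_trust_model({"detect": 0.85, "patch": 0.55, "exploit": 0.72}))  # False
```

Gating on Detect and Patch rather than Exploit reflects the gap the preliminary results suggest: a model that can attack well but patch poorly is not yet a safe substitute for a human auditor.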