Over the past year, several crypto agents have emerged with basic functionality, promising to translate natural language commands into onchain transactions. Yet until now, the community has lacked a standard way to evaluate whether these agents actually deliver on that promise.
The Onchain Execution Benchmark (OCE) fills that gap. Instead of evaluating outputs as text-based responses, OCE measures something more concrete: do the transactions initiated by the agent produce the exact onchain state change the user requested when replayed in a forked mainnet environment?
Since most LLMs are not built for execution, each model is wired to a forked Ethereum node through a JSON-schema tool interface (e.g., send_transaction, call_contract). When the model emits one of these tool calls, the harness signs and broadcasts a real transaction on the fork, then immediately inspects the resulting state. This lets a chatbot-style LLM execute and validate live transactions inside a sandboxed mainnet fork.
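For concreteness, a tool definition might look like the minimal sketch below. It follows the common JSON-schema tool-calling convention; the exact field names in OCE's harness may differ.

```python
# A minimal, hypothetical tool definition in the common JSON-schema
# tool-calling format. OCE's actual schema may name fields differently.
SEND_TRANSACTION_TOOL = {
    "name": "send_transaction",
    "description": "Sign and broadcast a transaction on the forked Ethereum node.",
    "parameters": {
        "type": "object",
        "properties": {
            "to":    {"type": "string", "description": "Recipient or contract address"},
            "value": {"type": "string", "description": "ETH value in wei, as a decimal string"},
            "data":  {"type": "string", "description": "Hex calldata; '0x' for plain transfers"},
        },
        "required": ["to", "value", "data"],
    },
}
```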
Overall summary: multi-step DeFi interactions remain the dominant failure point across all open agents, leaving plenty of headroom for research on planning, action control, and gas-aware execution.
The initial release (v0.1) ships a compact but carefully chosen suite of 70 real-world tasks drawn from the contracts that dominate Ethereum gas charts. Here are some example tasks:
| Theme | Examples taken straight from the dataset | Why it matters |
|---|---|---|
| Wrapping & unwrapping | “Wrap 1 ETH to WETH”, “Unwrap 1 WETH to ETH” | First step for almost every DeFi action |
| DEX swaps | “Swap 1 WETH → USDC on Uniswap V3 with ≤5% slippage”, “Swap 1 ETH on Uniswap V2” | Tests routing, allowance handling, slippage protection |
| Staking | “Stake 0.5 ETH to Lido”, “Redeem 0.5 wstETH back to ETH” | Requires contract approvals and receipt tokens |
| Simple transfers | “Send 1 ETH to 0xAd4C…216” | Baseline sanity check |
| Composite flows | Multi-step recipes such as “Stake, supply to Aave, borrow stablecoins, swap on Uniswap” | Mirrors users’ actual daily actions |
| ENS registrations | “Register caiba-ai.eth for one year” | Touches oracles, time-based checks, and NFT issuance |
Every task comes with:

- a natural-language prompt describing the desired outcome,
- a pinned mainnet block number to fork from,
- a ground-truth checker script that inspects the resulting onchain state.
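Concretely, a task definition might look like the sketch below. The field names are illustrative, not the repository's actual schema.

```python
# Illustrative task definition; the real schema in the OCE repo may differ.
task = {
    "id": "wrap_eth_001",
    "prompt": "Wrap 1 ETH to WETH",
    "fork_block": 19_000_000,           # hypothetical pinned block height
    "checker": "checkers/wrap_eth.py",  # asserts post-transaction balances
}
```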
OCE is designed with replicability as its top priority. Here’s the flow:
Spin up a deterministic sandbox
The benchmark script launches Foundry’s Anvil in fork mode, pinned to the exact block height specified in each task. Gas prices, pool reserves, even pending ENS auctions are identical for every participant and match real onchain state as closely as possible.
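Spinning up the fork could look like this (the RPC endpoint and block number below are placeholders; `--fork-url` and `--fork-block-number` are standard Anvil flags):

```python
import subprocess

# Launch Foundry's Anvil in fork mode, pinned to the task's block height.
anvil = subprocess.Popen([
    "anvil",
    "--fork-url", "https://eth-mainnet.example/rpc",  # your archive RPC endpoint
    "--fork-block-number", "19000000",                # the task's pinned height
])
```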
[Figure: complete process visualization]

Agent proposes a transaction bundle
Your agent reads the task and returns a JSON list of unsigned transactions. The bundle can include approvals, multicalls, or anything else a normal wallet could broadcast.
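A bundle for the Uniswap V3 swap task might look roughly like the following. The format, function selectors, and router address are shown for illustration; the full calldata is elided.

```python
# A hypothetical bundle for "Swap 1 WETH → USDC on Uniswap V3":
# first an ERC-20 approval, then the swap itself.
bundle = [
    {
        "to": "0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2",  # mainnet WETH
        "value": "0",
        "data": "0x095ea7b3...",  # approve(router, 1 WETH); calldata elided
    },
    {
        "to": "0xE592427A0AEce92De3Edee1F18E0157C05861564",  # Uniswap V3 SwapRouter
        "value": "0",
        "data": "0x414bf389...",  # exactInputSingle(...); calldata elided
    },
]
```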
Automatic signing & replay
OCE signs the bundle with a funded test key, replays it against the fork, and records gas usage plus success or revert traces.
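A minimal web3.py sketch of this step, reusing the bundle above and one of Anvil's default pre-funded dev keys:

```python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))  # the local Anvil fork
acct = w3.eth.account.from_key(
    # Anvil's first default dev key: publicly known and funded on the fork
    "0xac0974bec39a17e36ba4a6b4d238ff944bacb478cbed5efcae784d7bf4f2ff80"
)

for tx in bundle:  # the unsigned bundle sketched above
    signed = acct.sign_transaction({
        "to": Web3.to_checksum_address(tx["to"]),
        "value": int(tx["value"]),
        "data": tx["data"],
        "nonce": w3.eth.get_transaction_count(acct.address),
        "gas": 500_000,                  # generous cap; reverts are still recorded
        "gasPrice": w3.eth.gas_price,
        "chainId": w3.eth.chain_id,
    })
    # .rawTransaction on older web3 versions
    tx_hash = w3.eth.send_raw_transaction(signed.raw_transaction)
    receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
    print(receipt["status"], receipt["gasUsed"])  # status 1 = success, 0 = revert
```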
Checkpoint against ground truth
Each task ships with a checker that inspects post-transaction state: balances, debt positions, liquidity tokens, time-to-expiry, and so on. The oracle awards 10 points if everything is correct and zero if any step in the process is wrong; scoring is all-or-nothing.
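For the wrapping task, for example, a checker might boil down to a single balance assertion. This is a sketch, not the repository's actual checker; it continues the `w3` and `acct` objects from the snippet above.

```python
# Hypothetical checker for "Wrap 1 ETH to WETH": the test account must end
# up with exactly 1 WETH more than it started with. Minimal ERC-20 ABI only.
ERC20_ABI = [{
    "name": "balanceOf", "type": "function", "stateMutability": "view",
    "inputs": [{"name": "owner", "type": "address"}],
    "outputs": [{"name": "", "type": "uint256"}],
}]
weth = w3.eth.contract(
    address="0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2",  # mainnet WETH
    abi=ERC20_ABI,
)

# balance_before is recorded before the bundle is replayed
delta = weth.functions.balanceOf(acct.address).call() - balance_before
score = 10 if delta == 10**18 else 0  # all-or-nothing: 1 WETH has 18 decimals
```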
Reset and repeat
Before the next task begins, the fork is reset. That isolation eliminates cross-test contamination and makes runs fully reproducible.
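Resetting can be done with Anvil's `anvil_reset` RPC method (it mirrors `hardhat_reset`); the endpoint and block number below are placeholders:

```python
# Rewind the fork to the next task's pinned block; wipes all prior state.
w3.provider.make_request("anvil_reset", [{
    "forking": {
        "jsonRpcUrl": "https://eth-mainnet.example/rpc",
        "blockNumber": 19000001,  # the next task's pinned height (placeholder)
    }
}])
```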
Here is an example to demonstrate the entire flow:
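The sketch below ties the pieces together; `start_fork`, `replay`, `run_checker`, and `reset_fork` are hypothetical wrappers around the snippets shown above, not the repository's actual API.

```python
def evaluate(agent, tasks):
    """Run every task in isolation and return the total score."""
    total = 0
    for task in tasks:
        start_fork(task["fork_block"])          # 1. deterministic sandbox
        bundle = agent.propose(task["prompt"])  # 2. unsigned transaction bundle
        replay(bundle)                          # 3. sign & replay on the fork
        total += run_checker(task["checker"])   # 4. 10 points or 0
        reset_fork(task["fork_block"])          # 5. clean slate for the next task
    return total
```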
Because the fork URL, block number, task JSON, and checker scripts are all version-controlled, anyone can clone the repository and run `python run_evaluation.py` to verify the reported results.
OCE is not perfect yet, but it’s a start. Together with our partners at CAIBA, we are actively tackling its shortcomings and will roll out improved versions of the benchmark in the months ahead. Stay tuned!
Check out our public GitHub repo: https://github.com/cyberconnecthq/oce_benchmark