Onchain Execution Benchmark V0.1

Announcements
Jul 3, 2025

Introducing the Onchain Execution Benchmark (OCE)

Over the past year, several crypto agents have emerged with basic functionality, promising to translate natural language commands into onchain transactions. Yet until now, the community has lacked a standard way to evaluate whether these agents actually deliver on that promise.

The Onchain Execution Benchmark (OCE) fills that gap. Instead of evaluating outputs as text-based responses, OCE measures something more concrete: do the transactions initiated by the agent produce the exact onchain state change the user requested when replayed in a forked mainnet environment?

Since most LLMs are not built for execution, each model is wired to a forked Ethereum node through a JSON schema tool interface (e.g., send_transaction, call_contract). When the model emits one of these tool calls, the harness signs and broadcasts a real transaction on the fork, then immediately inspects the resulting state. This allows a chatbot-style LLM to execute and validate live transactions inside a sandboxed testnet.
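For illustration, a tool interface of this kind might be declared roughly as follows, written here as OpenAI-style function-calling schemas in a Python list. Only the tool names send_transaction and call_contract come from the benchmark description; every field name and parameter below is an assumption made for the sake of the sketch, not the benchmark's actual definitions.

```python
# Hypothetical tool schemas handed to the model. Field names and parameters are
# illustrative assumptions; only the tool names come from the benchmark description.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "send_transaction",
            "description": "Sign and broadcast a transaction on the forked Ethereum node.",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "string", "description": "Recipient or contract address"},
                    "value": {"type": "string", "description": "ETH value in wei, as a decimal string"},
                    "data": {"type": "string", "description": "ABI-encoded calldata, 0x-prefixed"},
                },
                "required": ["to"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "call_contract",
            "description": "Perform a read-only eth_call against the fork and return the result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "string", "description": "Contract address"},
                    "data": {"type": "string", "description": "ABI-encoded calldata, 0x-prefixed"},
                },
                "required": ["to", "data"],
            },
        },
    },
]
```

When the model emits a send_transaction call, the harness takes over signing and broadcasting, so the model only ever reasons about the call arguments, never about keys.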

[Figure: OCE.png]

Key Results & Takeaways

  1. Accuracy ceiling is still low – The best agent passes only ~25 % of tasks; fully error-free onchain autonomy still has a long way to go.
  2. Close top tier – o4-mini and o3 sit within one percentage point of each other; prompt improvements matter as much as model size.
  3. Middle of the pack – Claude clears ~17 % but struggles with multi-step DeFi flows and ENS commit/register steps.
  4. Reliability gap – DeepSeek handles basic wraps and transfers yet fails most swap scenarios, often returning empty TX bundles.

Overall summary: multi-step DeFi interactions remain the dominant failure point across all evaluated agents, leaving plenty of headroom for research on planning, action control, and gas-aware execution.

What does the dataset look like?

The initial release (v0.1) ships a compact but carefully chosen suite of 70 real-world tasks drawn from the contracts that dominate Ethereum gas charts. Here are some example tasks:

| Theme | Examples taken straight from the dataset | Why it matters |
| --- | --- | --- |
| Wrapping & unwrapping | “Wrap 1 ETH to WETH”, “Unwrap 1 WETH to ETH” | First step for almost every DeFi action |
| DEX swaps | “Swap 1 WETH → USDC on Uniswap V3 with ≤5 % slippage”, “Swap 1 ETH on Uniswap V2” | Tests routing, allowance handling, slippage protection |
| Staking | “Stake 0.5 ETH to Lido”, “Redeem 0.5 wstETH back to ETH” | Requires contract approvals and receipt tokens |
| Simple transfers | “Send 1 ETH to 0xAd4C…216” | Baseline sanity check |
| Composite flows | Multi-step recipes such as “Stake, supply to Aave, borrow stablecoins, swap on Uniswap” | Mirrors users’ actual daily actions |
| ENS registrations | “Register caiba-ai.eth for one year” | Touches oracles, time-based checks, and NFT issuance |

Every task comes with the following (a hypothetical entry is sketched after this list):

  • A natural-language goal (the “task”) that the agent reads and executes.
  • A strict acceptance test describing the token balances, contract calls, tool uses, or NFT ownership that must hold after execution.
  • A stable mainnet snapshot (block 22,636,495) so that prices, liquidity, and contract states are frozen in time.
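To make that format concrete, a single entry might be laid out roughly like this. The prompt, the pinned block number, and the idea of balance-based acceptance checks come from the descriptions above; the field names and check types are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical task entry; field names and check types are illustrative,
# not the repository's actual schema.
TASK_WRAP_ETH = {
    "id": "wrap_eth_01",
    "prompt": "Wrap 1 ETH to WETH",
    "fork_block": 22_636_495,  # the pinned mainnet snapshot
    "acceptance": [
        {   # WETH balance of the agent's account must increase by exactly 1 WETH
            "type": "erc20_balance_delta",
            "token": "0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2",  # mainnet WETH
            "account": "AGENT_ADDRESS",
            "expected_delta": 10**18,
        },
        {   # ETH balance must drop by at least 1 ETH (wrapped value plus gas)
            "type": "eth_balance_delta_max",
            "account": "AGENT_ADDRESS",
            "expected_max_delta": -(10**18),
        },
    ],
}
```

A checker then only needs to snapshot balances before replay, apply the agent's transactions, and compare the deltas against the acceptance list.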

How the evaluation works — and how you can replicate it yourself

OCE is designed with reproducibility as the top priority. Here’s the flow:

  1. Spin up a deterministic sandbox

    The benchmark script launches Foundry’s Anvil in fork mode, pinning it to the exact block height specified in each task. Gas prices, pool reserves, even pending ENS auctions are identical for every participant and as close as possible to real onchain conditions.

  2. Agent proposes a transaction bundle

    Your agent reads the task and returns a JSON list of unsigned transactions. It can include approvals, multicalls, or anything else a normal wallet could broadcast.

  3. Automatic signing & replay

    OCE signs the bundle with a funded test key, replays it against the fork, and records gas usage plus success or revert traces.

  4. Checkpoint against ground truth

    Each shipped task is checked by an automated oracle against post-transaction balances, debt positions, liquidity tokens, time-to-expiry, and so on. The oracle awards 10 points if everything is correct and zero if there is a single mistake anywhere in the process.

  5. Reset and repeat

    Before the next task begins, the fork is reset. That isolation eliminates cross-test contamination and makes runs fully reproducible.

Here is an example to demonstrate the entire flow:

[Figure: OCE_flow.png]

Because the fork URL, block number, task JSON, and checker scripts are all version-controlled, anyone can clone the repository and run python run_evaluation.py to verify your claims.
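For readers who want a mental model before opening the repository, here is a minimal sketch of how such a fork-and-replay loop could be wired together with Foundry's Anvil and web3.py. Every name here (start_fork, run_task, AGENT_KEY, the plan and check callables) is a placeholder of ours rather than the benchmark's actual API; only the pinned block number and the overall sign, replay, check, reset flow come from the steps above.

```python
"""Minimal sketch of an OCE-style fork-and-replay loop, assuming Foundry's Anvil
and web3.py. All names (start_fork, run_task, AGENT_KEY, ...) are placeholders,
not the benchmark's actual API."""
import subprocess
import time
from typing import Callable

from web3 import Web3

FORK_URL = "https://eth-mainnet.example.rpc"  # placeholder mainnet RPC endpoint
FORK_BLOCK = 22_636_495                       # snapshot block pinned by the benchmark
AGENT_KEY = "0x" + "11" * 32                  # placeholder for the funded test key


def start_fork() -> subprocess.Popen:
    """Launch Anvil in fork mode, pinned to the benchmark's block height."""
    proc = subprocess.Popen(
        ["anvil", "--fork-url", FORK_URL, "--fork-block-number", str(FORK_BLOCK)]
    )
    time.sleep(3)  # crude wait for the node to come up
    return proc


def run_task(w3: Web3, task: dict,
             plan: Callable[[str], list[dict]],
             check: Callable[[Web3, dict, str], bool]) -> bool:
    """Ask the agent (`plan`) for a bundle, replay it on the fork, score with `check`."""
    account = w3.eth.account.from_key(AGENT_KEY)
    snapshot = w3.provider.make_request("evm_snapshot", [])["result"]

    # The agent reads the natural-language goal and returns unsigned transactions.
    for tx in plan(task["prompt"]):
        tx.setdefault("from", account.address)
        tx.setdefault("value", 0)
        tx.setdefault("data", "0x")
        tx.setdefault("nonce", w3.eth.get_transaction_count(account.address))
        tx.setdefault("gas", 500_000)
        tx.setdefault("gasPrice", w3.eth.gas_price)
        tx.setdefault("chainId", w3.eth.chain_id)
        signed = account.sign_transaction(tx)
        # Older web3.py / eth-account versions expose `signed.rawTransaction` instead.
        tx_hash = w3.eth.send_raw_transaction(signed.raw_transaction)
        receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
        if receipt.status != 1:  # any revert means the task scores zero
            break

    # All-or-nothing check of the post-transaction state against ground truth.
    passed = check(w3, task, account.address)

    # Reset the fork so the next task starts from the same pinned snapshot.
    w3.provider.make_request("evm_revert", [snapshot])
    return passed
```

Using evm_snapshot / evm_revert for isolation is one plausible choice; restarting Anvil between tasks would give the same isolation at the cost of speed.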

Where the benchmark still falls short

  • EVM environment only. Solana and other non-EVM chains have not yet been incorporated.
  • DeFi-centric. NFTs, DAO votes, and cross-chain bridges are lightly represented or absent.
  • Gas is noted, not judged. Today the score reflects correctness only, not efficiency, so less gas-efficient solutions aren’t penalized.
  • Single account assumption. Multisigs, MPC wallets, and social recovery flows aren’t tested.
  • Static snapshot. Liquidity and oracle prices are frozen; live-routing logic may behave differently on today’s mainnet.
  • Model coverage is narrow. So far the benchmark has been run only against a handful of SOTA agents. That sampling does not capture the full diversity of approaches and tooling across the industry.

OCE is not perfect yet, but it’s a start. Together with our partners in CAIBA, we are actively tackling the shortcomings above and will be rolling out improved versions of the benchmark in the months ahead. Stay tuned!

Check out our public GitHub repository: https://github.com/cyberconnecthq/oce_benchmark