Following the release of CAIA v0.1, our first benchmark purpose-built for evaluating crypto-native AI agents, we are releasing CAIA v0.2, a major update that increases both the breadth and difficulty of the benchmark.
CAIA (a benchmark for Crypto AI Agents) evaluates AI agents on tasks that reflect the daily responsibilities of an entry-level crypto analyst. These tasks require reasoning, contextual understanding, and domain-specific knowledge across three core areas: onchain analysis, tokenomics diagnostics, and project research.
Version 0.2 increases task coverage by 50 percent, growing from 40 to 60 tasks. It also adds a new focus area: trend analysis. This reflects how real users track narratives and identify early signals in a noisy market.
With expanded datasets and a new focus area, CAIA v0.2 provides a more complete view of how well AI agents perform in crypto's fast-moving, high-context environment.
Model | Overall % | Answer % | Reasoning % | Tool Use % | Summary |
---|---|---|---|---|---|
OpenAI o3 | 45.0 | 32.5 | 47.7 | 50.8 | Most complete agent — excels across categories, dominates onchain operations |
OpenAI o4 mini | 39.2 | 32.3 | 40.9 | 39.8 | Strong all-rounder — leads in tokenomics analysis, consistent performance |
claude-4-sonnet-no-thinking | 36.1 | 31.7 | 39.4 | 33.5 | Discovery specialist — best at project research, but tool usage remains unreliable |
OpenAI 4.1 | 27.2 | 24.9 | 28.7 | 26.3 | Solid reasoning — handles context well, but struggles with numeric answers |
DeepSeek r1 | 26.4 | 17.6 | 29.9 | 27.5 | Decent planner — reasonable task breakdowns, weak on execution |
DeepSeek v3 | 17.5 | 9.9 | 21.9 | 18.6 | Lags behind — underperforms overall with some spikes in research |
Room for improvement across the board: The best run tops out at 45 %, leaving a ~40-point gap to human junior-analyst level. Most of that gap is mundane — unit slips, execution misses, schema errors — not missing domain knowledge.
Suite champions are clear:
• o3 performs best across Onchain Analysis, Trend Analysis, and Overlapping tasks.
• o4 mini is the strongest at interpreting Tokenomics.
• Claude-4 dominates Project Discovery.
Tools dictate the ceiling: There's a strong correlation (r ≈ 0.8) between tool use and final answer quality. Simple improvements such as rate-limit retries, unit sanity checks, and block-range helpers would lift all models more than prompt tuning.
Reasoning ≠ delivery: DeepSeek models reason nearly as well as the GPT-4 series but fail on output. They often emit hex, raw ABIs, or incomplete fields. A lightweight output linter could recover double-digit performance gains.
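To make that concrete, here is a minimal sketch of the kind of output linter described above. The schema, field names, and decimals default are illustrative assumptions, not part of the CAIA harness.

```python
# Sketch of a post-hoc output linter that catches the failure modes above:
# hex-encoded integers, raw wei amounts, and missing required fields.
# REQUIRED_FIELDS and the decimals default are hypothetical, not CAIA's schema.

REQUIRED_FIELDS = {"answer", "units", "source"}

def lint_answer(raw: dict, token_decimals: int = 18) -> dict:
    """Normalize a model's structured answer before it is scored."""
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

    value = raw["answer"]

    # Normalize hex strings ("0x1a2b") to decimal integers.
    if isinstance(value, str) and value.lower().startswith("0x"):
        value = int(value, 16)

    # Unit sanity check: a raw wei amount is implausible as a human-readable
    # token figure, so scale it down by the token's decimals.
    if isinstance(value, int) and raw.get("units") == "wei":
        value = value / 10 ** token_decimals
        raw["units"] = "tokens"

    return {**raw, "answer": value}
```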
Category dynamics:
• Tokenomics has the widest Tool–Answer gap: Supply math and vesting cliffs often break without purpose-built calculators.
• Overlap tasks reveal schema brittleness: A single extra repo link or address variant can collapse an otherwise correct answer.
• Trend analysis tasks mostly fail due to rate limits: Adding exponential back-off would boost performance with minimal engineering lift.
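As a rough illustration of the back-off fix suggested above, a retry wrapper can be this small; the endpoint, retry budget, and delay schedule are placeholders, not values from the benchmark harness.

```python
import time
import requests

def fetch_with_backoff(url: str, params: dict | None = None,
                       max_retries: int = 5, base_delay: float = 1.0) -> dict:
    """Retry a rate-limited API call with exponential back-off.

    Generic sketch: retry count and delays are illustrative defaults.
    """
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code == 429:                 # rate limited: wait, then retry
            time.sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```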
Model | Best Performance | Areas for Improvement | Representative Task |
---|---|---|---|
o3 | Highest tool score (50.8 %) and excels at precise onchain look-ups. | Token-econ prose is thinner (9-pt gap to o4 mini). | “What’s the Uniswap V3 router contract address?” (task-id r4567890-…) — o3 scored 98 % with ENS-based RPC lookup; median of peers ≤ 65 %. |
o4 mini | Most balanced; wins Tokenomics suite (50 %). | Suffers on rate-limited DEX sweeps. | “What’s Pendle’s router contract address?” (p2345678-…) — only model to hit 90 %+. |
Claude-4 | #1 in Project Discovery (58 %); concise rhetoric. | Omits startBlock in many calls → partial data. | “List three interoperability competitors to Hyperlane …” (d0227345-…) — clocks 98 %; GPT family 70–75 %. |
OpenAI 4.1 | Reasoning still solid (28.7 %). | Arithmetic slips on LP/TVL math. | Same Hyperlane task — lands 97 % overall but drops 12 % on liquidity-pool TVL query. |
DeepSeek r1 | Reasoning chain ≥ 30 %. | Formats numbers in hex, drops decimals. | Timestamp lookup (0f1f8a27-…) — logic perfect; answer off by 1 sec → 0 % score. |
DeepSeek v3 | Research trivia occasionally shines. | Lowest Answer accuracy (9.9 %); tool time-outs. | “Donut seed raise & investors?” (m9012345-…) — DS v3 scored 83.6 %; others < 65 %. |
Suite | Winner | Δ to runner-up
---|---|---|
Onchain Analysis | o3 (41.5 %) | +7.8 pt over o4 mini |
Tokenomics Deep-Dive | o4 mini (50.0 %) | +8.1 pt over Claude-4 |
Project Discovery | Claude-4 (57.9 %) | +0.8 pt over o3 |
Trend Analysis | o3 (46.1 %) | +13.5 pt over o4 mini |
Overlap (mixed) | o3 (44.0 %) | +5.3 pt over Claude-4 |

Std-dev across models: total ± 10 %, answer ± 9 %, reasoning ± 9.5 %, tool use ± 11 %.
Tool use shows the widest spread across models, which suggests that tool competence is the largest differentiator.
Correlation: reasoning ↔ tool use = 0.78, the strongest link among the three evaluation dimensions, suggesting that tasks where agents reason well also tend to produce accurate, well-parameterized tool calls. Answer quality is also positively correlated with both reasoning and tool use, so improving either tends to improve the final output.
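These statistics are straightforward to reproduce with pandas, assuming a per-task results table with one row per (model, task) and 0–100 scores per dimension; the file name and column names below are assumptions, not the published artifact.

```python
import pandas as pd

# Assumed schema: columns "model", "task_id", "answer", "reasoning", "tool_use",
# each score on a 0-100 scale. File name is a placeholder.
scores = pd.read_csv("caia_v02_results.csv")

# Pearson correlation matrix across the three evaluation dimensions.
corr = scores[["answer", "reasoning", "tool_use"]].corr()
print(corr.round(2))

# Spread between models on each dimension (cf. the std-devs quoted above):
# per-model means, then the standard deviation of those means.
print(scores.groupby("model")[["answer", "reasoning", "tool_use"]]
            .mean()
            .std()
            .round(1))
```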
Archetype | Common Failure | Spreadsheet-backed exemplar (task-id) |
---|---|---|
Long-range time-series | Fetches daily data but forgets monthly aggregation. | Daily swap volume ETH/USDC (89bebeca-…) — best Answer 39 %, median 5 %. |
Unique-address counting | No tool call; Etherscan page-scrape times out. | “Count unique wallets interacting with Ethereum Cyber Bridge on 2025-04-01” (0059aec1-…) — every model Answer ≤ 4 %, Tool = 0 %. |
Proxy-contract metadata | Retrieves correct onchain data but mis-formats answer. | AVSDirectory proxy + GitHub repo (f653d8cb-…) — Tool avg 80 %, Answer avg 34 %. |
Deep dive: “Count unique wallet addresses that interacted (including internal transactions) with the Ethereum Cyber Bridge on UTC 2025-04-01.”
Tool use — every model logged 0 %: none issued an eth_getLogs or Dune query; they resorted to prose guessing.
Answer — Claude-4 scraped a random blog and still earned 4 %. The rest scored 0 %.
Root cause: The bridge contract emits internal calls; naïve Etherscan scraping doesn’t expose them. Agents need a block-range trace_filter or an archive-node internal-transaction query, which none attempted.
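For reference, the block-range trace_filter query the task calls for might look like the sketch below. The RPC endpoint, bridge address, and block range are placeholders (mapping the UTC day to a block range is a separate step), and the node must expose Erigon/OpenEthereum-style tracing.

```python
import requests

# Placeholders: substitute a real archive node with tracing enabled, the actual
# Cyber Bridge address, and the block range covering 2025-04-01 UTC.
RPC_URL = "https://example-archive-node.invalid"
BRIDGE = "0x0000000000000000000000000000000000000000"

def unique_counterparties(from_block: int, to_block: int) -> int:
    """Count distinct senders whose calls (including internal ones) hit the bridge."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "trace_filter",
        "params": [{
            "fromBlock": hex(from_block),
            "toBlock": hex(to_block),
            "toAddress": [BRIDGE],   # traces include internal calls into the bridge
        }],
    }
    traces = requests.post(RPC_URL, json=payload, timeout=60).json()["result"]
    senders = {t["action"]["from"] for t in traces if "from" in t.get("action", {})}
    return len(senders)
```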
Category | Avg Answer % | Avg Reasoning % | Avg Tool % | Tool – Answer (pt)
---|---|---|---|---|
Onchain Analysis | 31 | 34 | 40 | +9 |
Tokenomics | 34 | 41 | 37 | +3 |
Project Discovery | 44 | 51 | 46 | +2 |
Overlap (mixed) | 27 | 35 | 38 | +11 |
Trend Analysis | 30 | 36 | 44 | +14 |
Tool–Answer gaps are largest in Trend and Overlap. Agents know what the question is asking and which tools to use, but they fumble last-mile aggregation and execution.
For example, on the task “For the top 5 Pump.fun meme tokens by 7-day trading volume from 2025-05-05 to 2025-05-12, provide their contract address,” o3 “partially sourced contract addresses by searching Solscan pages but did not attempt to fetch or convert onchain creation timestamps as required.”
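The missing last-mile step is usually trivial once the daily data is in hand. Here is a sketch of the roll-up, assuming a per-token daily-volume table already fetched from an indexer; the schema and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical daily per-token volumes, as an indexer might return them.
daily = pd.DataFrame({
    "token":  ["A", "A", "B", "B", "C"],
    "date":   pd.to_datetime(["2025-05-05", "2025-05-06",
                              "2025-05-05", "2025-05-06", "2025-05-05"]),
    "volume": [1_200_000, 900_000, 2_500_000, 400_000, 700_000],
})

# Restrict to the 7-day window, sum per token, rank the top 5.
window = daily[(daily["date"] >= "2025-05-05") & (daily["date"] <= "2025-05-12")]
top5 = (window.groupby("token")["volume"]
              .sum()
              .sort_values(ascending=False)
              .head(5))
print(top5)
```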
Project Discovery is the easiest suite (mid-40s average) because it relies on open-web search plus summarization—minimal onchain calls needed.
Onchain Analysis underperforms mostly on stable-swap arithmetic: reasoning is weak (34 %), and the final calculations are even worse (31 %).
Task Expansion
CAIA v0.3 will grow the benchmark from 60 to 80 tasks. Expect new datasets proposed by Alliance members, with added coverage in tokenomics, trend analysis, and project discovery to ensure a more balanced distribution.
Scope Refinement
We will introduce more question types to better reflect real-world analyst workflows. The goal is to increase CAIA’s inclusiveness and ensure it captures the full range of tasks crypto analysts actually face.
Agent Expansion
More crypto-native agents will be evaluated alongside general-purpose models. This will offer a clearer picture of where specialized stacks outperform.
Community Contribution
Alliance participation is growing. Expect more members to contribute tasks, propose improvements, and help develop CAIA.