CAIA v0.2: Closing Gaps, Raising Floors

Jun 16, 2025

Following the release of CAIA v0.1, our first benchmark purpose-built for evaluating crypto-native AI agents, we are releasing CAIA v0.2, a major update that increases both the breadth and difficulty of the benchmark.

CAIA (a benchmark for Crypto AI Agents) evaluates AI agents on tasks that reflect the daily responsibilities of an entry-level crypto analyst. These tasks require reasoning, contextual understanding, and domain-specific knowledge across three core areas: onchain analysis, tokenomics diagnostics, and project research.

Version 0.2 increases task coverage by 50 percent, growing from 40 to 60 tasks. It also adds a new focus area: trend analysis. This reflects how real users track narratives and identify early signals in a noisy market.

With expanded datasets and new modeling tests, CAIA v0.2 provides a more complete view into how well AI agents perform in crypto's fast-moving and high-context environment.

1. Overview

| Model | Overall % | Answer % | Reasoning % | Tool Use % | Summary |
| --- | --- | --- | --- | --- | --- |
| OpenAI o3 | 45.0 | 32.5 | 47.7 | 50.8 | Most complete agent: excels across categories, dominates onchain operations |
| OpenAI o4 mini | 39.2 | 32.3 | 40.9 | 39.8 | Strong all-rounder: leads in tokenomics analysis, consistent performance |
| claude-4-sonnet-no-thinking | 36.1 | 31.7 | 39.4 | 33.5 | Discovery specialist: best at project research, but tool usage remains unreliable |
| OpenAI 4.1 | 27.2 | 24.9 | 28.7 | 26.3 | Solid reasoning: handles context well, but struggles with numeric answers |
| DeepSeek r1 | 26.4 | 17.6 | 29.9 | 27.5 | Decent planner: reasonable task breakdowns, weak on execution |
| DeepSeek v3 | 17.5 | 9.9 | 21.9 | 18.6 | Lags behind: underperforms overall with some spikes in research |

Key Takeaways

  • Room for improvement across the board: The best run tops out at 45 %, leaving a ~40-point gap to human junior-analyst level. Most of that gap is mundane — unit slips, execution misses, schema errors — not missing domain knowledge.

  • Suite champions are clear:

    o3 performs best across Onchain Analysis, Trend Analysis, and Overlapping tasks.

    o4 mini is the strongest at interpreting Tokenomics.

    Claude-4 dominates Project Discovery.

  • Tools dictate the ceiling: There's a strong correlation (r ≈ 0.8) between tool use and final answer quality. Simple improvements, e.g., rate-limit retries, unit sanity checks, and block-range helpers, would lift all models more than prompt tuning (see the retry sketch after this list).

  • Reasoning ≠ delivery: DeepSeek models reason nearly as well as the GPT-4 series but fail on output. They often emit hex, raw ABIs, or incomplete fields. A lightweight output linter could recover double-digit performance gains (a minimal linter sketch follows this list).

  • Category dynamics:

    Tokenomics has the widest Tool–Answer gap: Supply math and vesting cliffs often break without purpose-built calculators.

    Overlap tasks reveal schema brittleness: A single extra repo link or address variant can collapse an otherwise correct answer.

    Trend analysis tasks mostly fail due to rate limits: Adding exponential back-off would boost performance with minimal engineering lift.
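
Several of these fixes are thin engineering wrappers rather than model changes. As a minimal sketch of the rate-limit retry idea (the decorator and the data call below are hypothetical, not part of the CAIA harness):

```python
import random
import time
from functools import wraps

def with_backoff(max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry a flaky, rate-limited data call with exponential back-off plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:  # in practice, catch the provider's rate-limit error
                    if attempt == max_retries - 1:
                        raise
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    time.sleep(delay + random.uniform(0, 0.5))
        return wrapper
    return decorator

@with_backoff()
def fetch_dex_volume(pair: str, day: str) -> float:
    """Hypothetical rate-limited call to a DEX data provider."""
    ...
```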

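The output-linter idea is similarly cheap to bolt on. A minimal sketch that flags hex-encoded numbers and missing fields before an answer is submitted (the required fields here are illustrative, not the benchmark's actual answer schema):

```python
REQUIRED_FIELDS = {"answer", "units", "source"}  # illustrative, not CAIA's schema

def lint_answer(payload: dict) -> list[str]:
    """Return a list of problems to fix before a draft answer is submitted."""
    problems = []
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    value = payload.get("answer")
    if isinstance(value, str) and value.lower().startswith("0x"):
        problems.append("numeric answer is hex-encoded; convert to decimal")
    if isinstance(value, (int, float)) and abs(value) > 1e30:
        problems.append("value looks like raw base units (wei); apply token decimals")
    return problems

# A DeepSeek-style slip the linter would catch: hex output, no units, no source
print(lint_answer({"answer": "0x2386f26fc10000"}))
```
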
CAIA - Total Score by Model

2. Horizontal Analysis – Model by Model

| Model | Best Performance | Areas for Improvement | Representative Task |
| --- | --- | --- | --- |
| o3 | Highest tool score (50.8 %) and excels at precise onchain look-ups. | Token-econ prose is thinner (9-pt gap to o4 mini). | “What’s the Uniswap V3 router contract address?” (task-id r4567890-…): o3 scored 98 % with an ENS-based RPC lookup (sketch below); median of peers ≤ 65 %. |
| o4 mini | Most balanced; wins the Tokenomics suite (50 %). | Suffers on rate-limited DEX sweeps. | “What’s Pendle’s router contract address?” (p2345678-…): only model to hit 90 %+. |
| Claude-4 | #1 in Project Discovery (58 %); concise rhetoric. | Omits startBlock in many calls → partial data. | “List three interoperability competitors to Hyperlane …” (d0227345-…): clocks 98 %; GPT family 70–75 %. |
| OpenAI 4.1 | Reasoning still solid (28.7 %). | Arithmetic slips on LP/TVL math. | Same Hyperlane task: lands 97 % overall but drops 12 % on the liquidity-pool TVL query. |
| DeepSeek r1 | Reasoning chain ≥ 30 %. | Formats numbers in hex, drops decimals. | Timestamp lookup (0f1f8a27-…): logic perfect; answer off by 1 sec → 0 % score. |
| DeepSeek v3 | Research trivia occasionally shines. | Lowest Answer accuracy (9.9 %); tool time-outs. | “Donut seed raise & investors?” (m9012345-…): DS v3 scored 83.6 %; others < 65 %. |

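The ENS-based lookup that sets o3 apart on the router-address task in the table above is a one-liner with web3.py; a minimal sketch, where the RPC endpoint and ENS name are placeholders rather than the task's actual inputs:

```python
from web3 import Web3

# Placeholder RPC endpoint; any Ethereum mainnet node works
w3 = Web3(Web3.HTTPProvider("https://mainnet.example-rpc.org"))

# Resolve a human-readable ENS name to its checksummed contract address
address = w3.ens.address("example-router.eth")  # placeholder name
print(address)
```
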
CAIA - Three Categories

3. Vertical Analysis – Cross-Model Patterns

3.1 Category leaders

| Suite | Winner | Δ to runner-up |
| --- | --- | --- |
| Onchain Analysis | o3 (41.5 %) | +7.8 pt over o4 mini |
| Tokenomics Deep-Dive | o4 mini (50.0 %) | +8.1 pt over Claude-4 |
| Project Discovery | Claude-4 (57.9 %) | +0.8 pt over o3 |
| Trend Analysis | o3 (46.1 %) | +13.5 pt over o4 mini |
| Overlap (mixed) | o3 (44.0 %) | +5.3 pt over Claude-4 |

CAIA - Per-Category Average Score by Model

3.2 Metric Spread

  • Std-dev across models: total ± 10 %, answer ± 9 %, reasoning ± 9.5 %, tool_use ± 11 %.

    This shows that tool competence is the largest differentiator.

  • Correlation: reasoning ↔ tool use = 0.78, the strongest link among the three evaluation dimensions, suggesting that tasks where agents reason well also tend to feature accurate, well-parameterized tool calls. Answer quality is positively correlated with both reasoning and tool use, so improving either tends to lift the final output.
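
For reference, the spread and correlation figures above are plain standard deviations and Pearson coefficients over per-task scores; a minimal sketch of the computation, assuming a flat per-task score file (the filename and column names are assumptions):

```python
import pandas as pd

# Assumed layout: one row per (model, task) with 0-100 scores per dimension
scores = pd.read_csv("caia_task_scores.csv")

dims = ["answer", "reasoning", "tool_use"]
print(scores[dims].std())    # spread per evaluation dimension
print(scores[dims].corr())   # pairwise Pearson correlations (e.g., reasoning vs. tool_use)
```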

3.3 Where Every Model Fails

| Archetype | Common Failure | Spreadsheet-backed exemplar (task-id) |
| --- | --- | --- |
| Long-range time-series | Fetches daily data but forgets monthly aggregation (see the sketch below). | Daily swap volume ETH/USDC (89bebeca-…): best Answer 39 %, median 5 %. |
| Unique-address counting | No tool call; Etherscan page-scrape times out. | “Count unique wallets interacting with Ethereum Cyber Bridge on 2025-04-01” (0059aec1-…): every model Answer ≤ 4 %, Tool = 0 %. |
| Proxy-contract metadata | Retrieves correct onchain data but mis-formats answer. | AVSDirectory proxy + GitHub repo (f653d8cb-…): Tool avg 80 %, Answer avg 34 %. |
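
The long-range time-series archetype is usually a one-line aggregation slip rather than a data-access failure; a minimal sketch of the monthly roll-up step that agents skip, assuming a simple date/volume layout for the fetched daily data:

```python
import pandas as pd

# Assumed layout for the fetched daily data: one row per day with date and volume_usd
daily = pd.read_csv("daily_swap_volume.csv", parse_dates=["date"]).set_index("date")

# Roll daily swap volume up to calendar months before answering
monthly = daily["volume_usd"].resample("MS").sum()
print(monthly)
```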

Why the Cyber Bridge Task Collapses

Count unique wallet addresses that interacted (including internal transactions) with the Ethereum Cyber Bridge on UTC 2025-04-01.

Tool_use — every model logged 0 %: none issued an eth_getLogs or Dune query; they resorted to prose guessing.

Answer — Claude-4 scraped a random blog and still earned 4 %. The rest scored 0 %.

Root cause: Most of the bridge’s activity happens through internal calls, which naïve Etherscan scraping doesn’t expose. Agents need a block-range trace_filter or an archive-node internal-transaction query, which none attempted.
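
A minimal sketch of the missing query, assuming access to an archive node that exposes the trace_* namespace (the endpoint, bridge address, and block range below are placeholders, not the task's actual values):

```python
from web3 import Web3

# Placeholder archive-node endpoint with trace_* support (Erigon/OpenEthereum-style)
w3 = Web3(Web3.HTTPProvider("https://archive-node.example.org"))

BRIDGE = "0x0000000000000000000000000000000000000000"  # placeholder bridge address
FROM_BLOCK, TO_BLOCK = 0, 0                             # placeholder range covering 2025-04-01 UTC

# trace_filter surfaces internal calls that plain transaction listings miss
resp = w3.provider.make_request("trace_filter", [{
    "fromBlock": hex(FROM_BLOCK),
    "toBlock": hex(TO_BLOCK),
    "toAddress": [BRIDGE],
}])

callers = {t["action"]["from"] for t in resp["result"] if "from" in t.get("action", {})}
print(len(callers), "unique interacting addresses")
```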

4. Category Analysis - by Evaluation Dimensions

| Category | Avg Answer | Avg Reasoning | Avg Tool | Tool − Answer (pt) |
| --- | --- | --- | --- | --- |
| Onchain Analysis | 31 | 34 | 40 | +9 |
| Tokenomics | 34 | 41 | 37 | +3 |
| Project Discovery | 44 | 51 | 46 | +2 |
| Overlap (mixed) | 27 | 35 | 38 | +11 |
| Trend Analysis | 30 | 36 | 44 | +14 |

  • Tool–Answer gaps are largest in Trend and Overlap. Agents know what the question is asking and which tools to use, but fumble last-mile aggregation and execution.

    For example, on the task “For the top 5 Pump.fun meme tokens by 7-day trading volume from 2025-05-05 to 2025-05-12, provide their contract address”:

    o3 “partially sourced contract addresses by searching Solscan pages but did not attempt to fetch or convert onchain creation timestamps as required” (a minimal conversion sketch follows this list).

  • Project Discovery is the easiest suite (mid-40s average) because it relies on open-web search plus summarization—minimal onchain calls needed.

  • Onchain Analysis underperforms mostly on stable-swap arithmetic; reasoning scores are already low (34 %), and the final calculations are worse still (31 %).
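
The missing conversion step in the Pump.fun example is typically just Unix-seconds-to-UTC formatting; a minimal, chain-agnostic sketch:

```python
from datetime import datetime, timezone

def to_utc_iso(unix_ts: int) -> str:
    """Convert an onchain Unix timestamp (seconds) to a UTC ISO-8601 string."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).isoformat()

print(to_utc_iso(1746403200))  # -> 2025-05-05T00:00:00+00:00
```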

CAIA Dataset Composition

5. Methodology (unchanged core, refreshed data)

  1. Dataset & suites – 60 tasks spanning Onchain Analysis, Tokenomics Analysis, Project Discovery, Trend Analysis, and Overlap. We’ve added 20 new tasks and one new category, Trend Analysis, to introduce more variety into the dataset. (See the breakdown graph below.)
  2. Judging dimensions – answer, reasoning, and tool-use, each scored 0–100 % by an ensemble of LLM judges (o3, GPT-4.1, DeepSeek r1, Claude 4 Sonnet).
  3. This update – scores are computed from the refreshed CAIA model runs. Category means and model means are simple un-weighted averages (a short sketch follows this list).
  4. Reproducibility – Evaluation harness (caia-benchmark/evaluator.py) and raw JSON outputs remain open-sourced.
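
A minimal sketch of how those un-weighted means can be reproduced from the raw outputs, assuming one JSON record per (model, task) with per-dimension scores (the filename and field names are assumptions, not the exact schema emitted by caia-benchmark/evaluator.py):

```python
import json
from collections import defaultdict
from statistics import mean

# Assumed layout: [{"model": ..., "category": ..., "answer": ..., "reasoning": ..., "tool_use": ...}, ...]
with open("caia_raw_results.json") as f:
    records = json.load(f)

per_model = defaultdict(lambda: defaultdict(list))
for r in records:
    for dim in ("answer", "reasoning", "tool_use"):
        per_model[r["model"]][dim].append(r[dim])

# Simple un-weighted averages per model and dimension
for model, dims in per_model.items():
    print(model, {dim: round(mean(vals), 1) for dim, vals in dims.items()})
```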

What’s next?

  • Task Expansion

    CAIA v0.3 will grow the benchmark from 60 to 80 tasks. Expect new datasets proposed by Alliance members, with added coverage in tokenomics, trend analysis, and project discovery to ensure a more balanced distribution.

  • Scope Refinement  

    We will introduce more question types to better reflect real-world analyst workflows. The goal is to increase CAIA’s inclusiveness and ensure it captures the full range of tasks crypto analysts actually face.

  • Agent Expansion 

    More crypto-native agents will be evaluated alongside general-purpose models. This will offer a clearer picture of where specialized stacks outperform.

  • Community Contribution

    Alliance participation is growing. Expect more members to contribute tasks, propose improvements, and help develop CAIA.