CAIA v0.2: Closing Gaps, Raising Floors

Jun 16, 2025

Following the release of CAIA v0.1, our first benchmark purpose-built for evaluating crypto-native AI agents, we are releasing CAIA v0.2, a major update that increases both the breadth and difficulty of the benchmark.

CAIA (a benchmark for Crypto AI Agents) evaluates AI agents on tasks that reflect the daily responsibilities of an entry-level crypto analyst. These tasks require reasoning, contextual understanding, and domain-specific knowledge across three core areas: onchain analysis, tokenomics diagnostics, and project research.

Version 0.2 increases task coverage by 50 percent, growing from 40 to 60 tasks. It also adds a new focus area: trend analysis. This reflects how real users track narratives and identify early signals in a noisy market.

With expanded datasets and new modeling tests, CAIA v0.2 provides a more complete view into how well AI agents perform in crypto's fast-moving and high-context environment.

1. Overview

| Model | Overall % | Answer % | Reasoning % | Tool Use % | Summary |
| --- | --- | --- | --- | --- | --- |
| OpenAI o3 | 45.0 | 32.5 | 47.7 | 50.8 | Most complete agent: excels across categories, dominates onchain operations |
| OpenAI o4 mini | 39.2 | 32.3 | 40.9 | 39.8 | Strong all-rounder: leads in tokenomics analysis, consistent performance |
| claude-4-sonnet-no-thinking | 36.1 | 31.7 | 39.4 | 33.5 | Discovery specialist: best at project research, but tool usage remains unreliable |
| OpenAI 4.1 | 27.2 | 24.9 | 28.7 | 26.3 | Solid reasoning: handles context well, but struggles with numeric answers |
| DeepSeek r1 | 26.4 | 17.6 | 29.9 | 27.5 | Decent planner: reasonable task breakdowns, weak on execution |
| DeepSeek v3 | 17.5 | 9.9 | 21.9 | 18.6 | Lags behind: underperforms overall with some spikes in research |

Key Takeaways

  • Room for improvement across the board: The best run tops out at 45 %, leaving a ~40-point gap to human junior-analyst level. Most of that gap is mundane — unit slips, execution misses, schema errors — not missing domain knowledge.

  • Suite champions are clear:

    o3 performs best across Onchain Analysis, Trend Analysis, and Overlapping tasks.

    o4 mini is the strongest at interpreting Tokenomics.

    Claude-4 dominates Project Discovery.

  • Tools dictate the ceiling: There's a strong correlation (r ≈ 0.8) between tool use and final answer quality. Simple improvements, e.g., rate-limit retries, unit sanity checks, and block-range helpers, would lift all models more than prompt tuning (see the retry sketch after this list).

  • Reasoning ≠ delivery: DeepSeek models reason nearly as well as the GPT-4 series but fail on output. They often emit hex, raw ABIs, or incomplete fields. A lightweight output linter could recover double-digit performance gains (a minimal linter sketch follows this list).

  • Category dynamics:

    Tokenomics has the widest Tool–Answer gap: Supply math and vesting cliffs often break without purpose-built calculators.

    Overlap tasks reveal schema brittleness: A single extra repo link or address variant can collapse an otherwise correct answer.

    Trend analysis tasks mostly fail due to rate limits: Adding exponential back-off would boost performance with minimal engineering lift.
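
Several of these fixes are thin engineering wrappers rather than model changes. As a minimal sketch of the rate-limit retry idea (the decorator and the data call below are hypothetical, not part of the CAIA harness):

```python
import random
import time
from functools import wraps

def with_backoff(max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry a flaky, rate-limited data call with exponential back-off plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:  # in practice, catch the provider's rate-limit error
                    if attempt == max_retries - 1:
                        raise
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    time.sleep(delay + random.uniform(0, 0.5))
        return wrapper
    return decorator

@with_backoff()
def fetch_dex_volume(pair: str, day: str) -> float:
    """Hypothetical rate-limited call to a DEX data provider."""
    ...
```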

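The output-linter idea is similarly cheap to bolt on. A minimal sketch that flags hex-encoded numbers and missing fields before an answer is submitted (the required fields here are illustrative, not the benchmark's actual answer schema):

```python
REQUIRED_FIELDS = {"answer", "units", "source"}  # illustrative, not CAIA's schema

def lint_answer(payload: dict) -> list[str]:
    """Return a list of problems to fix before a draft answer is submitted."""
    problems = []
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    value = payload.get("answer")
    if isinstance(value, str) and value.lower().startswith("0x"):
        problems.append("numeric answer is hex-encoded; convert to decimal")
    if isinstance(value, (int, float)) and abs(value) > 1e30:
        problems.append("value looks like raw base units (wei); apply token decimals")
    return problems

# A DeepSeek-style slip the linter would catch: hex output, no units, no source
print(lint_answer({"answer": "0x2386f26fc10000"}))
```
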
CAIA - Total Score by Model

2. Horizontal Analysis – Model by Model

| Model | Best Performance | Areas for Improvement | Representative Task |
| --- | --- | --- | --- |
| o3 | Highest tool score (50.8 %) and excels at precise onchain look-ups. | Token-econ prose is thinner (9-pt gap to o4 mini). | “What’s the Uniswap V3 router contract address?” (task-id r4567890-…): o3 scored 98 % with an ENS-based RPC lookup (sketch below); median of peers ≤ 65 %. |
| o4 mini | Most balanced; wins the Tokenomics suite (50 %). | Suffers on rate-limited DEX sweeps. | “What’s Pendle’s router contract address?” (p2345678-…): only model to hit 90 %+. |
| Claude-4 | #1 in Project Discovery (58 %); concise rhetoric. | Omits startBlock in many calls → partial data. | “List three interoperability competitors to Hyperlane …” (d0227345-…): clocks 98 %; GPT family 70–75 %. |
| OpenAI 4.1 | Reasoning still solid (28.7 %). | Arithmetic slips on LP/TVL math. | Same Hyperlane task: lands 97 % overall but drops 12 % on the liquidity-pool TVL query. |
| DeepSeek r1 | Reasoning chain ≥ 30 %. | Formats numbers in hex, drops decimals. | Timestamp lookup (0f1f8a27-…): logic perfect; answer off by 1 sec → 0 % score. |
| DeepSeek v3 | Research trivia occasionally shines. | Lowest Answer accuracy (9.9 %); tool time-outs. | “Donut seed raise & investors?” (m9012345-…): DS v3 scored 83.6 %; others < 65 %. |

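The ENS-based lookup that sets o3 apart on the router-address task in the table above is a one-liner with web3.py; a minimal sketch, where the RPC endpoint and ENS name are placeholders rather than the task's actual inputs:

```python
from web3 import Web3

# Placeholder RPC endpoint; any Ethereum mainnet node works
w3 = Web3(Web3.HTTPProvider("https://mainnet.example-rpc.org"))

# Resolve a human-readable ENS name to its checksummed contract address
address = w3.ens.address("example-router.eth")  # placeholder name
print(address)
```
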
CAIA - Three Categories

3. Vertical Analysis – Cross-Model Patterns

3.1 Category leaders

| Suite | Winner | Δ to runner-up |
| --- | --- | --- |
| Onchain Analysis | o3 (41.5 %) | +7.8 pt over o4 mini |
| Tokenomics Deep-Dive | o4 mini (50.0 %) | +8.1 pt over Claude-4 |
| Project Discovery | Claude-4 (57.9 %) | +0.8 pt over o3 |
| Trend Analysis | o3 (46.1 %) | +13.5 pt over o4 mini |
| Overlap (mixed) | o3 (44.0 %) | +5.3 pt over Claude-4 |

CAIA - Per-Category Average Score by Model

3.2 Metric Spread

  • Std-dev across models: total ± 10 %, answer ± 9 %, reasoning ± 9.5 %, tool_use ± 11 %.

    This shows that tool competence is the largest differentiator.

  • Correlation: reasoning ↔ tool use = 0.78, the strongest link among the three evaluation dimensions, suggesting that tasks where agents reason well also tend to feature accurate, well-parameterized tool calls. Answer quality is positively correlated with both reasoning and tool use, so improving either tends to lift the final output.
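
For reference, the spread and correlation figures above are plain standard deviations and Pearson coefficients over per-task scores; a minimal sketch of the computation, assuming a flat per-task score file (the filename and column names are assumptions):

```python
import pandas as pd

# Assumed layout: one row per (model, task) with 0-100 scores per dimension
scores = pd.read_csv("caia_task_scores.csv")

dims = ["answer", "reasoning", "tool_use"]
print(scores[dims].std())    # spread per evaluation dimension
print(scores[dims].corr())   # pairwise Pearson correlations (e.g., reasoning vs. tool_use)
```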

3.3 Where Every Model Fails

| Archetype | Common Failure | Spreadsheet-backed exemplar (task-id) |
| --- | --- | --- |
| Long-range time-series | Fetches daily data but forgets monthly aggregation (see the sketch below). | Daily swap volume ETH/USDC (89bebeca-…): best Answer 39 %, median 5 %. |
| Unique-address counting | No tool call; Etherscan page-scrape times out. | “Count unique wallets interacting with Ethereum Cyber Bridge on 2025-04-01” (0059aec1-…): every model Answer ≤ 4 %, Tool = 0 %. |
| Proxy-contract metadata | Retrieves correct onchain data but mis-formats answer. | AVSDirectory proxy + GitHub repo (f653d8cb-…): Tool avg 80 %, Answer avg 34 %. |
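
The long-range time-series archetype is usually a one-line aggregation slip rather than a data-access failure; a minimal sketch of the monthly roll-up step that agents skip, assuming a simple date/volume layout for the fetched daily data:

```python
import pandas as pd

# Assumed layout for the fetched daily data: one row per day with date and volume_usd
daily = pd.read_csv("daily_swap_volume.csv", parse_dates=["date"]).set_index("date")

# Roll daily swap volume up to calendar months before answering
monthly = daily["volume_usd"].resample("MS").sum()
print(monthly)
```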

Why the Cyber Bridge Task Collapses

Count unique wallet addresses that interacted (including internal transactions) with the Ethereum Cyber Bridge on UTC 2025-04-01.

Tool_use — every model logged 0 %: none issued an eth_getLogs or Dune query; they resorted to prose guessing.

Answer — Claude-4 scraped a random blog and still earned 4 %. The rest scored 0 %.

Root cause: Most of the bridge’s activity happens through internal calls, which naïve Etherscan scraping doesn’t expose. Agents need a block-range trace_filter or an archive-node internal-transaction query, which none attempted.
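
A minimal sketch of the missing query, assuming access to an archive node that exposes the trace_* namespace (the endpoint, bridge address, and block range below are placeholders, not the task's actual values):

```python
from web3 import Web3

# Placeholder archive-node endpoint with trace_* support (Erigon/OpenEthereum-style)
w3 = Web3(Web3.HTTPProvider("https://archive-node.example.org"))

BRIDGE = "0x0000000000000000000000000000000000000000"  # placeholder bridge address
FROM_BLOCK, TO_BLOCK = 0, 0                             # placeholder range covering 2025-04-01 UTC

# trace_filter surfaces internal calls that plain transaction listings miss
resp = w3.provider.make_request("trace_filter", [{
    "fromBlock": hex(FROM_BLOCK),
    "toBlock": hex(TO_BLOCK),
    "toAddress": [BRIDGE],
}])

callers = {t["action"]["from"] for t in resp["result"] if "from" in t.get("action", {})}
print(len(callers), "unique interacting addresses")
```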

4. Category Analysis - by Evaluation Dimensions

| Category | Avg Answer | Avg Reasoning | Avg Tool | Tool − Answer (pt) |
| --- | --- | --- | --- | --- |
| Onchain Analysis | 31 | 34 | 40 | +9 |
| Tokenomics | 34 | 41 | 37 | +3 |
| Project Discovery | 44 | 51 | 46 | +2 |
| Overlap (mixed) | 27 | 35 | 38 | +11 |
| Trend Analysis | 30 | 36 | 44 | +14 |

  • Tool–Answer gaps are largest in Trend and Overlap. Agents know what the question is asking and which tools to use, but fumble last-mile aggregation and execution.

    For example, on the task “For the top 5 Pump.fun meme tokens by 7-day trading volume from 2025-05-05 to 2025-05-12, provide their contract address”:

    o3 “partially sourced contract addresses by searching Solscan pages but did not attempt to fetch or convert onchain creation timestamps as required” (a minimal conversion sketch follows this list).

  • Project Discovery is the easiest suite (mid-40s average) because it relies on open-web search plus summarization—minimal onchain calls needed.

  • Onchain Analysis underperforms mostly on stable-swap arithmetic; reasoning scores are already low (34 %), and the final calculations are worse still (31 %).
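
The missing conversion step in the Pump.fun example is typically just Unix-seconds-to-UTC formatting; a minimal, chain-agnostic sketch:

```python
from datetime import datetime, timezone

def to_utc_iso(unix_ts: int) -> str:
    """Convert an onchain Unix timestamp (seconds) to a UTC ISO-8601 string."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).isoformat()

print(to_utc_iso(1746403200))  # -> 2025-05-05T00:00:00+00:00
```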

CAIA Dataset Composition

5. Methodology (unchanged core, refreshed data)

  1. Dataset & suites – 60 tasks spanning Onchain Analysis, Tokenomics Analysis, Project Discovery, Trend Analysis, and Overlap. We’ve added 20 new tasks and one new category, Trend Analysis, to introduce more variety into the dataset. (See the breakdown graph below.)
  2. Judging dimensions – answer, reasoning, and tool-use, each scored 0–100 % by an ensemble of LLM judges (o3, GPT-4.1, DeepSeek r1, Claude 4 Sonnet).
  3. This update – scores are computed from the refreshed CAIA model runs. Category means and model means are simple un-weighted averages (a short sketch follows this list).
  4. Reproducibility – Evaluation harness (caia-benchmark/evaluator.py) and raw JSON outputs remain open-sourced.
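
A minimal sketch of how those un-weighted means can be reproduced from the raw outputs, assuming one JSON record per (model, task) with per-dimension scores (the filename and field names are assumptions, not the exact schema emitted by caia-benchmark/evaluator.py):

```python
import json
from collections import defaultdict
from statistics import mean

# Assumed layout: [{"model": ..., "category": ..., "answer": ..., "reasoning": ..., "tool_use": ...}, ...]
with open("caia_raw_results.json") as f:
    records = json.load(f)

per_model = defaultdict(lambda: defaultdict(list))
for r in records:
    for dim in ("answer", "reasoning", "tool_use"):
        per_model[r["model"]][dim].append(r[dim])

# Simple un-weighted averages per model and dimension
for model, dims in per_model.items():
    print(model, {dim: round(mean(vals), 1) for dim, vals in dims.items()})
```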

What’s next?

  • Task Expansion

    CAIA v0.3 will grow the benchmark from 60 to 80 tasks. Expect new datasets proposed by Alliance members, with added coverage in tokenomics, trend analysis, and project discovery to ensure a more balanced distribution.

  • Scope Refinement  

    We will introduce more question types to better reflect real-world analyst workflows. The goal is to increase CAIA’s inclusiveness and ensure it captures the full range of tasks crypto analysts actually face.

  • Agent Expansion 

    More crypto-native agents will be evaluated alongside general-purpose models. This will offer a clearer picture of where specialized stacks outperform.

  • Community Contribution

    Alliance participation is growing. Expect more members to contribute tasks, propose improvements, and help develop CAIA.