Crypto Named Entity Recognition Benchmark v0.1

Announcements
Jul 3, 2025

Introducing the Crypto NER Benchmark

As crypto markets mature, the diversity of data is expanding quickly. Tweets, news clips, research threads, and other sources are becoming key signals. As a result, identifying the relevant projects, tokens, influential X accounts, and VC firms mentioned in that text is the first step toward any higher-level insight. The Crypto NER Benchmark was built to measure how well agents perform in this domain.

The benchmark grades models on whether they can highlight precise character spans for the four entity types that matter most to crypto analysts: Project, Token, Twitter handle, and VC firm. If a model claims that “$ETH” sits between characters 15 and 18, we verify the claim by position.

Key Takeaways

  • DeepSeek v3 wins on both boundary and type. Its recall on Project + Token classes is ≳ 82 %, and it never mislabeled a VC.
  • GPT 4.1 trails by 6 pp in Strict but closes the gap to 4 pp under the looser schemas, suggesting type confusion is its main weakness.
  • Mini vs full GPT 4.1. The smaller checkpoint loses ≈ 5 pp across the board, hinting that there are still many emerging crypto entities that smaller models have yet to catch up on.
  • GPT 4o struggles across the board: a mix of boundary drift (Exact vs Partial gap) and type errors (Strict vs Exact gap).

Datasets at a glance

  • 750 passages (containing tweets, news articles, analyst write-ups).
  • 7,400+ entity mentions covering
    • Project – blockchains, dApps, CEXs, DAOs
    • Token – tickers or symbols (annotated with the $ prefix when they appear that way)
    • Twitter – handles starting with @
    • VC – venture-capital firms and funds
  • Duplicates are kept: every mention must be tagged; “Bitcoin … Bitcoin …” counts twice.
  • Train / test = 594 / 156 passages. All numbers below are on the held-out test set.

How we score models—beyond simple F1

Replicability is the ultimate goal, so the evaluation pipeline is a single reproducible script that feeds prediction files into nervaluate, a span-level evaluator. nervaluate sorts every prediction into five intuitive buckets—Correct, Incorrect, Missing, Spurious, and Partial.

| Code | Meaning |
|------|---------|
| COR | Correct — boundary and type are right |
| INC | Incorrect type |
| MIS | Missing — gold span not predicted |
| SPU | Spurious — predicted span not in gold |
| PAR | Partial overlap |

Those counts are then aggregated under four evaluation schemas:

| Schema | Boundary rule | Type rule | PAR counted as COR? |
|--------|---------------|-----------|---------------------|
| Strict | Exact | Exact | |
| Exact | Exact | ignored | |
| Partial | Exact or partial | ignored | ✔ (PAR = 0.5 credit) |
| Type | Any overlap | Exact | |

For example:

Gold span "Bitcoin" with type Token

Model prediction "Bitcoin" with type Project (boundary exact, wrong type)

| Schema | Outcome | Why |
|--------|---------|-----|
| Strict | INC | Boundary ok ✅ but type mismatched ❌ — fails Strict. |
| Exact | COR | Boundary ok ✅; type is ignored → counts as correct. |
| Partial | COR (weight 1.0) | Full boundary overlap; type ignored. |
| Type | INC | Boundary overlaps ✅ but type mismatched ❌ — the Type schema demands the correct class. |

Now change the prediction to "Bitco" (partial boundary, still typed Project):

| Schema | Outcome | Why |
|--------|---------|-----|
| Strict | INC | Boundary wrong. |
| Exact | INC | Boundary wrong. |
| Partial | PAR (weight 0.5) | Partial overlap counts, regardless of type. |
| Type | INC | Boundary overlaps ✅ but type wrong ❌. |
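
To see how these outcomes map onto nervaluate's output, here is a minimal sketch of the first scenario (exact boundary, wrong type). The character offsets and the unpacking of evaluate() are illustrative assumptions, not the benchmark's actual script:

```python
from nervaluate import Evaluator

# One inner list per passage; offsets here are illustrative.
# Gold: "Bitcoin" annotated as Token. Prediction: same span, typed Project.
true = [[{"label": "Token", "start": 18, "end": 25}]]
pred = [[{"label": "Project", "start": 18, "end": 25}]]

evaluator = Evaluator(true, pred, tags=["Project", "Token", "Twitter", "VC"])

# Older nervaluate releases return (results, results_per_tag); newer ones
# append extra index details, so slicing keeps this version-agnostic.
results, results_per_tag = evaluator.evaluate()[:2]

for schema in ("strict", "exact", "partial", "ent_type"):
    counts = results[schema]
    print(schema, "COR:", counts["correct"], "INC:", counts["incorrect"])
# Expected: strict and ent_type score the span as incorrect,
# while exact and partial score it as correct.
```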

Precision = COR / ACT and Recall = COR / POS, where ACT is the number of predicted spans and POS the number of gold spans; F1 is their harmonic mean (with PAR worth 0.5 where applicable).
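
As a quick sanity check of these formulas, here is a small helper (hypothetical, not part of the benchmark pipeline) that computes the three numbers from nervaluate-style counts, with optional half-credit for PAR:

```python
def schema_prf(cor, inc, par, mis, spu, partial_credit=0.0):
    """Precision/recall/F1 from nervaluate-style counts.

    POS (possible) = gold spans      = COR + INC + PAR + MIS
    ACT (actual)   = predicted spans = COR + INC + PAR + SPU
    Use partial_credit=0.5 for the Partial schema, 0.0 otherwise.
    """
    possible = cor + inc + par + mis
    actual = cor + inc + par + spu
    weighted = cor + partial_credit * par
    precision = weighted / actual if actual else 0.0
    recall = weighted / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The walk-through below ends with 6 COR and 1 MIS under Strict:
print(schema_prf(cor=6, inc=0, par=0, mis=1, spu=0))  # (1.0, 0.857..., 0.923...)
```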

What does this entail?

  • Low Strict F1 but high Exact F1 ⇒ mostly type mistakes.
  • Big gap between Exact and Partial ⇒ boundary drift.
  • Type F1 isolates classification skill when boundaries overlap.

A concrete evaluation walk-through

Snippet from passage #118 (test set)

“Exciting news for DeFi fans! The @Polkadot ecosystem just pushed 850 K DOT into liquidity rewards. Key pools: $WBTC–DOT and $ETH–DOT. Thanks to @_snowbridge, these pairs are super-charged.”

The gold labels are stored as {start, end, label} spans, but asking an LLM to emit character indices is unreliable except for code-enabled agents. So every model is only required to output {"exact_text": "label"} pairs in passage order.

During scoring we regex-scan the passage for each surface form, left to right, to recover the {start, end} indices before passing them to nervaluate (a sketch of this step follows the walk-through). See the example below:

Gold entities (order preserved)

{"@Polkadot":"X"}
{"DOT":"Token"}
{"$WBTC":"Token"}
{"DOT":"Token"}
{"$ETH":"Token"}
{"DOT":"Token"}
{"@_snowbridge":"X"}

DeepSeek-v3 prediction (zero-shot prompt)

| Model output | nervaluate label |
|--------------|------------------|
| {"@Polkadot":"Twitter"} | COR |
| {"DOT":"Token"} | COR |
| {"$WBTC":"Token"} | COR |
| -- (missed second “DOT”) | MIS |
| {"$ETH":"Token"} | COR |
| {"DOT":"Token"} | COR |
| {"@_snowbridge":"Twitter"} | COR |

Result under Strict for this snippet: 6 COR, 1 MIS → Precision = 1.00, Recall = 0.86, F1 = 0.92.

Under Exact the score is identical (all predicted types are correct here); under Partial and Type it stays at 0.92 because a missed mention remains missed under every schema.
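
For completeness, here is a minimal sketch of the offset-recovery step described above. The helper name and the cursor logic are illustrative assumptions, not the benchmark's published script; the key point is that duplicate surface forms (“DOT”) bind to successive occurrences:

```python
import re

passage = ("Exciting news for DeFi fans! The @Polkadot ecosystem just pushed "
           "850 K DOT into liquidity rewards. Key pools: $WBTC–DOT and $ETH–DOT. "
           "Thanks to @_snowbridge, these pairs are super-charged.")

# Model output in the required {"exact_text": "label"} format, in passage order.
predictions = [
    {"@Polkadot": "Twitter"}, {"DOT": "Token"}, {"$WBTC": "Token"},
    {"$ETH": "Token"}, {"DOT": "Token"}, {"@_snowbridge": "Twitter"},
]

def to_spans(text, mentions):
    """Map ordered surface forms to {start, end, label} character spans.

    The cursor only moves forward, so repeated mentions ("DOT") match
    successive occurrences instead of all matching the first one.
    """
    spans, cursor = [], 0
    for mention in mentions:
        (surface, label), = mention.items()
        match = re.search(re.escape(surface), text[cursor:])
        if match is None:
            continue  # surface form not found after the cursor; handled as spurious elsewhere
        start = cursor + match.start()
        end = start + len(surface)
        spans.append({"label": label, "start": start, "end": end})
        cursor = end
    return spans

pred_spans = to_spans(passage, predictions)
# Applying the same function to the gold list recovers the seven gold spans,
# and both span lists can then be handed to nervaluate.
```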

By reading all four F1 scores together, you can diagnose whether a model struggles with boundary drift, type confusion, or both.

Where the benchmark still needs improvement

  • Coverage. Social-media text dominates the dataset, leaving official documents, whitepapers, and GitHub issues underrepresented.
  • Model sampling. Results reflect just a handful of SOTA LLMs and therefore do not capture the full diversity of approaches in the industry.
  • Static ground truth. The crypto industry evolves fast, with hundreds if not thousands of tokens created daily, so periodic re-annotation is essential to keep scores meaningful.

The Crypto NER Benchmark is still in its first iteration, yet it already sheds light on where today’s language models excel and where they miss the mark in a crypto context. Together with our partners in CAIBA, we’re actively expanding the dataset, revisiting the label scheme, and onboarding crypto models and agents. Expect better benchmarks in the months ahead!

Check out the public dataset on Hugging Face: https://huggingface.co/datasets/cyberco/NER-benchmark-750