Crypto Named Entity Recognition Benchmark v0.1

Announcements
Jul 3, 2025

Introducing the Crypto NER Benchmark

As crypto markets mature, the diversity of data is expanding quickly. Tweets, news clips, research threads, and other sources are becoming key signals. As a result, identifying the relevant projects, tokens, influential X accounts, and VC firms mentioned in that text is the first step toward any higher-level insight. The Crypto NER Benchmark was built to measure how well agents perform in this domain.

The benchmark grades models on whether they can highlight precise character spans for the four entity types that matter most to crypto analysts: Project, Token, Twitter handle, and VC firm. If a model claims that “$ETH” sits between characters 15 and 18, we verify the claim by position.

Key Takeaways

  • DeepSeek v3 wins on both boundary and type. Its recall on Project + Token classes is ≳ 82 %, and it never mislabeled a VC.
  • GPT 4.1 trails by 6 pp in Strict but closes the gap to 4 pp under the looser schemas, suggesting type confusion is its main weakness.
  • Mini vs full GPT 4.1. The smaller checkpoint loses ≈ 5 pp across the board, hinting that there are still many emerging crypto entities that smaller models have yet to catch up on.
  • GPT 4o struggles across the board: a mix of boundary drift (Exact vs Partial gap) and type errors (Strict vs Exact gap).

Datasets at a glance

  • 750 passages (containing tweets, news articles, analyst write-ups).
  • 7,400+ entity mentions covering
    • Project – blockchains, dApps, CEXs, DAOs
    • Token – tickers or symbols (annotated with the $ prefix when they appear that way)
    • Twitter – handles starting with @
    • VC – venture-capital firms and funds
  • Duplicates are kept: every mention must be tagged; “Bitcoin … Bitcoin …” counts twice.
  • Train / test = 594 / 156 passages. All numbers below are on the held-out test set.

How we score models—beyond simple F1

Replicability is the ultimate goal, so the evaluation pipeline is a single reproducible script that feeds prediction files into nervaluate, a span-level evaluator. nervaluate sorts every prediction into five intuitive buckets—Correct, Incorrect, Missing, Spurious, and Partial.

| Code | Meaning |
|------|---------|
| COR | Correct — boundary and type are right |
| INC | Incorrect type |
| MIS | Missing — gold span not predicted |
| SPU | Spurious — predicted span not in gold |
| PAR | Partial overlap |

Those counts are then aggregated under four evaluation schemas:

| Schema | Boundary rule | Type rule | PAR counted as COR? |
|--------|---------------|-----------|---------------------|
| Strict | Exact | Exact | |
| Exact | Exact | ignored | |
| Partial | Exact or partial | ignored | ✔ (PAR = 0.5 credit) |
| Type | Any overlap | Exact | |

For example:

Gold span "Bitcoin" with type Token

Model prediction "Bitcoin" with type Project (boundary exact, wrong type)

| Schema | Outcome | Why |
|--------|---------|-----|
| Strict | INC | Boundary ok ✅ but type mismatched ❌ — fails Strict. |
| Exact | COR | Boundary ok ✅; type is ignored → counts as correct. |
| Partial | COR (weight 1.0) | Full boundary overlap; type ignored. |
| Type | INC | Boundary overlaps ✅ but type mismatched ❌ — the Type schema demands the correct class. |

Now change the prediction to "Bitco" (partial boundary, still typed Project):

| Schema | Outcome | Why |
|--------|---------|-----|
| Strict | INC | Boundary wrong. |
| Exact | INC | Boundary wrong. |
| Partial | PAR (weight 0.5) | Partial overlap counts, regardless of type. |
| Type | INC | Boundary overlaps ✅ but type wrong ❌. |
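
To see how these outcomes map onto nervaluate's output, here is a minimal sketch of the first scenario (exact boundary, wrong type). The character offsets and the unpacking of evaluate() are illustrative assumptions, not the benchmark's actual script:

```python
from nervaluate import Evaluator

# One inner list per passage; offsets here are illustrative.
# Gold: "Bitcoin" annotated as Token. Prediction: same span, typed Project.
true = [[{"label": "Token", "start": 18, "end": 25}]]
pred = [[{"label": "Project", "start": 18, "end": 25}]]

evaluator = Evaluator(true, pred, tags=["Project", "Token", "Twitter", "VC"])

# Older nervaluate releases return (results, results_per_tag); newer ones
# append extra index details, so slicing keeps this version-agnostic.
results, results_per_tag = evaluator.evaluate()[:2]

for schema in ("strict", "exact", "partial", "ent_type"):
    counts = results[schema]
    print(schema, "COR:", counts["correct"], "INC:", counts["incorrect"])
# Expected: strict and ent_type score the span as incorrect,
# while exact and partial score it as correct.
```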

Precision = COR / ACT and Recall = COR / POS, where ACT is the number of predicted spans and POS the number of gold spans; F1 is their harmonic mean (with PAR worth 0.5 where applicable).
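
As a quick sanity check of these formulas, here is a small helper (hypothetical, not part of the benchmark pipeline) that computes the three numbers from nervaluate-style counts, with optional half-credit for PAR:

```python
def schema_prf(cor, inc, par, mis, spu, partial_credit=0.0):
    """Precision/recall/F1 from nervaluate-style counts.

    POS (possible) = gold spans      = COR + INC + PAR + MIS
    ACT (actual)   = predicted spans = COR + INC + PAR + SPU
    Use partial_credit=0.5 for the Partial schema, 0.0 otherwise.
    """
    possible = cor + inc + par + mis
    actual = cor + inc + par + spu
    weighted = cor + partial_credit * par
    precision = weighted / actual if actual else 0.0
    recall = weighted / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The walk-through below ends with 6 COR and 1 MIS under Strict:
print(schema_prf(cor=6, inc=0, par=0, mis=1, spu=0))  # (1.0, 0.857..., 0.923...)
```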

What does this entail?

  • Low Strict F1 but high Exact F1 ⇒ mostly type mistakes.
  • Big gap between Exact and Partial ⇒ boundary drift.
  • Type F1 isolates classification skill when boundaries overlap.

A concrete evaluation walk-through

Snippet from passage #118 (test set)

“Exciting news for DeFi fans! The @Polkadot ecosystem just pushed 850 K DOT into liquidity rewards. Key pools: $WBTC–DOT and $ETH–DOT. Thanks to @_snowbridge, these pairs are super-charged.”

The gold labels are stored as {start, end, label} spans, but asking an LLM to emit character indices is unreliable except for code-enabled agents. So every model is only required to output {"exact_text": "label"} pairs in passage order.

During scoring we regex-scan the passage for each surface form, left to right, to recover the {start, end} indices before passing them to nervaluate (a sketch of this step follows the walk-through). See the example below:

Gold entities (order preserved)

{"@Polkadot":"X"}
{"DOT":"Token"}
{"$WBTC":"Token"}
{"DOT":"Token"}
{"$ETH":"Token"}
{"DOT":"Token"}
{"@_snowbridge":"X"}

DeepSeek-v3 prediction (zero-shot prompt)

| Model output | nervaluate label |
|--------------|------------------|
| {"@Polkadot":"Twitter"} | COR |
| {"DOT":"Token"} | COR |
| {"$WBTC":"Token"} | COR |
| -- (missed second “DOT”) | MIS |
| {"$ETH":"Token"} | COR |
| {"DOT":"Token"} | COR |
| {"@_snowbridge":"Twitter"} | COR |

Result under Strict for this snippet: 6 COR, 1 MIS → Precision = 1.00, Recall = 0.86, F1 = 0.92.

Under Exact the score is identical (all predicted types are correct here); under Partial and Type it stays at 0.92 because a missed mention remains missed under every schema.
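
For completeness, here is a minimal sketch of the offset-recovery step described above. The helper name and the cursor logic are illustrative assumptions, not the benchmark's published script; the key point is that duplicate surface forms (“DOT”) bind to successive occurrences:

```python
import re

passage = ("Exciting news for DeFi fans! The @Polkadot ecosystem just pushed "
           "850 K DOT into liquidity rewards. Key pools: $WBTC–DOT and $ETH–DOT. "
           "Thanks to @_snowbridge, these pairs are super-charged.")

# Model output in the required {"exact_text": "label"} format, in passage order.
predictions = [
    {"@Polkadot": "Twitter"}, {"DOT": "Token"}, {"$WBTC": "Token"},
    {"$ETH": "Token"}, {"DOT": "Token"}, {"@_snowbridge": "Twitter"},
]

def to_spans(text, mentions):
    """Map ordered surface forms to {start, end, label} character spans.

    The cursor only moves forward, so repeated mentions ("DOT") match
    successive occurrences instead of all matching the first one.
    """
    spans, cursor = [], 0
    for mention in mentions:
        (surface, label), = mention.items()
        match = re.search(re.escape(surface), text[cursor:])
        if match is None:
            continue  # surface form not found after the cursor; handled as spurious elsewhere
        start = cursor + match.start()
        end = start + len(surface)
        spans.append({"label": label, "start": start, "end": end})
        cursor = end
    return spans

pred_spans = to_spans(passage, predictions)
# Applying the same function to the gold list recovers the seven gold spans,
# and both span lists can then be handed to nervaluate.
```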

By reading all four F1 scores together, you can diagnose whether a model struggles with boundary drift, type confusion, or both.

Where the benchmark still needs improvement

  • Coverage. Social-media text dominates the dataset, leaving official documents, whitepapers, and GitHub issues underrepresented.
  • Model sampling. Results reflect just a handful of SOTA LLMs and therefore do not capture the full diversity of approaches in the industry.
  • Static ground truth. The crypto industry evolves fast, with hundreds if not thousands of tokens created daily, so periodic re-annotation is essential to keep scores meaningful.

The Crypto NER Benchmark is still in its first iteration, yet it already sheds light on where today’s language models excel and where they miss the mark in a crypto context. Together with our partners in CAIBA, we’re actively expanding the dataset, revisiting the label scheme, and onboarding crypto models and agents. Expect better benchmarks in the months ahead!

Check out the public dataset on Hugging Face: https://huggingface.co/datasets/cyberco/NER-benchmark-750