As crypto markets mature, the diversity of data is expanding quickly. Tweets, news clips, research threads, and other sources are becoming key signals. As a result, identifying relevant projects, tokens, influential X accounts, and VC firms is the first step toward any higher-level insight. The Crypto NER Benchmark was built to measure how well agents perform in this domain.
The benchmark grades models on whether they can highlight precise character spans for four entity types that matter most to crypto analysts: Project, Token, X (Twitter) handle, and VC firm. If a model claims that “$ETH” spans characters 15 to 18, we verify the claim by position.
Replicability is the ultimate goal, so the evaluation pipeline is just a single reproducible script that feeds prediction files into nervaluate, a span-level evaluator. nervaluate sorts every prediction into five intuitive buckets: Correct, Incorrect, Missing, Spurious, and Partial.
Code | Meaning |
---|---|
COR | Correct — boundary and type are right |
INC | Incorrect type |
MIS | Missing — gold span not predicted |
SPU | Spurious — predicted span not in gold |
PAR | Partial overlap |
Those counts are then aggregated under four evaluation schemas:
Schema | Boundary rule | Type rule | PAR counted as COR? |
---|---|---|---|
Strict | Exact | Exact | ✗ |
Exact | Exact | ignored | ✗ |
Partial | Exact or partial | ignored | ✔ (PAR = 0.5 credit) |
Type | Any overlap | Exact | ✔ |
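To make the pipeline concrete, here is a minimal sketch of a nervaluate call, assuming its span-dict input format; the offsets and tag strings below are illustrative, not taken from the benchmark files.

```python
from nervaluate import Evaluator

# One inner list per passage; each span is {"label", "start", "end"}.
# The offsets below are made up purely to show the shapes involved.
gold = [[{"label": "Token", "start": 15, "end": 19}]]   # e.g. "$ETH" annotated as Token
pred = [[{"label": "Token", "start": 15, "end": 19}]]   # the model's recovered span

# Tag strings are assumptions; the dataset defines the exact labels for its
# four entity types (Project, Token, X handle, VC firm).
evaluator = Evaluator(gold, pred, tags=["Project", "Token", "X"])
results = evaluator.evaluate()[0]  # overall results; recent versions also return per-tag breakdowns

# `results` holds one block per schema ("strict", "exact", "partial", "ent_type"),
# each with correct / incorrect / partial / missed / spurious counts plus
# possible, actual, precision, and recall.
print(results["strict"])
```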
For example:
Gold span "Bitcoin"with type Token
Model prediction "Bitcoin" with type Project (boundary exact, wrong type)
Schema | Outcome | Why |
---|---|---|
Strict | INC | Boundary ok ✅ but type mismatched ❌ — fails Strict. |
Exact | COR | Boundary ok ✅; type is ignored → counts as correct. |
Partial | COR (weight 1.0) | Full boundary overlap; type ignored. |
Type | INC | Boundary overlaps ✅ but type mismatched ❌ — Type schema demands correct class. |
Now change the prediction to "Bitco" (partial boundary, type still Project):
Schema | Outcome | Why |
---|---|---|
Strict | INC | Boundary wrong. |
Exact | INC | Boundary wrong. |
Partial | PAR (weight 0.5) | Partial overlap counts, regardless of type. |
Type | INC | Boundary overlaps ✅ but type wrong ❌. |
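The two tables above can be reproduced directly with nervaluate. This is a sketch under the same assumptions as before; the character offsets for "Bitcoin" and "Bitco" are invented for illustration.

```python
from nervaluate import Evaluator

# Gold: "Bitcoin" annotated as Token at illustrative offsets 0-7.
gold = [[{"label": "Token", "start": 0, "end": 7}]]

scenarios = {
    "exact boundary, wrong type":   [[{"label": "Project", "start": 0, "end": 7}]],  # "Bitcoin" as Project
    "partial boundary, wrong type": [[{"label": "Project", "start": 0, "end": 5}]],  # "Bitco" as Project
}

for name, pred in scenarios.items():
    results = Evaluator(gold, pred, tags=["Token", "Project"]).evaluate()[0]
    print(name)
    for schema in ("strict", "exact", "partial", "ent_type"):
        r = results[schema]
        print(f"  {schema:9s} COR={r['correct']} INC={r['incorrect']} PAR={r['partial']}")
```

Under nervaluate's SemEval-style matching, the counts should line up with the tables: the exact-boundary prediction is COR only under Exact and Partial, while the "Bitco" prediction is PAR under Partial and INC everywhere else.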
Precision = COR / ACT and Recall = COR / POS, where ACT is the total number of predicted spans and POS the total number of gold spans; F1 is their harmonic mean. Under the Partial schema, each PAR contributes 0.5 to the COR term.
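A tiny sketch with hypothetical counts, not benchmark numbers, makes the bookkeeping concrete (ACT and POS correspond to nervaluate's "actual" and "possible" totals):

```python
# Hypothetical bucket counts, purely to illustrate the arithmetic.
COR, INC, PAR, MIS, SPU = 6, 1, 1, 1, 1

ACT = COR + INC + PAR + SPU      # everything the model predicted ("actual")
POS = COR + INC + PAR + MIS      # everything in the gold annotation ("possible")

# Partial schema only: each PAR is worth half a COR; the other schemas use COR alone.
precision = (COR + 0.5 * PAR) / ACT
recall    = (COR + 0.5 * PAR) / POS
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```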
What does this look like in practice?
Snippet from passage #118 (test set)
“Exciting news for DeFi fans! The @Polkadot ecosystem just pushed 850 K DOT into liquidity rewards. Key pools: $WBTC–DOT and $ETH–DOT. Thanks to @_snowbridge, these pairs are super-charged.”
The gold labels are stored as {start: idx, end: idx, label} spans, but asking an LLM to emit character indices is unreliable except for code-enabled agents. So every model is **required only to output** {"exact_text": "label"} pairs, in passage order.
During scoring we regex-scan the passage for each surface form, left to right, to recover the {start, end} indices before passing them to nervaluate; a sketch of this recovery step follows the worked example. See the example below:
Gold entities (order preserved)
{"@Polkadot":"X"}
{"DOT":"Token"}
{"$WBTC":"Token"}
{"DOT":"Token"}
{"$ETH":"Token"}
{"DOT":"Token"}
{"@_snowbridge":"X"}
DeepSeek-v3 prediction (zero-shot prompt)
Model output | nervaluate label |
---|---|
{"@Polkadot":"Twitter"} | COR |
{"DOT":"Token"} | COR |
{"$WBTC":"Token"} | COR |
-- (missed second “DOT”) | MIS |
{"$ETH":"Token"} | COR |
{"DOT":"Token"} | COR |
{"@_snowbridge":"Twitter"} | COR |
Result under Strict for this snippet: 6 COR, 1 MIS → Precision = 1.00, Recall = 0.86, F1 = 0.92.
Under Exact the score is identical (all types correct here); under Partial and Type it remains 0.92 because MIS remains a miss.
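To show the recovery step end to end, here is a sketch of the left-to-right scan for this snippet and the resulting nervaluate call. The helper name, the end-exclusive offset convention, and the normalization of the model's "Twitter" label to the gold "X" tag are assumptions for illustration, not the benchmark's exact code.

```python
import re
from nervaluate import Evaluator

passage = ("Exciting news for DeFi fans! The @Polkadot ecosystem just pushed 850 K DOT "
           "into liquidity rewards. Key pools: $WBTC–DOT and $ETH–DOT. "
           "Thanks to @_snowbridge, these pairs are super-charged.")

def to_spans(entities, text):
    """Recover {label, start, end} spans from {"exact_text": "label"} pairs.

    The pairs arrive in passage order, so each surface form is searched only to
    the right of the previous match; that is how repeated mentions such as "DOT"
    map to successive occurrences. End offsets are end-exclusive in this sketch.
    """
    spans, cursor = [], 0
    for item in entities:
        (surface, label), = item.items()
        match = re.search(re.escape(surface), text[cursor:])
        if match is None:
            continue  # an unmatched surface form simply never becomes a span
        start = cursor + match.start()
        spans.append({"label": label, "start": start, "end": start + len(surface)})
        cursor = start + len(surface)
    return spans

gold = [{"@Polkadot": "X"}, {"DOT": "Token"}, {"$WBTC": "Token"}, {"DOT": "Token"},
        {"$ETH": "Token"}, {"DOT": "Token"}, {"@_snowbridge": "X"}]

# DeepSeek-v3's output, with "Twitter" normalized to the gold "X" tag and the
# second "DOT" missing, as in the table above.
pred = [{"@Polkadot": "X"}, {"DOT": "Token"}, {"$WBTC": "Token"},
        {"$ETH": "Token"}, {"DOT": "Token"}, {"@_snowbridge": "X"}]

evaluator = Evaluator([to_spans(gold, passage)], [to_spans(pred, passage)], tags=["Token", "X"])
results = evaluator.evaluate()[0]
print(results["strict"])  # expect 6 correct, 1 missed: precision 1.00, recall ~0.86
```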
By reading all four F1 scores together, you can diagnose whether a model struggles with boundary drift, type confusion, or both.
The Crypto NER Benchmark is still in its first iteration, and it already sheds light on where today’s language models excel and where they miss the mark in a crypto context. Together with our partners in CAIBA, we’re actively expanding the dataset, revisiting the label scheme, and onboarding crypto models and agents. Expect better benchmarks in the months ahead!
Check out our public Hugging Face dataset: https://huggingface.co/datasets/cyberco/NER-benchmark-750