Evaluate historical NER quality by reporting precision, recall and F1 broken down per entity type and per source period — never a single aggregate — and by scoring both strict and relaxed span matching so boundary errors and missed entities are visible separately. A global F1 of 0.9 routinely hides an entity class sitting at 0.4 recall in one century. This guide gives the metrics, the test-set design, and a reproducible checklist that makes your evaluation defensible to a reviewer.
Which metrics actually matter?
The base trio is precision, recall and F1, but the unit of reporting is what counts. Always disaggregate:
```text
type  period   strict-F1  relaxed-F1  recall
PER   1600s    0.71       0.88        0.74
PER   1800s    0.93       0.96        0.95
ORG   1600s    0.42       0.61        0.39    <-- the real problem
ORG   1800s    0.80       0.88        0.82
```

The aggregate of that table might read 0.84 and look healthy. The breakdown shows 17th-century organisations are barely usable. A single number is not an evaluation; it is a way of not looking.
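A minimal sketch of how a per-type, per-period breakdown like this could be computed. It assumes each gold and predicted entity is a dict with `doc`, `period`, `label`, `start` and `end` keys; those field names and the `score_by_class` helper are illustrative, not taken from any particular library. Matching is greedy and one-to-one inside each (type, period) group, which is a simplification.

```python
from collections import defaultdict


def spans_overlap(a, b):
    """True when two half-open [start, end) spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def is_match(pred, gold, strict):
    """Same document and label, then exact boundaries (strict) or any overlap (relaxed)."""
    if pred["doc"] != gold["doc"] or pred["label"] != gold["label"]:
        return False
    if strict:
        return pred["start"] == gold["start"] and pred["end"] == gold["end"]
    return spans_overlap(pred, gold)


def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from a true-positive count and the two totals."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def score_by_class(preds, golds, strict):
    """Return {(label, period): scores}, never a single aggregate."""
    grouped = defaultdict(lambda: {"pred": [], "gold": []})
    for p in preds:
        grouped[(p["label"], p["period"])]["pred"].append(p)
    for g in golds:
        grouped[(g["label"], g["period"])]["gold"].append(g)

    table = {}
    for key, group in grouped.items():
        unmatched = list(group["gold"])
        tp = 0
        for p in group["pred"]:
            # Greedy: take the first gold entity not yet claimed by another prediction.
            hit = next((g for g in unmatched if is_match(p, g, strict)), None)
            if hit is not None:
                unmatched.remove(hit)
                tp += 1
        table[key] = prf(tp, len(group["pred"]), len(group["gold"]))
    return table
```

Calling `score_by_class` once with `strict=True` and once with `strict=False` on the same data gives the two F1 columns; a prediction that only clips a gold boundary counts as a miss in the first call and a hit in the second.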
Strict or relaxed span matching?
Report both, because they answer different questions.
| Mode | Counts a hit when | Tells you |
|---|---|---|
| Strict | boundaries match exactly | are spans precise enough for linking? |
| Relaxed | spans overlap at all | did the model find the entity at all? |
OCR noise produces off-by-one boundaries (Edw ard, trailing punctuation), so strict scores suffer even when the entity was effectively found. A large strict-vs-relaxed gap points straight at a boundary or tokenisation problem rather than a recognition one.
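```python
def overlap(a, b):
    # Spans are treated as half-open [start, end): touching spans do not overlap.
    return not (a["end"] <= b["start"] or b["end"] <= a["start"])


def relaxed_match(pred, gold):
    # Relaxed hit: same label and at least one shared character.
    return pred["label"] == gold["label"] and overlap(pred, gold)
```

How big and how representative must the gold set be?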
Two separate requirements:
- Size: enough to estimate per-class recall — typically a few hundred annotated entities per type and period you care about. Fifty entities give confidence intervals too wide to act on.
- Representativeness: the gold set must mirror the real distribution of OCR quality, genre and period in your corpus. A test set drawn only from clean, recent pages will flatter a model that fails on the hard 17th-century material.
This is the most common reason a model "scores well but fails in production": the test set quietly resembled the training data, not the corpus.
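To see why size matters for per-class recall, here is a small sketch using the Wilson score interval for a proportion. The choice of the Wilson method, the 95% level (`z=1.96`) and the observed recall of 0.40 are assumptions made for illustration; the function name is hypothetical.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion such as per-class recall."""
    if n == 0:
        return (0.0, 1.0)  # no evidence at all: the interval spans everything
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


# Same observed recall (0.40), two gold-set sizes for one type and period.
for hits, n in [(20, 50), (120, 300)]:
    low, high = wilson_interval(hits, n)
    print(f"n={n}: recall 0.40, 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval is roughly 0.28 to 0.54 with 50 entities and roughly 0.35 to 0.46 with 300, so only the larger sample separates "poor" from "unusable" with any confidence.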
How do you evaluate entity linking separately?
Recognition and linking are different tasks and need different scores. A model can span John Smith perfectly and link it to the wrong John Smith. Measure linking on the subset of correctly-recognised entities:
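```python
def linking_accuracy(pairs):
    # Linking is scored only on entities whose span was recognised correctly.
    correct = sum(1 for p in pairs
                  if p["span_correct"] and p["pred_qid"] == p["gold_qid"])
    spanned = sum(1 for p in pairs if p["span_correct"])
    # Guard against an empty subset so the function never divides by zero.
    return correct / spanned if spanned else 0.0
```

Report this alongside NER F1 so a strong recogniser cannot mask weak disambiguation.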
Can an LLM judge your output?
It can triage, not adjudicate. An LLM judge speeds the first pass over large outputs but introduces its own anachronism and hallucination errors. Anchor any automated scoring to a human-adjudicated gold sample, and report the agreement between the LLM and the humans so readers can weigh the automated figures honestly.
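One way to report that agreement is Cohen's kappa, which corrects raw agreement for chance. This is a sketch under assumptions: the choice of kappa as the statistic, the function name, and the `"correct"`/`"wrong"` verdict labels are illustrative, and the inputs are the human and LLM verdicts on the same adjudicated sample, in the same order.

```python
from collections import Counter


def cohen_kappa(human, llm):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Share of items on which the two raters gave the same verdict.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Agreement expected by chance from each rater's label frequencies.
    human_freq, llm_freq = Counter(human), Counter(llm)
    expected = sum(human_freq[k] * llm_freq[k] for k in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on six predictions from the human-adjudicated sample.
human = ["correct", "correct", "wrong", "correct", "wrong", "correct"]
llm = ["correct", "correct", "correct", "correct", "wrong", "wrong"]
print(f"kappa = {cohen_kappa(human, llm):.2f}")
```

Raw agreement in this toy sample is 4 out of 6, but kappa is only 0.25, which is why the chance-corrected figure is the more honest one to publish next to LLM-judged scores.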
What does a defensible evaluation record contain?
Treat the evaluation itself as a reproducible artefact:
```text
eval/
  gold_v4.jsonl          # stratified by period + type, with offsets
  scores_per_class.csv   # strict/relaxed P, R, F1 by type and decade
  linking_accuracy.txt
  model.txt              # name, version, date
  notes.md               # known weak classes, sampling method
```

Anyone reading it should be able to reconstruct exactly what was measured, on what, and how.
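As a rough sketch, two of those files could be written by a small helper like the one below. The file names come from the layout above; the CSV column names, the `name:`/`version:`/`date:` line format and the `write_eval_record` function itself are assumptions made for illustration.

```python
import csv
from datetime import date
from pathlib import Path


def write_eval_record(out_dir, rows, model_name, model_version, run_date=None):
    """Write scores_per_class.csv and model.txt into out_dir and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One row per (type, decade, matching mode), so no class is averaged away.
    fields = ["type", "decade", "mode", "precision", "recall", "f1"]
    with open(out / "scores_per_class.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

    # Record which model produced the scores and when the run happened.
    run_date = run_date or date.today()
    (out / "model.txt").write_text(
        f"name: {model_name}\nversion: {model_version}\ndate: {run_date.isoformat()}\n",
        encoding="utf-8",
    )
    return out
```

Each element of `rows` is a dict keyed by the six column names. Committing the resulting directory next to the gold file keeps the scores tied to the exact model and date that produced them.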
What is the reviewer-ready checklist?
Before you call quality acceptable, confirm that:
- per-type and per-period scores are reported;
- strict and relaxed scores are both shown;
- the gold set is stratified to match the corpus and large enough for per-class recall;
- linking is scored separately;
- weak classes are named explicitly rather than averaged away;
- the whole evaluation is versioned and reproducible.
Key Takeaways
- Report precision, recall and F1 per entity type and per period; aggregates hide class collapse.
- Score both strict and relaxed span matching — the gap diagnoses boundary versus recognition errors.
- Size the gold set for per-class recall and stratify it to mirror the real corpus distribution.
- A model that scores well but fails in production usually had an unrepresentative test set.
- Evaluate entity linking separately from span detection so good NER does not mask poor disambiguation.
- Use LLM judges only to triage, anchored to a human gold sample with reported agreement.
- Version the evaluation as a reproducible artefact and name weak classes explicitly.
Frequently Asked Questions
What metrics should I report for historical NER?
Report precision, recall and F1 per entity type and per source period, not just a single aggregate. Historical performance varies sharply by century and genre, and a global F1 hides classes that have collapsed to near-zero recall.
Should I use strict or relaxed span matching?
Report both. Strict (exact-boundary) matching is honest but punishes off-by-one boundaries common with OCR noise; relaxed (overlap) matching shows whether the model found the entity at all. The gap between them is itself diagnostic.
How large should the gold test set be?
Large enough to estimate per-class recall, typically a few hundred annotated entities per type and period you care about. A 50-entity test set gives confidence intervals so wide that the numbers are not decision-grade.
Why does my model score well on the test set but poorly in production?
Usually the test set is not representative: it shares OCR quality, genre or period with the training data but not with the full corpus. Stratify your gold set to mirror the real distribution of your sources.
How do I evaluate entity linking, not just span detection?
Score linking separately from recognition. Measure whether a correctly-spanned entity was linked to the right authority identifier, and report that as its own accuracy figure so a good NER score does not mask poor disambiguation.
Can I trust an LLM to evaluate my NER output?
Only with caution. An LLM judge can speed triage, but it makes anachronism and hallucination errors of its own. Anchor any automated evaluation to a human-adjudicated gold sample and report agreement between them.