Best Practices for Evaluating Historical NER Quality

Evaluate historical NER quality by reporting precision, recall and F1 broken down per entity type and per source period, never as a single aggregate, and by scoring both strict and relaxed span matching so that boundary errors and missed entities are visible separately. A global F1 of 0.9 can hide an entity class sitting at 0.4 recall in one century. This guide covers the metrics, the test-set design, and a reproducible checklist that together make an evaluation defensible to a reviewer.

Which metrics actually matter?

The base trio is precision, recall and F1, but the unit of reporting is what counts. Always disaggregate:

text

            strict-F1   relaxed-F1   recall
PER  1600s    0.71        0.88        0.74
PER  1800s    0.93        0.96        0.95
ORG  1600s    0.42        0.61        0.39   <-- the real problem
ORG  1800s    0.80        0.88        0.82

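A pooled aggregate over that table might read 0.84 and look healthy, because the frequent, well-recognised classes contribute most of the entities being counted. The breakdown shows that 17th-century organisations are barely usable. A single number is not an evaluation; it is a way of not looking.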
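To make the masking effect concrete, here is a minimal sketch that scores each (type, period) stratum separately and then pools the same counts. The counts are invented for illustration and are not the figures behind the table above; `prf`, `per_stratum` and `pooled` are hypothetical helper names.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def per_stratum(counts):
    """counts maps (label, period) -> (tp, fp, fn); returns scores per stratum."""
    return {key: prf(*c) for key, c in counts.items()}

def pooled(counts):
    """Pool the counts of all strata into one score, as a global F1 would."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)

# Hypothetical counts: one large, easy stratum and one small, hard one.
counts = {
    ("PER", "1800s"): (950, 50, 50),
    ("ORG", "1600s"): (20, 30, 30),
}

by_stratum = per_stratum(counts)
overall = pooled(counts)
```

With these invented counts the pooled F1 stays above 0.9 while the ORG 1600s stratum sits at 0.4, which is the pattern the breakdown is meant to expose.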

Strict or relaxed span matching?

Report both, because they answer different questions.

Mode	Counts a hit when	Tells you
Strict	label matches and boundaries match exactly	are spans precise enough for linking?
Relaxed	label matches and spans overlap at all	did the model find the entity at all?

OCR noise produces off-by-one boundaries (Edw ard, trailing punctuation), so strict scores suffer even when the entity was effectively found. A large strict-vs-relaxed gap points straight at a boundary or tokenisation problem rather than a recognition one.

python

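def overlap(a, b):
    # Spans are treated as half-open [start, end) offsets, so spans that
    # merely touch (a ends where b starts) do not count as overlapping.
    return not (a["end"] <= b["start"] or b["end"] <= a["start"])

def relaxed_match(pred, gold):
    # A relaxed hit needs the same label plus any overlap at all.
    return pred["label"] == gold["label"] and overlap(pred, gold)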
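The relaxed matcher above can be paired with a strict one and fed into a single scoring routine, so both figures come from the same predictions. The following is a sketch under stated assumptions: spans are dicts with `start`, `end` and `label`, offsets are half-open, and each gold span may be matched at most once by a simple greedy pass. `strict_match` and `score` are hypothetical names, and other toolkits resolve competing matches differently.

```python
def overlap(a, b):
    return not (a["end"] <= b["start"] or b["end"] <= a["start"])

def relaxed_match(pred, gold):
    return pred["label"] == gold["label"] and overlap(pred, gold)

def strict_match(pred, gold):
    # Exact boundaries and the same label.
    return (pred["label"] == gold["label"]
            and pred["start"] == gold["start"]
            and pred["end"] == gold["end"])

def score(preds, golds, match):
    """Greedy one-to-one matching; returns (precision, recall, f1)."""
    used = set()  # indices of gold spans already matched
    tp = 0
    for p in preds:
        for i, g in enumerate(golds):
            if i not in used and match(p, g):
                used.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: the first prediction swallows a trailing comma.
golds = [{"start": 0, "end": 6, "label": "PER"},
         {"start": 10, "end": 20, "label": "ORG"}]
preds = [{"start": 0, "end": 7, "label": "PER"},
         {"start": 10, "end": 20, "label": "ORG"}]

strict = score(preds, golds, strict_match)
relaxed = score(preds, golds, relaxed_match)
gap = relaxed[2] - strict[2]
```

Here the single off-by-one boundary halves the strict score while the relaxed score is unaffected, so the whole gap is attributable to a boundary error.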

How big and how representative must the gold set be?

Two separate requirements:

Size: enough to estimate per-class recall — typically a few hundred annotated entities per type and period you care about. Fifty entities give confidence intervals too wide to act on.
Representativeness: the gold set must mirror the real distribution of OCR quality, genre and period in your corpus. A test set drawn only from clean, recent pages will flatter a model that fails on the hard 17th-century material.
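The size requirement can be checked numerically. As a sketch, the Wilson score interval below gives an approximate 95% confidence interval for a recall estimate; choosing Wilson rather than another interval is an assumption made here, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate confidence interval for a proportion such as recall.

    hits: gold entities the model found; n: gold entities in the stratum.
    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same observed recall of 0.70 at two gold-set sizes.
small = wilson_interval(35, 50)
large = wilson_interval(210, 300)
```

At an observed recall of 0.70, the interval runs from roughly 0.56 to 0.81 with 50 entities and from roughly 0.65 to 0.75 with 300, which is why the smaller set cannot separate a usable class from a failing one.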

This is the most common reason a model "scores well but fails in production": the test set quietly resembled the training data, not the corpus.
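One way to keep the gold set from drifting towards the easy pages is to draw it by stratum in proportion to the corpus. The sketch below assumes each page is a dict carrying `period` and `ocr` fields; those field names and the `stratified_sample` helper are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(pages, n_total, key, seed=0):
    """Draw about n_total pages so each stratum keeps its corpus share."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for page in pages:
        strata[key(page)].append(page)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        # Proportional quota, with a floor of one page per stratum.
        quota = max(1, round(n_total * len(members) / len(pages)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical corpus: 20 noisy 1600s pages and 80 clean 1800s pages.
pages = [{"id": i,
          "period": "1600s" if i < 20 else "1800s",
          "ocr": "noisy" if i < 20 else "clean"}
         for i in range(100)]

gold_pages = stratified_sample(pages, 10, key=lambda p: (p["period"], p["ocr"]))
```

Because quotas are rounded and floored at one, the total can differ slightly from `n_total`. Rare strata that matter may need a higher floor than their corpus share to reach the per-class sizes discussed above.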

How do you evaluate entity linking separately?

Recognition and linking are different tasks and need different scores. A model can span John Smith perfectly and link it to the wrong John Smith. Measure linking on the subset of correctly-recognised entities:

python

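def linking_accuracy(pairs):
    # Each pair holds span_correct (bool), pred_qid and gold_qid.
    # Only correctly recognised spans enter the denominator.
    correct = sum(1 for p in pairs
                  if p["span_correct"] and p["pred_qid"] == p["gold_qid"])
    spanned = sum(1 for p in pairs if p["span_correct"])
    return correct / spanned if spanned else 0.0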

Report this alongside NER F1 so a strong recogniser cannot mask weak disambiguation.
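A short usage sketch shows how the two figures sit side by side. The identifiers below are placeholders, not real authority records, and `span_hit_rate` is a hypothetical name for the share of gold entities whose span was recognised.

```python
def linking_accuracy(pairs):
    correct = sum(1 for p in pairs
                  if p["span_correct"] and p["pred_qid"] == p["gold_qid"])
    spanned = sum(1 for p in pairs if p["span_correct"])
    return correct / spanned if spanned else 0.0

# Hypothetical aligned pairs, one per gold entity.
pairs = [
    {"span_correct": True, "pred_qid": "Q1", "gold_qid": "Q1"},
    # Right span, wrong John Smith.
    {"span_correct": True, "pred_qid": "Q2", "gold_qid": "Q3"},
    # Span missed, so this pair is excluded from linking accuracy.
    {"span_correct": False, "pred_qid": "Q4", "gold_qid": "Q4"},
]

span_hit_rate = sum(1 for p in pairs if p["span_correct"]) / len(pairs)
link_acc = linking_accuracy(pairs)
report = {"span_hit_rate": span_hit_rate, "linking_accuracy": link_acc}
```

Keeping both numbers in one report makes it obvious when recognition succeeds but half of the recognised entities point at the wrong record.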

Can an LLM judge your output?

It can triage, not adjudicate. An LLM judge speeds the first pass over large outputs but introduces its own anachronism and hallucination errors. Anchor any automated scoring to a human-adjudicated gold sample, and report the agreement between the LLM and the humans so readers can weigh the automated figures honestly.
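Agreement between the LLM judge and the human adjudicators can be reported as raw agreement plus a chance-corrected figure. The sketch below uses Cohen's kappa, which is one common choice and an assumption here rather than something the guide prescribes. The verdict labels are invented for illustration.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on the same eight predictions.
human = ["ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok"]
llm   = ["ok", "ok", "bad", "ok",  "ok", "bad", "ok", "bad"]

raw_agreement = sum(1 for h, m in zip(human, llm) if h == m) / len(human)
kappa = cohen_kappa(human, llm)
```

In this invented sample raw agreement is 0.75 but kappa is only about 0.47, which shows why raw agreement alone can overstate how far the automated figures can be trusted.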

What does a defensible evaluation record contain?

Treat the evaluation itself as a reproducible artefact:

text

eval/
  gold_v4.jsonl          # stratified by period + type, with offsets
  scores_per_class.csv   # strict/relaxed P, R, F1 by type and decade
  linking_accuracy.txt
  model.txt              # name, version, date
  notes.md               # known weak classes, sampling method

Anyone reading it should be able to reconstruct exactly what was measured, on what, and how.
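As a sketch of how `scores_per_class.csv` might be produced, the snippet below writes one row per type, decade and matching mode. The column names, the `write_scores` helper and the row values are all assumptions made for illustration; the demo writes into a temporary directory rather than a real `eval/` folder.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column layout for scores_per_class.csv.
FIELDS = ["label", "decade", "mode", "precision", "recall", "f1", "n_gold"]

def write_scores(eval_dir, rows):
    """Write per-class scores to <eval_dir>/scores_per_class.csv."""
    eval_dir = Path(eval_dir)
    eval_dir.mkdir(parents=True, exist_ok=True)
    out = eval_dir / "scores_per_class.csv"
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return out

# Hypothetical rows; real values come from your own scoring run.
rows = [
    {"label": "ORG", "decade": "1640", "mode": "strict",
     "precision": 0.45, "recall": 0.39, "f1": 0.42, "n_gold": 310},
    {"label": "ORG", "decade": "1640", "mode": "relaxed",
     "precision": 0.64, "recall": 0.58, "f1": 0.61, "n_gold": 310},
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_scores(Path(tmp) / "eval", rows)
    with path.open(newline="", encoding="utf-8") as fh:
        read_back = list(csv.DictReader(fh))
```

Recording `n_gold` next to each score lets a reader judge how much weight a given row can bear.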

What is the reviewer-ready checklist?

Before you call quality acceptable, confirm that:

Per-type and per-period scores are reported.
Strict and relaxed scores are both shown.
The gold set is stratified to match the corpus and large enough for per-class recall.
Linking is scored separately.
Weak classes are named explicitly rather than averaged away.
The whole evaluation is versioned and reproducible.

Key Takeaways

Report precision, recall and F1 per entity type and per period; aggregates hide class collapse.
Score both strict and relaxed span matching — the gap diagnoses boundary versus recognition errors.
Size the gold set for per-class recall and stratify it to mirror the real corpus distribution.
A model that scores well but fails in production usually had an unrepresentative test set.
Evaluate entity linking separately from span detection so good NER does not mask poor disambiguation.
Use LLM judges only to triage, anchored to a human gold sample with reported agreement.
Version the evaluation as a reproducible artefact and name weak classes explicitly.

Frequently Asked Questions

What metrics should I report for historical NER?

Report precision, recall and F1 per entity type and per source period, not just a single aggregate. Historical performance varies sharply by century and genre, and a global F1 hides classes that have collapsed to near-zero recall.

Should I use strict or relaxed span matching?

Report both. Strict (exact-boundary) matching is honest but punishes off-by-one boundaries common with OCR noise; relaxed (overlap) matching shows whether the model found the entity at all. The gap between them is itself diagnostic.

How large should the gold test set be?

Large enough to estimate per-class recall, typically a few hundred annotated entities per type and period you care about. A 50-entity test set gives confidence intervals so wide that the numbers are not decision-grade.

Why does my model score well on the test set but poorly in production?

Usually the test set is not representative: it shares OCR quality, genre or period with the training data but not with the full corpus. Stratify your gold set to mirror the real distribution of your sources.

How do I evaluate entity linking, not just span detection?

Score linking separately from recognition. Measure whether a correctly-spanned entity was linked to the right authority identifier, and report that as its own accuracy figure so a good NER score does not mask poor disambiguation.

Can I trust an LLM to evaluate my NER output?

Only with caution. An LLM judge can speed triage, but it makes anachronism and hallucination errors of its own. Anchor any automated evaluation to a human-adjudicated gold sample and report agreement between them.
