How to Evaluate NLP on Historical Text

To evaluate NLP on historical text, build a small in-domain gold set drawn from your own noisy sources, score the tool on the same kind of text it will run on in production, and report per-class precision, recall and F1 rather than a single accuracy number. Published benchmark scores measured on clean modern text are nearly worthless for predicting behaviour on variably spelled, OCR-degraded historical material. The whole craft is constructing a realistic gold set and reading the per-class numbers honestly.

Why are standard benchmarks misleading?

A POS tagger advertised at 97 percent accuracy was almost certainly measured on modern newswire. On a seventeenth-century pamphlet with long-s, erratic capitalisation and OCR garble, the same tagger may drop to the low 80s or worse, and the failures cluster exactly on the archaic forms you care about. The headline number hides this. You cannot borrow someone else's evaluation; you must run your own on your own data.

Step 1: build a realistic gold set

Sample 200 to 500 items from the actual collection, including the messy pages, not just the clean ones. Annotate them by hand against clear guidelines. If two people annotate, measure inter-annotator agreement (Cohen's kappa above 0.8 is a reasonable bar) before trusting the gold.

python

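# gold.jsonl: one item per line, drawn from REAL production pages.
# Spans are [start, end, label] with Python-style end-exclusive offsets:
# {"text": "Iohn Smyth of Yorke", "spans": [[0,10,"PERSON"],[14,19,"GPE"]]}
import json

# Use a context manager and skip blank lines so a trailing newline cannot crash the load.
with open("gold.jsonl", encoding="utf-8") as fh:
    gold = [json.loads(line) for line in fh if line.strip()]
print(f"{len(gold)} gold items loaded")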
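
The agreement check mentioned above can be sketched in a few lines. This is a minimal illustration, assuming both annotators labelled the same items in the same order with one categorical label per item; the function name `cohens_kappa` and the toy labels are invented for the example. Span annotations would first need projecting onto tokens.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical): the annotators disagree on one of eight tokens.
ann_a = ["PERSON", "GPE", "O", "O", "PERSON", "O", "GPE", "O"]
ann_b = ["PERSON", "GPE", "O", "O", "O", "O", "GPE", "O"]
print(round(cohens_kappa(ann_a, ann_b), 3))
```

Note that a single disagreement in eight items already pulls kappa below the 0.8 bar here, because the dominant "O" label makes chance agreement high.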

Step 2: match evaluation text to production text

Score on the same representation you deploy on. If production ingests raw OCR, your gold text must be raw OCR. Evaluating a search-indexing pipeline on hand-cleaned transcriptions produces an optimistic number that evaporates the moment real documents arrive. This single mismatch is the most common reason "evaluated" pipelines disappoint.
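
One cheap guard against this mismatch is to check that every gold text occurs verbatim in the production input. The sketch below assumes production pages are available as plain strings; `find_cleaned_items` and the sample strings are hypothetical names and data for illustration only.

```python
def find_cleaned_items(gold_texts, production_pages):
    """Return gold texts that do not occur verbatim in any production page."""
    return [
        text
        for text in gold_texts
        if not any(text in page for page in production_pages)
    ]


# Toy data: the second gold item was silently modernised by a transcriber.
pages = ["Iohn Smyth of Yorke, yeoman, did fweare"]
gold_texts = ["Iohn Smyth of Yorke", "John Smith of York"]
print(find_cleaned_items(gold_texts, pages))
```

Any item this returns was edited somewhere between the archive and the gold file, and should be re-extracted from the raw source before it is scored.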

Step 3: choose metrics that expose the failures

Task	Primary metric	Watch out for
POS tagging	Per-tag accuracy	Aggregate hides rare-tag collapse
NER	Span-level P/R/F1	Partial overlaps, boundary errors
Normalisation	Word accuracy + tokens-made-worse	Over-correction
Lemmatisation	Lemma accuracy	Ambiguous forms
OCR feeding NLP	CER/WER upstream	Error compounds downstream
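
As an illustration of the normalisation row, here is a rough sketch of word accuracy alongside a tokens-made-worse count. It assumes three token-aligned lists and defines "made worse" as a token that already matched the gold form before the system changed it; that definition, the function name and the toy data are assumptions made for the example.

```python
def normalisation_report(original, predicted, gold):
    """Word accuracy plus counts of tokens the normaliser fixed or broke.

    original[i] is the historical form, predicted[i] the system output,
    gold[i] the reference normalisation.
    """
    if not (len(original) == len(predicted) == len(gold)):
        raise ValueError("token lists must be aligned and equal in length")
    n = len(gold)
    rows = list(zip(original, predicted, gold))
    correct = sum(p == g for _, p, g in rows)
    # Made worse: the input was already right and the system changed it.
    worse = sum(o == g and p != g for o, p, g in rows)
    # Fixed: the input was wrong and the system produced the gold form.
    fixed = sum(o != g and p == g for o, p, g in rows)
    return {
        "word_accuracy": correct / n if n else 0.0,
        "made_worse": worse,
        "fixed": fixed,
    }


# Toy data: the gold keeps the surname "Smyth", so "Smith" is an over-correction.
original = ["Iohn", "Smyth", "of", "Yorke", "did", "sweare"]
predicted = ["John", "Smith", "off", "York", "did", "swear"]
gold_norm = ["John", "Smyth", "of", "York", "did", "swear"]
report = normalisation_report(original, predicted, gold_norm)
print(report)
```

Reporting the two counts next to accuracy shows whether a normaliser earns its gains or simply trades old errors for new ones.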

Always report per-class precision and recall. Accuracy on imbalanced historical data flatters the model: a tagger that ignores a rare class entirely can still post a high aggregate score.
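
The effect is easy to reproduce on toy data. The following sketch assumes token-aligned gold and predicted tag lists; the tag name `PRON_ARCH` is a made-up stand-in for a rare archaic class, and `per_class_prf` is an illustrative helper, not a library function.

```python
from collections import Counter


def per_class_prf(gold_tags, pred_tags):
    """Return {tag: (precision, recall, f1)} for token-aligned tag lists."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted tag lists must be the same length")
    tp, pred_n, gold_n = Counter(), Counter(), Counter()
    for g, p in zip(gold_tags, pred_tags):
        gold_n[g] += 1
        pred_n[p] += 1
        if g == p:
            tp[g] += 1
    scores = {}
    for tag in sorted(set(gold_n) | set(pred_n)):
        prec = tp[tag] / pred_n[tag] if pred_n[tag] else 0.0
        rec = tp[tag] / gold_n[tag] if gold_n[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        scores[tag] = (prec, rec, f1)
    return scores


# Toy data: 18 common tokens, 2 rare ones the tagger never predicts.
gold_tags = ["NOUN"] * 18 + ["PRON_ARCH"] * 2
pred_tags = ["NOUN"] * 20
accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
scores = per_class_prf(gold_tags, pred_tags)
print(f"accuracy = {accuracy:.2f}")
for tag, (p, r, f) in scores.items():
    print(f"{tag:10s} P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Aggregate accuracy comes out at 0.90 while the rare class scores zero on every measure, which is exactly the collapse the per-class view is meant to surface.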

How do I score spans correctly for NER?

Use span-level matching, and decide explicitly whether you count partial overlaps. Strict exact-match is the honest default for scholarly work, but report both if boundaries are genuinely fuzzy:

python

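def prf(pred, gold):
    """Strict span-level precision, recall and F1 over two sets of spans."""
    tp = len(pred & gold)  # true positives: exact (start, end, label) matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return round(p, 3), round(r, 3), round(f, 3)

# pred / gold are sets of (start, end, label) tuples with end-exclusive offsets
print(prf({(0, 10, "PERSON")}, {(0, 10, "PERSON"), (14, 19, "GPE")}))  # (1.0, 0.5, 0.667)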
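
For the relaxed figure, one possible convention is to count any same-label overlap as a match. The sketch below assumes end-exclusive character offsets and is only one of several partial-match schemes; `overlaps` and `relaxed_prf` are illustrative names.

```python
def overlaps(a, b):
    """True when two (start, end, label) spans share a label and a character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def relaxed_prf(pred, gold):
    """Precision, recall and F1 where any same-label overlap counts as a match."""
    # A predicted span is correct if it overlaps some gold span ...
    pred_hits = sum(any(overlaps(p, g) for g in gold) for p in pred)
    # ... and a gold span is found if some predicted span overlaps it.
    gold_hits = sum(any(overlaps(p, g) for p in pred) for g in gold)
    precision = pred_hits / len(pred) if pred else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return round(precision, 3), round(recall, 3), round(f1, 3)


# Toy case: the system found only "Smyth" instead of "Iohn Smyth".
gold_spans = {(0, 10, "PERSON"), (14, 19, "GPE")}
pred_spans = {(5, 10, "PERSON"), (14, 19, "GPE")}
strict_tp = len(pred_spans & gold_spans)
print("strict true positives:", strict_tp)
print("relaxed P/R/F1:", relaxed_prf(pred_spans, gold_spans))
```

The gap between the strict and relaxed numbers is itself informative: a large gap points to boundary errors, not missed entities.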

How do I keep the evaluation reproducible?

Freeze and version four things: the gold file, the model weights (by hash), the library versions, and the random seed. Commit the gold set and the scoring script together so a colleague, or you in two years, can rerun them and get identical numbers. An evaluation you cannot reproduce is an anecdote, not evidence, and reviewers increasingly ask for the script.
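
One way to record those four things is a small manifest written next to the scores. This is a sketch using only the standard library; the file names, the package list and the function names are placeholders for illustration, and `random.seed` covers only Python's own generator, so any other library in the pipeline would need its own seeding call.

```python
import hashlib
import json
import platform
import random
import tempfile
from importlib import metadata
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 of a file, read in chunks so large weight files stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(gold_path, weights_path, packages, seed):
    """Collect the frozen ingredients into one JSON-serialisable record."""
    random.seed(seed)  # seeds only the stdlib RNG
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "gold_sha256": sha256_of(gold_path),
        "weights_sha256": sha256_of(weights_path),
        "python": platform.python_version(),
        "packages": versions,
        "seed": seed,
    }


# Demo with stand-in files; real paths would point at the gold set and weights.
with tempfile.TemporaryDirectory() as tmp:
    gold_path = Path(tmp) / "gold.jsonl"
    weights_path = Path(tmp) / "model.bin"
    gold_path.write_text('{"text": "Iohn Smyth of Yorke", "spans": []}\n', encoding="utf-8")
    weights_path.write_bytes(b"stand-in weights")
    manifest = build_manifest(gold_path, weights_path, ["spacy"], seed=13)

print(json.dumps(manifest, indent=2))
```

Committing this record with each set of results makes it obvious when a later rerun used a different gold file or model.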

What baseline tells me the model is worth it?

Always compare against a trivial baseline: a most-frequent-tag tagger, a gazetteer lookup, or a regex. If your transformer cannot clearly beat that on your gold set, the added cost, opacity and maintenance burden are not justified for your collection. Only a meaningful margin over that baseline, measured on realistic data, should make you confident.
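
A most-frequent-tag baseline takes only a few lines. The sketch below assumes a small set of tagged training tokens kept separate from the gold set; the function names and the toy data are invented for the example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_tokens):
    """Learn each word's most frequent tag, plus a global fallback tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_tokens:
        per_word[word.lower()][tag] += 1
        overall[tag] += 1
    if not overall:
        raise ValueError("training data is empty")
    lookup = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    fallback = overall.most_common(1)[0][0]
    return lookup, fallback


def baseline_tag(words, lookup, fallback):
    """Tag known words by lookup and unknown words with the fallback tag."""
    return [lookup.get(w.lower(), fallback) for w in words]


# Toy training data (hypothetical), disjoint from the gold evaluation set.
train = [
    ("the", "DET"), ("kinge", "NOUN"), ("doth", "VERB"), ("the", "DET"),
    ("lawe", "NOUN"), ("doth", "VERB"), ("kinge", "NOUN"),
]
lookup, fallback = train_most_frequent_tag(train)
baseline_pred = baseline_tag(["The", "kinge", "doth", "ryde"], lookup, fallback)
print(baseline_pred)
```

Score this output with the same per-class metrics as the real model, so the comparison is made on identical terms.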

Key Takeaways

Never reuse benchmark scores; build a gold set from your own noisy collection.
Annotate 200 to 500 in-domain items and check inter-annotator agreement.
Evaluate on the same text representation you deploy on (raw OCR if that is production).
Report per-class precision, recall and F1, not a single accuracy figure.
Use strict span matching for NER and be explicit about partial-overlap rules.
Freeze gold, weights, versions and seed so the evaluation is reproducible.
Compare against a trivial baseline; complexity is only justified if it clearly wins.

Frequently Asked Questions

Why can't I just reuse standard NLP benchmark scores?

Standard benchmarks measure performance on clean modern text, which tells you almost nothing about how a tool behaves on noisy, variably spelled historical sources. You must build a small in-domain gold set from your own collection and measure on that.

How big should my gold evaluation set be?

For a usable signal, hand-annotate 200 to 500 items (tokens, spans, or sentences depending on the task). That is enough to separate a clearly good model from a clearly bad one, though confidence intervals on rare classes will still be wide.
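
To see how wide those intervals get, a Wilson score interval is one reasonable choice for a proportion such as per-class recall. The sketch below assumes independent items and uses z = 1.96 for an approximate 95 percent interval; the counts are invented for illustration.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95 percent)."""
    if total == 0:
        return (0.0, 1.0)  # no evidence at all
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# Same 90 percent point estimate, very different certainty.
common = wilson_interval(450, 500)  # a frequent class
rare = wilson_interval(9, 10)       # a rare class with ten gold instances
print("frequent class:", tuple(round(x, 3) for x in common))
print("rare class:    ", tuple(round(x, 3) for x in rare))
```

With only ten gold instances, a class scoring 0.9 could plausibly sit anywhere from about 0.6 upward, which is why rare classes deserve extra annotation effort.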

Should I evaluate on clean transcriptions or noisy OCR?

Evaluate on the same kind of text you will run in production. If your pipeline ingests raw OCR, your gold set must be raw OCR too; scoring on clean text gives an optimistic number that collapses in real use.

Why is accuracy a misleading metric here?

Historical data is usually imbalanced, so a tagger can score high accuracy while failing on the rare classes you care about. Report per-class precision, recall and F1, and inspect the rare classes directly instead of trusting an aggregate.

How do I make my evaluation reproducible?

Freeze the gold set, pin model and library versions, fix random seeds, and version the gold file alongside your code. Anyone should be able to rerun your script and reproduce the same numbers exactly.

What is a fair baseline to compare against?

Use a simple, transparent baseline such as a most-frequent-tag tagger or a regex rule. If a heavy transformer cannot beat that baseline on your data, the extra complexity is not justified for your collection.
