Resolving Coreference in Historical Text: A Practical Guide

Resolving coreference in historical text means linking every mention that points to the same entity — proper names, pronouns, and phrases like "the said widow" — into a single chain, so you can follow who "he" or "she" refers to across a passage. The practical workflow is: run a neural coreference model for a baseline, layer rules for archaic legal anaphora such as "the said" and "aforesaid", merge the chains, then validate against a hand-annotated sample. Coreference is the step that turns scattered mentions into a coherent account of a person within a document.

Without it, your extracted "John" and the "he" who later "did bequeath his lands" sit in your data as unrelated fragments. Coreference stitches them together — and historical text fights you at every turn.

What exactly does coreference resolution produce?

It produces chains: clusters of mentions that all refer to one entity. In "John Pell came before the court. He confessed that the said Pell owed forty shillings", the chain is {John Pell, He, the said Pell}. Pronouns and legal back-references collapse into the named entity, and downstream tools can treat them as one person.
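
One way to picture a chain in code is as a list of mentions with character offsets. The sketch below is illustrative only: the `Mention` class, the `make_mention` helper, and the `person_1` key are hypothetical names chosen for this example, not the output format of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    text: str


def make_mention(doc_text, phrase, search_from=0):
    """Locate the first occurrence of `phrase` and wrap it as a Mention."""
    start = doc_text.index(phrase, search_from)
    return Mention(start, start + len(phrase), phrase)


passage = ("John Pell came before the court. "
           "He confessed that the said Pell owed forty shillings.")

# One chain per entity; mentions are listed in text order.
pell_chain = [
    make_mention(passage, "John Pell"),
    make_mention(passage, "He"),
    make_mention(passage, "the said Pell"),
]
chains = {"person_1": pell_chain}

print([m.text for m in chains["person_1"]])
# ['John Pell', 'He', 'the said Pell']
```

Keeping offsets alongside the text lets later steps check whether a given phrase is already covered by some chain.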

Why is historical text so hard for coreference?

Several pressures stack up:

Formulaic anaphora: "the said", "aforesaid", and "the foresaid party" replace pronouns in legal records, and modern models do not know these conventions.
Long, run-on sentences that push antecedents far from their references.
Sparse or ambiguous pronouns, including gendered forms where the referent is unclear.
OCR and HTR noise that fragments names a model would otherwise anchor to.

A model trained on modern news handles "he/she/it" but has never met "the foresaid John".

How do I get a baseline with existing tools?

Start with a neural component for the easy cases, then inspect what it misses.

python

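import spacy

# Assumes spacy-experimental and its pretrained en_coreference_web_trf pipeline
# are installed. Adding an untrained "experimental_coref" pipe to
# en_core_web_trf would not produce usable clusters.
nlp = spacy.load("en_coreference_web_trf")   # or use an AllenNLP coref model

text = "John Pell came before the court. He confessed that the said Pell owed forty shillings."
doc = nlp(text)

# Clusters are stored as span groups whose keys start with "coref".
for key, spans in doc.spans.items():
    if key.startswith("coref"):
        print([span.text for span in spans])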

Treat this output as a draft. On clean modern-style passages it does well; on a 1660 deed it will leave "the said tenant" stranded.
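
To see what the baseline misses, one option is to scan the text for back-reference phrases and list those that no chain covers. The following is a minimal sketch under stated assumptions: chains are represented as lists of `(start, end)` character offsets, and `stranded_backrefs` and `BACKREF` are hypothetical names. The pattern also matches the verb "said" ("he said that"), so treat the hits as candidates to review rather than confirmed errors.

```python
import re

# Matches "the said tenant", "aforesaid John", "the foresaid party", etc.
BACKREF = re.compile(r"\b(?:the\s+)?(?:afore|fore)?said\s+\w+", re.IGNORECASE)


def stranded_backrefs(text, chains):
    """Return back-reference phrases that no chain mention overlaps.

    chains: list of chains, each a list of (start, end) character offsets.
    """
    covered = [span for chain in chains for span in chain]
    missed = []
    for m in BACKREF.finditer(text):
        overlaps = any(s < m.end() and m.start() < e for s, e in covered)
        if not overlaps:
            missed.append(m.group(0))
    return missed


deed = ("Thomas Hale holds the close of the manor. "
        "The said tenant shall pay two shillings yearly.")

# Suppose the baseline found only the name "Thomas Hale" (offsets 0-11).
baseline_chains = [[(0, 11)]]
print(stranded_backrefs(deed, baseline_chains))
# ['The said tenant']
```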

What rules recover the links models miss?

A small, targeted rule set is the highest-value addition for administrative and legal sources. The single most productive rule binds explicit back-references to their nearest matching antecedent:

python

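import re

# Back-reference markers: "said", "aforesaid", "foresaid".
SAID_RE = re.compile(r"said|aforesaid|foresaid")

def link_said(tokens, chains):
    # nearest_prior_mention() and chains.merge() are placeholders for your own
    # antecedent search and chain store; they are not library calls.
    # tokens are assumed to be spaCy tokens (they expose .lower_).
    for i, tok in enumerate(tokens[:-1]):   # a marker needs a following noun
        if SAID_RE.fullmatch(tok.lower_):
            head = tokens[i + 1]            # the noun after "said"
            antecedent = nearest_prior_mention(head, tokens[:i], chains)
            if antecedent:
                chains.merge(head, antecedent)
    return chains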
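
The rule above leaves `nearest_prior_mention` and `chains.merge` undefined. A self-contained version might look like the sketch below, which assumes tokens are plain strings and chains are sets of token indices. `link_said_simple` is a hypothetical name, and matching the antecedent by identical spelling is a simplification that OCR noise and variant spellings will defeat.

```python
import re

SAID = re.compile(r"(?:afore|fore)?said", re.IGNORECASE)


def link_said_simple(tokens, chains):
    """Attach the noun after a 'said' marker to the chain that holds the
    nearest earlier token with the same spelling.

    tokens: list of word strings.
    chains: list of sets of token indices, one set per entity.
    Returns new chains; the input list is left unchanged.
    """
    chains = [set(c) for c in chains]          # work on copies
    for i in range(len(tokens) - 1):
        if not SAID.fullmatch(tokens[i]):
            continue
        head = i + 1
        # Walk backwards to the nearest matching token that is in a chain.
        for j in range(i - 1, -1, -1):
            if tokens[j].lower() != tokens[head].lower():
                continue
            target = next((c for c in chains if j in c), None)
            if target is None:
                continue                        # same word, but not a known mention
            source = next((c for c in chains if head in c), None)
            if source is None:
                target.add(head)
            elif source is not target:
                target |= source                # merge the two chains
                chains = [c for c in chains if c is not source]
            break
    return chains


tokens = ("John Pell came before the court . He confessed that "
          "the said Pell owed forty shillings .").split()
baseline = [{0, 1, 7}]                          # John, Pell, He
result = link_said_simple(tokens, baseline)
print([sorted(c) for c in result])
# [[0, 1, 7, 12]]
```

Because the rule fires on any token spelled "said", the verb in "he said Pell owed" would also trigger it; a part-of-speech check is one possible guard.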

Add companions for kinship anaphora ("his wife", "her late husband") that bind to the most recent person of the right gender. These rules typically recover a large fraction of the deed-and-will links neural models drop.
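
A kinship companion rule could be sketched as follows. It assumes person mentions already carry a gender label (from annotation or a name list), which is a strong assumption for historical sources. `link_kinship`, `KINSHIP`, and the short kin-term list are hypothetical and would need extending for a real corpus.

```python
import re

KINSHIP = re.compile(
    r"\b(his|her)\s+(?:late\s+)?(wife|husband|son|daughter|brother|sister)\b",
    re.IGNORECASE,
)
# The possessive pronoun tells us the gender of the person it points back to.
POSSESSOR_GENDER = {"his": "m", "her": "f"}


def link_kinship(text, persons):
    """Bind the possessor in phrases like 'his wife' to the most recent
    earlier person of the matching gender.

    persons: list of (start_offset, name, gender), sorted by start_offset.
    Returns a list of (kinship phrase, possessor name) pairs.
    """
    links = []
    for m in KINSHIP.finditer(text):
        wanted = POSSESSOR_GENDER[m.group(1).lower()]
        earlier = [p for p in persons if p[0] < m.start() and p[2] == wanted]
        if earlier:
            links.append((m.group(0), earlier[-1][1]))
    return links


will = ("Agnes Cole, widow, and Robert Cole came before the court. "
        "Her late husband held the tenement.")
persons = [
    (will.index("Agnes Cole"), "Agnes Cole", "f"),
    (will.index("Robert Cole"), "Robert Cole", "m"),
]
print(link_kinship(will, persons))
# [('Her late husband', 'Agnes Cole')]
```

Robert Cole is the nearest person mention, but the gender filter skips him and binds "Her" to Agnes Cole. The kinship phrase itself ("her late husband") introduces a different person, so it should start or join a separate chain.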

How does coreference fit with disambiguation?

They are different scopes, and order matters:

Step	Scope	Output
Coreference	within one document	mention chains per entity
Disambiguation	across documents	person records linked to real individuals

Resolve coreference first so each document contributes clean chains; then disambiguation decides whether the "John Pell" chain in this deed is the same man as the "John Pell" chain in another. Skipping coreference means feeding disambiguation a noisier signal — isolated pronouns it cannot place.
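
As a sketch of the hand-off, each chain can be collapsed into one record per document before any cross-document matching. The record layout, the `chain_to_record` helper, the `deed_017` identifier, and the "longest name-like mention" heuristic are all assumptions made for illustration.

```python
# Words that mark a mention as a pronoun or back-reference, not a name.
FUNCTION_WORDS = {
    "he", "she", "it", "they", "him", "her", "his", "their",
    "the", "said", "aforesaid", "foresaid",
}


def is_name_like(mention):
    """True when every word is capitalised and none is a function word."""
    words = mention.split()
    return bool(words) and all(
        w[0].isupper() and w.lower() not in FUNCTION_WORDS for w in words
    )


def chain_to_record(doc_id, chain_id, mentions):
    """Collapse one chain into a record a cross-document linker can compare."""
    names = [m for m in mentions if is_name_like(m)]
    canonical = max(names, key=len) if names else None
    return {
        "doc_id": doc_id,
        "chain_id": chain_id,
        "canonical_name": canonical,
        "mentions": list(mentions),
    }


record = chain_to_record("deed_017", 0, ["John Pell", "He", "the said Pell"])
print(record["canonical_name"])
# John Pell
```

A chain with no name-like mention yields `canonical_name = None`, which flags it as something disambiguation cannot place on its own.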

How do I know it is working?

Annotate a sample of documents with gold chains and score with the standard clustering metrics — MUC, B-cubed, and CEAF, usually reported as the CoNLL average. But do not stop at the number: read the errors. The metrics will not tell you that every "aforesaid" is being missed, whereas five minutes of qualitative inspection will, and that single pattern may dominate your error budget.
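
To make one of these metrics concrete, here is a minimal B-cubed sketch. It assumes gold and predicted chains cover exactly the same mentions, with unlinked mentions as singleton sets. Real scorers also handle mismatched mention sets, so use an established scorer for reported numbers. `b_cubed` is a hypothetical helper name.

```python
def b_cubed(gold, pred):
    """B-cubed precision, recall, and F1.

    gold, pred: lists of sets of mention ids covering the same mentions.
    """
    gold_of = {m: c for c in gold for m in c}
    pred_of = {m: c for c in pred for m in c}
    if not gold_of or gold_of.keys() != pred_of.keys():
        raise ValueError("gold and pred must cover the same non-empty mention set")
    n = len(gold_of)
    # Per-mention overlap between its gold chain and its predicted chain.
    precision = sum(len(gold_of[m] & pred_of[m]) / len(pred_of[m]) for m in gold_of) / n
    recall = sum(len(gold_of[m] & pred_of[m]) / len(gold_of[m]) for m in gold_of) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [{"John Pell", "He", "the said Pell"}, {"the court"}]
# The system left "the said Pell" stranded in its own chain.
pred = [{"John Pell", "He"}, {"the said Pell"}, {"the court"}]

p, r, f = b_cubed(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
# 1.0 0.667 0.8
```

In this toy case, the missed "the said" link leaves precision at 1.0 and lowers only recall, and the score alone does not say which pattern caused the drop.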

Key Takeaways

Coreference links names, pronouns, and back-references into per-entity chains.
Historical legal text needs rules for "the said" and "aforesaid" anaphora.
Use a neural model for a baseline, then add targeted rules for what it misses.
Resolve coreference within documents before disambiguating across them.
Kinship phrases like "his wife" need gender-aware antecedent rules.
Score with MUC, B-cubed, and CEAF, but also inspect errors qualitatively.
OCR/HTR noise degrades coreference, so clean text pays off here too.

Frequently Asked Questions

What is coreference resolution in historical text?

It is linking every mention that refers to the same entity — a name, a pronoun, "the said John", "his wife" — into one chain. It tells you who "he" is three sentences after the name was last used.

Why is coreference harder on historical sources?

Long sentences, formulaic legal phrasing like "the aforesaid", sparse pronouns, gendered ambiguity, and OCR noise all break modern coreference models, which were trained on contemporary prose.

Do modern neural coreference models work on old text?

Partially. Tools like the spaCy coreference component or AllenNLP give a starting point, but recall drops on archaic constructions, where back-references go unlinked. Expect to add rules for legal anaphora such as "the said" and "aforesaid".

How does coreference relate to entity disambiguation?

Coreference links mentions within a document into chains; disambiguation links entities across documents to real individuals. You usually resolve coreference first, then feed the chains into disambiguation.

What rule helps most on legal and administrative text?

A rule that binds "the said X", "aforesaid X", and "the foresaid" back to the most recent matching antecedent recovers a large share of links that neural models miss in deeds, wills, and court rolls.

How do I evaluate coreference quality?

Use standard clustering metrics — MUC, B-cubed, and CEAF, often reported as the CoNLL average — against a hand-annotated sample. Inspect errors qualitatively too, since the metrics hide systematic anaphora failures.
