Troubleshooting: Extract historical organisations

When you extract historical organisations and the results disappoint, the cause is almost always one of four things: organisations tagged as persons, name-over-time drift collapsed too early, whole categories of bodies (orders, regiments, guilds) never seen in training, and precision wrecked by institutional shorthand. This guide walks each symptom to its root cause and a fix you can apply today. Treat extraction as recognising a reference on the page, and keep all merging and dating for a later resolution step.

Why are organisations tagged as PER?

Many historical firms and houses are named after people: Baring Brothers; the House of Medici; Drexel, Morgan & Co. A model trained on modern news has a strong PER prior for capitalised personal names, so it mislabels them.

Two complementary fixes apply. The quick one is a post-hoc rule that promotes a PER span to ORG when it carries a known firm marker:

python

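# Firm markers; the leading space anchors each one to the start of a word.
ORG_SUFFIXES = (" & Co.", " Brothers", " Bros.", " & Sons",
                " Company", " Bank", " House of", " Society of")

def promote_to_org(span_text, label):
    """Relabel a PER span as ORG when it contains a known firm marker."""
    # Pad with a space so markers such as " House of" also match at span start.
    padded = " " + span_text.lower()
    if label == "PER" and any(s.lower() in padded for s in ORG_SUFFIXES):
        return "ORG"
    return label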

The durable fix is gold annotation: hand-label 150-200 of these constructions so the model learns the pattern rather than relying on a brittle suffix list.

How do I handle names that changed over time?

The body commonly called the East India Company appears, across two centuries, under other styles too, among them the Governor and Company of Merchants of London trading into the East Indies and the Honourable East India Company. Do not merge these at extraction. Record exactly what the page says, with character offsets, then resolve later against a versioned authority.

json

{"surface": "Honourable East India Company",
 "label": "ORG", "start": 1182, "end": 1211,
 "doc_date": "1798", "resolved": null}

Leaving resolved null at this stage is the point: extraction observes, resolution decides.
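A later resolution step might look roughly like the sketch below. The authority entries, the `org:eic` identifier scheme, the `NameSpan` and `resolve` names, and the year ranges are all illustrative assumptions, not a real authority file. The point is only that each name variant carries its own validity span and that `resolved` is filled in after extraction, never during it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameSpan:
    authority_id: str  # hypothetical identifier scheme
    name: str
    first_year: int    # placeholder validity range, not curated data
    last_year: int


# Illustrative entries only; a real authority would be curated and versioned.
AUTHORITY = [
    NameSpan("org:eic", "Governor and Company of Merchants of London "
                        "trading into the East Indies", 1600, 1708),
    NameSpan("org:eic", "East India Company", 1600, 1874),
    NameSpan("org:eic", "Honourable East India Company", 1600, 1874),
]


def resolve(record):
    """Return a copy of the record, resolved only if exactly one id matches."""
    year = int(record["doc_date"])
    surface = record["surface"].casefold()
    ids = {n.authority_id for n in AUTHORITY
           if n.name.casefold() == surface
           and n.first_year <= year <= n.last_year}
    resolved = ids.pop() if len(ids) == 1 else None
    return {**record, "resolved": resolved}


record = {"surface": "Honourable East India Company", "label": "ORG",
          "start": 1182, "end": 1211, "doc_date": "1798", "resolved": None}
resolved_record = resolve(record)
```

The extraction record itself is left untouched; `resolve` returns a new dictionary, so the observed surface form and the resolution decision stay separately auditable.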

Why are orders, regiments and guilds missed?

Statistical models almost never saw the Society of Jesus, the 42nd Regiment of Foot or the Worshipful Company of Goldsmiths. Add a curated gazetteer and run it as a recall safety net.
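As a minimal sketch of such a safety net, the snippet below combines an exact-match gazetteer with one regular expression for numbered units. The three gazetteer entries, the `REGIMENT_RE` pattern and the `find_gazetteer_spans` function are assumptions chosen for illustration; a working list would be much larger and would need spelling variants.

```python
import re

# Hypothetical, hand-curated entries; a real gazetteer would be far larger.
GAZETTEER = ["Society of Jesus", "Order of Preachers",
             "Worshipful Company of Goldsmiths"]

# Assumed pattern for numbered units such as "42nd Regiment of Foot".
REGIMENT_RE = re.compile(r"\b\d+(?:st|nd|rd|th) Regiment of (?:Foot|Horse)\b")


def find_gazetteer_spans(text):
    """Return ORG spans found by exact gazetteer lookup or the unit regex."""
    spans = []
    for name in GAZETTEER:
        for m in re.finditer(re.escape(name), text):
            spans.append({"text": m.group(), "start": m.start(),
                          "end": m.end(), "label": "ORG",
                          "source": "gazetteer"})
    for m in REGIMENT_RE.finditer(text):
        spans.append({"text": m.group(), "start": m.start(),
                      "end": m.end(), "label": "ORG", "source": "regex"})
    return sorted(spans, key=lambda s: s["start"])


sample = "The 42nd Regiment of Foot escorted members of the Society of Jesus."
hits = find_gazetteer_spans(sample)
```

Each hit records where it came from, which makes it easy to audit how much recall the gazetteer is contributing.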

Body type	Example surface forms	Why missed	Fix
Religious orders	Society of Jesus, Order of Preachers	rare in training	order gazetteer
Military units	42nd Regiment of Foot, the Grand Army	numeric + archaic	regex + unit list
Guilds	Worshipful Company of Goldsmiths	formulaic phrasing	pattern rule
Trading firms	Hudson's Bay Company	personal name in title	suffix promotion

Gazetteer matches and model predictions are then merged, with the gazetteer covering the long tail the model cannot.
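One possible merge policy is sketched here: keep every model span and add a gazetteer span only where it overlaps no model span. That tie-breaking rule, and the `overlaps` and `merge_spans` helpers, are assumptions for illustration; a project might equally prefer the gazetteer on conflict.

```python
def overlaps(a, b):
    """True when two character spans share at least one position."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_spans(model_spans, gaz_spans):
    """Keep every model span; add gazetteer spans that overlap none of them."""
    merged = list(model_spans)
    for g in gaz_spans:
        if not any(overlaps(g, m) for m in model_spans):
            merged.append(g)
    return sorted(merged, key=lambda s: s["start"])


# Toy spans with made-up offsets.
model = [{"text": "Hudson's Bay Company", "start": 4, "end": 24,
          "label": "ORG"}]
gaz = [{"text": "Hudson's Bay Company", "start": 4, "end": 24,
        "label": "ORG"},
       {"text": "Society of Jesus", "start": 40, "end": 56,
        "label": "ORG"}]
merged = merge_spans(model, gaz)
```

Here the duplicate Hudson's Bay Company hit is dropped and the Society of Jesus, which the model missed, is added.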

Should I split nested organisation names?

No. Pick the longest defensible span as canonical and store inner mentions separately only if a downstream task needs them. The phrase 'the London office of the Hudson's Bay Company' is a single organisational reference. Splitting it into two ORGs inflates counts and corrupts any network you later build.
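A small sketch of that rule follows, assuming spans are dictionaries with `start` and `end` character offsets. The `keep_outermost` helper is hypothetical and treats any span contained in an already-kept span as an inner mention.

```python
def keep_outermost(spans):
    """Split spans into canonical (outermost) and inner (contained) lists."""
    # Sort by start, then longest first, so containers precede their contents.
    ordered = sorted(spans,
                     key=lambda s: (s["start"], -(s["end"] - s["start"])))
    canonical, inner = [], []
    for s in ordered:
        contained = any(c["start"] <= s["start"] and s["end"] <= c["end"]
                        for c in canonical)
        (inner if contained else canonical).append(s)
    return canonical, inner


text = "the London office of the Hudson's Bay Company"
inner_start = text.index("Hudson's Bay Company")
spans = [
    {"text": "Hudson's Bay Company", "start": inner_start, "end": len(text)},
    {"text": text, "start": 0, "end": len(text)},
]
canonical, inner = keep_outermost(spans)
```

Only `canonical` feeds counts and networks; `inner` is kept on the side for any task that needs the embedded name.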

Why does precision collapse on legal text?

Parliamentary and legal sources are dense with institutional shorthand: the Crown, the House, the Bench, the Court. These read like organisations but are often role-references. Build a genre-specific stop list and require contextual evidence before tagging a bare definite-article body.

python

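# Bare shorthand forms that are often role-references in legal text.
GENRE_STOP = {"the crown", "the house", "the bench", "the court", "the bar"}

def filter_legal(span, prev_tokens):
    """Drop bare institutional shorthand unless a preceding "of" anchors it."""
    # prev_tokens is assumed to be a list of the tokens before the span.
    has_of = any(tok.lower() == "of" for tok in prev_tokens)
    if span["text"].lower() in GENRE_STOP and not has_of:
        return None  # ambiguous shorthand, drop
    return span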

How should an LLM fit into this?

Use a language model for candidate generation on hard genres, where it lifts recall on rare bodies. Then verify every span is a literal substring of the page and push name normalisation into your separate resolution step. The LLM proposes; auditable code disposes.
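The verification step can be as simple as the sketch below, which assumes the LLM returns a plain list of candidate strings. The `verify_candidates` function and the sample sentence are illustrative, and the check only locates the first occurrence of each candidate.

```python
def verify_candidates(page_text, candidates):
    """Keep only candidates that occur verbatim in the page, with offsets."""
    verified, rejected = [], []
    for cand in candidates:
        start = page_text.find(cand)
        if start == -1:
            rejected.append(cand)  # not on the page as written
        else:
            verified.append({"surface": cand, "label": "ORG",
                             "start": start, "end": start + len(cand),
                             "resolved": None})
    return verified, rejected


page = "Goods consigned to the Honourable East India Company at Calcutta."
llm_candidates = ["Honourable East India Company",
                  "British East India Company"]
verified, rejected = verify_candidates(page, llm_candidates)
```

The second candidate is a normalised form that never appears on the page, so it lands in `rejected` for review instead of entering the extraction output.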

Key Takeaways

Most ORG-as-PER errors come from firms named after people; fix with gold annotation plus a suffix-promotion rule.
Extract the surface form with offsets and a document date; defer all name merging to resolution.
A curated gazetteer of orders, regiments and guilds is the recall safety net statistical models cannot provide.
Keep the longest defensible span as canonical; do not split nested organisation references.
Genre stop lists and context rules recover precision on legal and parliamentary text.
Let LLMs propose candidates, but verify spans and resolve names in a separate, auditable step.

Frequently Asked Questions

Why does my NER model tag organisations as persons?

Historical organisations often carry a person's name (the 'House of Fugger', 'Baring Brothers'), so a modern model defaults to PER. The fix is targeted gold annotation of these patterns plus a post-hoc rule that promotes known firm suffixes to ORG.

How do I handle organisations whose names changed over time?

Extract the surface form as written, then resolve it separately against a versioned authority that records name spans with dates. Never merge 'East India Company' and its later forms at extraction time; that decision belongs to entity resolution.

Why are religious and military bodies missed entirely?

Standard models rarely see 'the Society of Jesus' or 'the 42nd Regiment of Foot' in training. Add a gazetteer of orders, regiments and guilds, and treat gazetteer hits as a recall safety net alongside the statistical model.

Should I split or keep nested organisation names?

Keep the longest defensible span as the canonical ORG and store inner spans separately if you need them. 'the London office of the Hudson's Bay Company' is one organisation reference, not two competing entities.

Why does precision drop on legal and parliamentary text?

These genres are dense with capitalised role and body names ('the Crown', 'the House', 'the Bench') that look like organisations but are often institutional shorthand. A genre-specific stop list and context rules recover precision fast.

Can an LLM extract historical organisations more reliably than a trained model?

An LLM often improves recall on rare bodies but tends to anachronistically normalise names and merge variants. Use it for candidate generation, then verify spans against the source text and resolve names in a separate, auditable step.
