Named Entities in History

When you extract historical organisations and the results disappoint, the cause is almost always one of four things: organisations tagged as persons, name-over-time drift collapsed too early, whole categories of bodies (orders, regiments, guilds) never seen in training, and precision wrecked by institutional shorthand. This guide walks each symptom to its root cause and a fix you can apply today. Treat extraction as recognising a reference on the page, and keep all merging and dating for a later resolution step.

Why are organisations tagged as PER?

Many historical firms and houses are named after people: Baring Brothers, the House of Medici, and Drexel, Morgan & Co. A model trained on modern news has a strong PER prior for capitalised personal names, so it labels these firms as persons.

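Two complementary fixes apply. The quick one is a post-hoc rule that promotes a PER span to ORG when it contains a known firm marker: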

python
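import re

# Firm markers. Most are suffixes; "House of" and "Society of" lead the name.
ORG_MARKERS = ("& Co.", "Brothers", "Bros.", "& Sons",
               "Company", "Bank", "House of", "Society of")

# Whole-token, case-insensitive match, so "Bank" does not fire inside "Banks"
# and "House of" still matches at the very start of a span.
ORG_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(map(re.escape, ORG_MARKERS)) + r")(?!\w)",
    re.IGNORECASE)

def promote_to_org(span_text, label):
    """Relabel a PER span as ORG when it contains a known firm marker."""
    if label == "PER" and ORG_RE.search(span_text):
        return "ORG"
    return label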

The durable fix is gold annotation: hand-label 150-200 of these constructions so the model learns the pattern rather than relying on a brittle suffix list.

How do I handle names that changed over time?

The body usually called the East India Company also appears, across two centuries of sources, as the Governor and Company of Merchants of London trading into the East Indies and as the Honourable East India Company. Do not merge these at extraction. Record exactly what the page says, with offsets, then resolve later against a versioned authority.

json
{"surface": "Honourable East India Company",
 "label": "ORG", "start": 1182, "end": 1211,
 "doc_date": "1798", "resolved": null}

Leaving resolved null at this stage is the point: extraction observes, resolution decides.
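
A minimal sketch of how such a record might be produced, assuming the surface form has already been proposed by a model or rule. The function name `make_mention` and the sample page text are hypothetical; the record fields follow the JSON above.

```python
import json
from typing import Optional

def make_mention(page_text: str, surface: str, doc_date: str,
                 search_from: int = 0) -> Optional[dict]:
    """Locate `surface` on the page and return an unresolved ORG record."""
    start = page_text.find(surface, search_from)
    if start == -1:
        return None  # not on the page, so there is nothing to observe
    end = start + len(surface)
    return {
        "surface": page_text[start:end],  # copied from the page, never normalised
        "label": "ORG",
        "start": start,
        "end": end,
        "doc_date": doc_date,
        "resolved": None,  # filled in later by the resolution step
    }

# Hypothetical page text, for illustration only.
page = "Letters received from the Honourable East India Company at Calcutta."
mention = make_mention(page, "Honourable East India Company", "1798")
print(json.dumps(mention))
```

Because the surface string is sliced back out of the page by offset, any later audit can confirm the record against the source without trusting the extractor.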

Why are orders, regiments and guilds missed?

Statistical models trained on modern corpora have rarely, if ever, seen the Society of Jesus, the 42nd Regiment of Foot or the Worshipful Company of Goldsmiths. Add a curated gazetteer and run it alongside the model as a recall safety net.

Body type | Example surface forms | Why missed | Fix
--- | --- | --- | ---
Religious orders | Society of Jesus, Order of Preachers | rare in training | order gazetteer
Military units | 42nd Regiment of Foot, the Grand Army | numeric + archaic | regex + unit list
Guilds | Worshipful Company of Goldsmiths | formulaic phrasing | pattern rule
Trading firms | Hudson's Bay Company | contains a personal name | suffix promotion

Gazetteer matches and model predictions are then merged, with the gazetteer covering the long tail the model cannot.
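
One way the merge could look, as a sketch rather than a definitive implementation. The gazetteer entries, the regiment pattern, the span dictionary layout and the rule that model spans win on overlap are all assumptions made for illustration.

```python
import re

# Hypothetical gazetteer: surface form -> body type. Curate per corpus.
GAZETTEER = {
    "Society of Jesus": "religious_order",
    "Order of Preachers": "religious_order",
    "Worshipful Company of Goldsmiths": "guild",
}

# Numbered units such as "42nd Regiment of Foot" (assumed pattern).
UNIT_RE = re.compile(r"\b\d+(?:st|nd|rd|th) Regiment of \w+")

def _span(match, source):
    return {"start": match.start(), "end": match.end(),
            "text": match.group(), "label": "ORG", "source": source}

def gazetteer_spans(text):
    """Return every literal gazetteer hit and every regiment-pattern hit."""
    spans = []
    for name in GAZETTEER:
        spans.extend(_span(m, "gazetteer")
                     for m in re.finditer(re.escape(name), text))
    spans.extend(_span(m, "regex") for m in UNIT_RE.finditer(text))
    return spans

def merge_spans(model_spans, gaz_spans):
    """Keep all model spans; add gazetteer spans that overlap none of them."""
    merged = list(model_spans)
    for g in gaz_spans:
        overlaps = any(g["start"] < m["end"] and m["start"] < g["end"]
                       for m in model_spans)
        if not overlaps:
            merged.append(g)
    return sorted(merged, key=lambda s: s["start"])

text = "He left the Society of Jesus and joined the 42nd Regiment of Foot."
merged = merge_spans([], gazetteer_spans(text))
print([s["text"] for s in merged])
```

Letting the model win on overlap keeps the gazetteer strictly additive. If the model tends to mislabel parts of gazetteer names, reversing that precedence may work better for your corpus.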

Should I split nested organisation names?

No. Pick the longest defensible span as canonical and store inner mentions separately only if a downstream task needs them. 'the London office of the Hudson's Bay Company' is a single organisational reference. Splitting it into two ORGs inflates counts and corrupts any network you later build.
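
The longest-span rule can be sketched as a containment filter. The function name `split_nested` and the span layout (dictionaries with `start`, `end` and `text`) are assumptions for illustration.

```python
def split_nested(spans):
    """Return (canonical, inner): outermost spans, and spans nested inside them."""
    canonical, inner = [], []
    # Longest first, so a container is always seen before its contents.
    for s in sorted(spans, key=lambda s: s["end"] - s["start"], reverse=True):
        contained = any(k["start"] <= s["start"] and s["end"] <= k["end"]
                        for k in canonical)
        (inner if contained else canonical).append(s)
    by_start = lambda s: s["start"]
    return sorted(canonical, key=by_start), sorted(inner, key=by_start)

text = "the London office of the Hudson's Bay Company"
inner_start = text.index("Hudson's Bay Company")
spans = [
    {"start": inner_start, "end": len(text), "text": "Hudson's Bay Company"},
    {"start": 0, "end": len(text), "text": text},
]
canonical, inner = split_nested(spans)
print(len(canonical), len(inner))
```

Only `canonical` should feed counts and network edges; `inner` is kept aside for tasks that explicitly need the parent body.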

Why is precision poor on parliamentary and legal text?

Parliamentary and legal sources are dense with institutional shorthand: the Crown, the House, the Bench, the Court. These read like organisations but are often role-references. Build a genre-specific stop list and require contextual evidence before tagging a bare definite-article body.

python
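GENRE_STOP = {"the crown", "the house", "the bench", "the court", "the bar"}

def filter_legal(span, prev_tokens):
    """Drop bare institutional shorthand unless preceding context supports it."""
    is_shorthand = span["text"].strip().lower() in GENRE_STOP
    # Context cue: an "of" among the preceding tokens ("officers of the Crown")
    # is treated here as evidence that a body is meant. Compared case-insensitively.
    has_context = any(t.lower() == "of" for t in prev_tokens)
    if is_shorthand and not has_context:
        return None  # ambiguous shorthand, drop
    return span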

How should an LLM fit into this?

Use a language model for candidate generation on hard genres, where it lifts recall on rare bodies. Then verify every span is a literal substring of the page and push name normalisation into your separate resolution step. The LLM proposes; auditable code disposes.
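
The verification step could be as simple as the sketch below. It assumes the LLM's output has already been parsed into a list of candidate strings. The name `verify_candidates` and the sample data are hypothetical, and no particular LLM API is implied.

```python
def verify_candidates(page_text, candidates):
    """Split candidates into those found verbatim on the page and the rest."""
    verified, rejected = [], []
    for surface in candidates:
        start = page_text.find(surface) if surface else -1
        if start == -1:
            rejected.append(surface)  # normalised or invented: not on the page
            continue
        verified.append({
            "surface": surface,
            "label": "ORG",
            "start": start,  # first occurrence only, in this sketch
            "end": start + len(surface),
            "resolved": None,  # normalisation belongs to the resolution step
        })
    return verified, rejected

# Hypothetical page text and LLM candidates, for illustration only.
page = "Bills drawn on Baring Brothers were refused by the Company."
verified, rejected = verify_candidates(page, ["Baring Brothers", "Barings Bank"])
print(verified, rejected)
```

A rejected candidate is often a modernised form of something that is on the page, so logging the rejects gives the resolution step a useful list of variants to review.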

Key Takeaways

  • Most ORG-as-PER errors come from firms named after people; fix with gold annotation plus a suffix-promotion rule.
  • Extract the surface form with offsets and a document date; defer all name merging to resolution.
  • A curated gazetteer of orders, regiments and guilds is the recall safety net statistical models cannot provide.
  • Keep the longest defensible span as canonical; do not split nested organisation references.
  • Genre stop lists and context rules recover precision on legal and parliamentary text.
  • Let LLMs propose candidates, but verify spans and resolve names in a separate, auditable step.

Frequently Asked Questions

Why does my NER model tag organisations as persons?

Historical organisations often carry a person's name (the 'House of Fugger', 'Baring Brothers'), so a modern model defaults to PER. The fix is targeted gold annotation of these patterns plus a post-hoc rule that promotes known firm suffixes to ORG.

How do I handle organisations whose names changed over time?

Extract the surface form as written, then resolve it separately against a versioned authority that records name spans with dates. Never merge 'East India Company' and its other historical forms at extraction time; that decision belongs to entity resolution.

Why are religious and military bodies missed entirely?

Standard models rarely see 'the Society of Jesus' or 'the 42nd Regiment of Foot' in training. Add a gazetteer of orders, regiments and guilds, and treat gazetteer hits as a recall safety net alongside the statistical model.

Should I split or keep nested organisation names?

Keep the longest defensible span as the canonical ORG and store inner spans separately if you need them. 'the London office of the Hudson's Bay Company' is one organisation reference, not two competing entities.

Why is precision poor on parliamentary and legal text?

Parliamentary and legal sources are dense with capitalised role and body names ('the Crown', 'the House', 'the Bench') that look like organisations but are often institutional shorthand. A genre-specific stop list and context rules recover precision fast.

Can an LLM extract historical organisations more reliably than a trained model?

An LLM often improves recall on rare bodies but tends to anachronistically normalise names and merge variants. Use it for candidate generation, then verify spans against the source text and resolve names in a separate, auditable step.