Named Entities in History

Relation extraction means finding how two entities are connected and naming that connection — turning "John Evelyn, born at Wotton" into the triple (John Evelyn, born-in, Wotton). Where named-entity recognition finds the people, places and organisations, relation extraction adds the verbs between them, and that is what lets you build genealogies, correspondence networks and prosopographies. This guide assumes no prior background and walks a small worked example end to end.

How is relation extraction different from NER?

NER answers "what entities are in this text?" Relation extraction answers "how are they related?" They stack:

text
NER:       [John Evelyn]PER , born at [Wotton]PLACE
Relations: (John Evelyn, born-in, Wotton)

You almost always run NER first, because a relation needs two entity endpoints. Trying to extract relations from raw, untagged text is a common beginner misstep.
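To make the dependency on NER concrete, here is a minimal sketch of how tagged entities become candidate endpoints for relations. The tuple layout (surface text, label, start offset) and the function name `candidate_pairs` are assumptions for illustration, not the output format of any particular NER tool.

```python
from itertools import permutations

# Hypothetical NER output for "John Evelyn, born at Wotton":
# (surface text, label, start offset in the sentence).
entities = [
    ("John Evelyn", "PER", 0),
    ("Wotton", "PLACE", 21),
]

def candidate_pairs(ents):
    """Return every ordered (subject, object) pair of distinct entities."""
    return [(a[0], b[0]) for a, b in permutations(ents, 2)]

pairs = candidate_pairs(entities)
```

Both orderings are kept as candidates because a relation is directed; a later step decides which ordering, if any, the sentence supports.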

What does a relation look like as data?

The standard shape is a directed triple: subject, predicate, object.

json
{"subject": "John Evelyn",
 "predicate": "born-in",
 "object": "Wotton",
 "source": "diary_1620.txt:142",
 "evidence": "John Evelyn, born at Wotton"}

Keep the source and evidence fields from day one. A relation you cannot trace back to a sentence is a relation a reviewer cannot trust.
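One way to enforce that habit is to make provenance mandatory in the record type itself. The sketch below is an assumption about how you might model it in Python; the class name `Triple` and the validation rule are illustrative, not part of any standard.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str
    source: str    # where the claim came from, e.g. "file:line"
    evidence: str  # the exact text that supports the claim

    def __post_init__(self):
        # Refuse to create a triple with missing provenance.
        if not self.source.strip() or not self.evidence.strip():
            raise ValueError("every triple needs a source and evidence")

t = Triple("John Evelyn", "born-in", "Wotton",
           "diary_1620.txt:142", "John Evelyn, born at Wotton")
record = json.dumps(asdict(t))  # same shape as the JSON record above
```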

Which relation types should I define first?

Resist a giant schema. Start with three to six relations your research question actually needs:

Predicate | Example evidence | Typical use
born-in | "born at Wotton" | biography, mapping
parent-of | "son of Richard" | genealogy
member-of | "Fellow of the Royal Society" | institutional networks
governed | "Governor of Bombay" | prosopography
corresponded-with | "in a letter to Boyle" | letter networks

A small schema annotated consistently beats a sprawling one applied unevenly.
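A small schema is also easy to write down as data and check automatically. The following is a sketch under stated assumptions: the entity type labels (PER, PLACE, ORG) and the allowed type pairs are illustrative guesses about your NER tag set, and `is_valid` is a hypothetical helper.

```python
# Hypothetical schema: predicate -> (allowed subject type, allowed object type).
# The type labels (PER, PLACE, ORG) are assumptions about the NER tag set.
SCHEMA = {
    "born-in": ("PER", "PLACE"),
    "parent-of": ("PER", "PER"),
    "member-of": ("PER", "ORG"),
    "governed": ("PER", "PLACE"),
    "corresponded-with": ("PER", "PER"),
}

def is_valid(predicate, subject_type, object_type):
    """True only if the predicate is in the schema and both types fit."""
    return SCHEMA.get(predicate) == (subject_type, object_type)
```

A check like this rejects predicates that drift into the data unannounced, and it catches some reversed triples for free: a born-in relation whose subject is a place cannot be right.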

What is the simplest method that actually works?

Start with patterns, not machine learning. Many historical relations sit in formulaic phrasing — X, son of Y; X, vicar of Z — that a handful of rules capture reliably.

python
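import re

# One or two capitalised words, e.g. "Richard" or "John Evelyn".
NAME = r"[A-Z]\w+(?: [A-Z]\w+)?"

# Group names fix the direction: "s" is the subject, "o" is the object.
PATTERNS = [
    # "X, son of Y": Y is the parent, so Y is the subject of parent-of.
    (rf"(?P<o>{NAME}), son of (?P<s>{NAME})", "parent-of"),
    # "X, born at Y": X is the person born, Y is the place.
    (rf"(?P<s>{NAME}), born at (?P<o>[A-Z]\w+)", "born-in"),
]

def extract(sentence):
    """Return one dict per pattern match found in the sentence."""
    out = []
    for pat, pred in PATTERNS:
        for m in re.finditer(pat, sentence):
            out.append({"subject": m.group("s"),
                        "predicate": pred,
                        "object": m.group("o")})
    return out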

This rule baseline is your yardstick. Only move to a dependency parse or a trained classifier when patterns demonstrably miss too much.
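"Miss too much" is easiest to judge with a number. A minimal sketch, assuming you have hand-labelled a small set of triples from the same sentences the rules ran on: compute the share of labelled triples the rules recovered. The triples below are hypothetical, for illustration only.

```python
def recall(predicted, gold):
    """Share of hand-labelled triples that the rules also found."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)

# Hypothetical hand-labelled triples and rule output, for illustration only.
gold = [
    ("John Evelyn", "born-in", "Wotton"),
    ("Richard Evelyn", "parent-of", "John Evelyn"),
    ("John Evelyn", "member-of", "Royal Society"),
    ("Robert Boyle", "corresponded-with", "John Evelyn"),
]
predicted = [
    ("John Evelyn", "born-in", "Wotton"),
    ("Richard Evelyn", "parent-of", "John Evelyn"),
]
score = recall(predicted, gold)  # 2 of the 4 labelled triples were found
```

What counts as too low depends on the research question; the point is to compare any heavier method against this figure on the same labelled set.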

Why does direction matter so much?

A relation is ordered. (Richard, parent-of, John) and (John, parent-of, Richard) are opposite claims, and swapping them inverts an entire family tree. Always store subject-predicate-object in fixed order; never collapse a relation to an undirected "these two are linked".
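Sources phrase the same fact from either end ("son of Richard", "father of John"), so it can help to rewrite inverse predicates into one canonical direction before storing them. This is a sketch of one possible convention; the `INVERSE_OF` mapping and the `canonical` helper are hypothetical names.

```python
# Hypothetical mapping from inverse predicates to the canonical one.
INVERSE_OF = {"child-of": "parent-of"}

def canonical(subject, predicate, obj):
    """Rewrite an inverse predicate as its canonical form, swapping the ends."""
    if predicate in INVERSE_OF:
        return (obj, INVERSE_OF[predicate], subject)
    return (subject, predicate, obj)

a = canonical("John", "child-of", "Richard")
b = canonical("Richard", "parent-of", "John")
# Both calls yield the same stored triple: ("Richard", "parent-of", "John").
```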

How do I follow a worked example end to end?

Take the sentence: "Robert Boyle, son of the Earl of Cork, corresponded with John Evelyn."

  1. NER yields Robert Boyle (PER), Earl of Cork (PER/ROLE), John Evelyn (PER).
  2. Patterns match X, son of Y and the corresponded with cue.
  3. Triples produced:
    • (Earl of Cork, parent-of, Robert Boyle), with the parent as subject, as the parent-of predicate in the schema requires
    • (Robert Boyle, corresponded-with, John Evelyn)
  4. Evidence stored against each, pointing back to the sentence.

You now have two edges ready to load into a network graph.
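Loading those edges can be as simple as a dictionary of outgoing edges. The sketch below uses only the standard library and is one possible representation, not a required one; `build_graph` is a hypothetical helper. The family edge is written with the schema's parent-of predicate, so the parent is the subject.

```python
from collections import defaultdict

# The two edges from the sentence; the family edge uses the schema's
# parent-of predicate, so the parent is the subject.
triples = [
    ("Earl of Cork", "parent-of", "Robert Boyle"),
    ("Robert Boyle", "corresponded-with", "John Evelyn"),
]

def build_graph(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)

graph = build_graph(triples)
```

Because edges are stored only under their subject, John Evelyn has no outgoing edge here; the direction of each relation survives into the graph.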

How do I check the results are any good?

Sample 50-100 extracted triples and verify each against its source sentence by hand. Report precision per relation type — not one global number — and look specifically for two failures: direction errors, and relations asserted between entities the sentence never actually links (a frequent artefact of long sentences with many entities).
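The sampling and the per-type report can both be scripted. The following is a minimal sketch: the function names and the (predicate, is_correct) judgement format are assumptions, and the judgements shown are hypothetical, not real results.

```python
import random
from collections import defaultdict

def sample_for_review(triples, k=50, seed=0):
    """Draw up to k triples reproducibly for hand checking."""
    rng = random.Random(seed)
    return rng.sample(triples, min(k, len(triples)))

def precision_by_type(judged):
    """judged: iterable of (predicate, is_correct) pairs from the reviewer."""
    totals, correct = defaultdict(int), defaultdict(int)
    for predicate, ok in judged:
        totals[predicate] += 1
        correct[predicate] += int(ok)
    return {p: correct[p] / totals[p] for p in totals}

# Hypothetical reviewer judgements, for illustration only.
judged = [("born-in", True), ("born-in", True),
          ("parent-of", True), ("parent-of", False)]
report = precision_by_type(judged)
```

A fixed seed means a second reviewer can draw the same sample, and the per-type breakdown shows at a glance which relation is dragging the results down.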

Key Takeaways

  • Relation extraction names the connection between two entities, producing directed subject-predicate-object triples.
  • Run reliable NER first; relations need entity endpoints to connect.
  • Store source and evidence with every triple so each claim is traceable.
  • Begin with three to six relation types your research question genuinely needs.
  • A small set of patterns over formulaic phrasing is a strong, honest baseline before any machine learning.
  • Direction is load-bearing; store ordered triples and never collapse relations to undirected links.
  • Validate by hand-checking 50-100 triples and reporting precision per relation type.

Frequently Asked Questions

What is relation extraction, in plain terms?

It is finding how two entities are connected and naming that connection. Where NER tells you a text mentions a person and a place, relation extraction tells you the person 'was born in' or 'governed' that place, producing a subject-predicate-object triple.

Do I need NER before relation extraction?

Almost always, yes. Relations connect entities, so you first need reliable entity spans. A common beginner mistake is trying to extract relations from raw text before the people, places and organisations have been recognised and resolved.

What is the simplest method that works?

A small fixed set of patterns over the text between two entities. Phrases like 'X, son of Y' or 'X, vicar of Z' map directly to relations and give a reliable baseline before you reach for a dependency parse or machine learning.

How many relation types should a beginner define?

Start with three to six that your research question genuinely needs, such as born-in, parent-of, member-of and governed. A small, well-defined schema annotated consistently beats a sprawling one applied unevenly.

Why are my relations directional and does it matter?

Direction is essential: (Richard, parent-of, John) and (John, parent-of, Richard) are opposite claims, and swapping them corrupts any genealogy or network you build. Always store relations as ordered subject-predicate-object, never as an undirected link.

How do I know if my extracted relations are any good?

Sample 50-100 extracted triples and check each against the source sentence by hand. Report precision per relation type, and watch especially for direction errors and relations asserted between entities that the sentence never actually links.