How to Avoid Historical Network Data Pitfalls

The fastest way to avoid historical network data pitfalls is to fix four things before you compute anything: survivorship bias, entity resolution, an explicit network boundary, and honest treatment of missing ties. Most "surprising" findings in historical network analysis turn out to be artefacts of one of these. This step-by-step guide gives practical defaults for each so you reach a usable, defensible result.

Step 1 — Confront survivorship bias first

The deepest trap is treating absence of evidence as evidence of absence. A missing edge usually means a lost letter, an unkept record, or a source that never existed, not that two people had no relationship. Because better-documented figures accrue more edges, your highest-centrality nodes may simply be the best-recorded ones.

Practical check before you trust any ranking:

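python

import pandas as pd

# Assumes G is an existing networkx graph and source_counts is a pandas Series
# (index: node ID, value: number of documents mentioning that node).
# Question: does degree just track how many sources mention each person?
deg = pd.Series(dict(G.degree()))
src = source_counts.reindex(deg.index)   # align counts to the graph's nodes
# Rank correlation, since the concern is the ranking; a high value is suspicious.
print(deg.corr(src, method="spearman"))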

A strong correlation is a red flag that you are measuring documentation, not history.

Step 2 — Resolve entities before building the graph

If "Wm. Cecil", "William Cecil" and "Lord Burghley" become three nodes, one person's ties are split three ways and his centrality is understated. If two different Thomas Wyatts merge into one node, you fabricate ties between their separate circles. Resolve identities first, assigning a stable ID per person (see record-linkage tooling), and only then construct edges. Never let raw name strings serve as node identities.
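As a minimal sketch of resolving before building, the snippet below maps raw name strings to stable IDs and only then counts edges. The alias table, the IDs (`P001` and so on), "Correspondent X", and the letter records are all invented for illustration; a real project would draw them from a curated authority file or a record-linkage step.

```python
from collections import Counter

# Hypothetical alias table: raw name string -> stable person ID.
ALIASES = {
    "Wm. Cecil": "P001",
    "William Cecil": "P001",
    "Lord Burghley": "P001",
    "Correspondent X": "P002",
    "Thomas Wyatt (the elder)": "P003",
    "Thomas Wyatt (the younger)": "P004",
}

# Made-up letter records: (sender string, recipient string).
raw_letters = [
    ("Wm. Cecil", "Correspondent X"),
    ("Lord Burghley", "Correspondent X"),
    ("William Cecil", "Thomas Wyatt (the younger)"),
    ("T. Wyatt", "Correspondent X"),  # ambiguous: not in the alias table
]


def resolve(name):
    """Return the stable ID for a raw name, or None if it is unresolved."""
    return ALIASES.get(name.strip())


edges = Counter()   # undirected edge (id_a, id_b) -> number of letters
unresolved = []     # records held back for manual review

for sender, recipient in raw_letters:
    a, b = resolve(sender), resolve(recipient)
    if a is None or b is None:
        # Do not guess: an ambiguous name never becomes a node.
        unresolved.append((sender, recipient))
        continue
    edges[tuple(sorted((a, b)))] += 1

print(dict(edges))
print(unresolved)
```

Holding ambiguous records back, rather than guessing, keeps the graph free of fabricated ties and gives you a review list that doubles as the start of an entity-resolution log.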

Step 3 — Define and document the network boundary

Every historical network has to stop somewhere. Who is in? A clear, mechanical rule prevents the boundary from drifting as you work:

Boundary type	Example rule
Source-based	everyone in this letterbook
Attribute-based	members of the guild, 1500-1550
Snowball	ego plus alters named twice or more

Write the rule down. Two analysts applying different unstated boundaries to the same archive will reach different conclusions and never know why.
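One way to keep a boundary rule mechanical is to express it as a function. The sketch below implements the snowball rule from the table (ego plus alters named twice or more). It assumes "named" means one count per (document, person) mention; the function name, IDs, and mention list are hypothetical.

```python
from collections import Counter


def snowball_boundary(ego, mentions, min_mentions=2):
    """Return ego plus every alter named at least `min_mentions` times.

    `mentions` is a list of (document_id, person_id) pairs taken from
    documents in the ego's archive.
    """
    counts = Counter(person for _doc, person in mentions if person != ego)
    alters = {person for person, n in counts.items() if n >= min_mentions}
    return {ego} | alters


# Made-up mentions for illustration.
mentions = [
    ("letter_01", "P002"),
    ("letter_02", "P002"),
    ("letter_02", "P003"),
    ("letter_03", "P004"),
    ("letter_04", "P004"),
    ("letter_04", "P001"),  # the ego naming themselves is ignored
]

members = snowball_boundary("P001", mentions)
print(sorted(members))  # ['P001', 'P002', 'P004']
```

Because the threshold is a parameter, a second analyst can reproduce the boundary exactly, or change `min_mentions` and see what moves.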

Step 4 — Decide how to handle isolates and missing ties

Isolated nodes (no edges) may be genuine social facts or artefacts of your boundary. Decide consciously whether to keep them and state it, because dropping isolates silently changes density and component counts. For missing ties, prefer reporting coverage ("we observe roughly X% of expected interactions") over imputing relationships you cannot evidence.
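To see how much the isolate decision matters, the following sketch compares density and component counts with and without isolates, then reports coverage. It assumes networkx is installed; the toy graph and the `expected_ties` figure are invented for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")])
G.add_nodes_from(["f", "g"])  # inside the boundary, but no observed ties


def summary(graph):
    """Collect the figures that change when isolates are dropped."""
    return {
        "nodes": graph.number_of_nodes(),
        "density": round(nx.density(graph), 3),
        "components": nx.number_connected_components(graph),
    }


with_isolates = summary(G)

H = G.copy()
H.remove_nodes_from(list(nx.isolates(H)))  # drop isolates explicitly
without_isolates = summary(H)

print("kept:   ", with_isolates)
print("dropped:", without_isolates)

# Coverage: observed ties as a share of an externally estimated total.
expected_ties = 10  # hypothetical estimate, e.g. from a register of letters sent
coverage = G.number_of_edges() / expected_ties
print(f"coverage: {coverage:.0%}")
```

In this toy case, dropping two isolates roughly doubles the density and halves the component count, which is why the choice belongs in the methods note.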

How do I avoid over-reading the picture?

A force-directed diagram arranges nodes for legibility, not meaning. Proximity on screen is not historical closeness, and a central-looking position is not centrality. Anchor every claim to a computed number — degree, betweenness, modularity — and treat the layout as illustration only. If you cannot state the metric behind a sentence, do not write the sentence.
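A short sketch of computing the numbers that should stand behind each claim, assuming networkx is installed. The graph is a toy example of two triangles joined by a single tie, and greedy modularity maximisation is only one of several community-detection methods you might choose.

```python
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),  # first cluster
    ("d", "e"), ("d", "f"), ("e", "f"),  # second cluster
    ("c", "d"),                          # single bridging tie
])

degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)  # normalised to [0, 1] by default

# Partition found by greedy modularity maximisation, and its modularity score.
groups = community.greedy_modularity_communities(G)
modularity = community.modularity(G, groups)

for node in sorted(G):
    print(node, degree[node], round(betweenness[node], 2))
print("communities:", [sorted(g) for g in groups])
print("modularity:", round(modularity, 3))
```

Here `c` and `d` carry all the betweenness because every path between the clusters crosses them. That is a statement you can defend with a number, whatever the layout happens to show.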

How do I make the whole analysis defensible?

Defensibility comes from documentation plus one sensitivity test. Record your boundary rule, entity-resolution decisions, weight definition, and source coverage in a methods note. Then change a single assumption (widen the boundary, or swap a debatable identity merge) and re-run. If your headline finding survives, it is robust to that assumption; if it flips, you have found the real limit of your evidence.

text

methods-note checklist
[ ] boundary rule stated
[ ] entity-resolution log kept
[ ] weight definition documented
[ ] source coverage estimated
[ ] one sensitivity re-run reported
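The sensitivity re-run on the checklist can be sketched as follows: run the same degree ranking under two alias tables that differ by one debatable merge, and compare the headline result. All names, IDs, and records are invented for illustration, and `top_by_degree` is a hypothetical helper.

```python
from collections import defaultdict


def top_by_degree(records, aliases):
    """Resolve names via `aliases`, build undirected ties, return (top node, degrees)."""
    neighbours = defaultdict(set)
    for a_raw, b_raw in records:
        a, b = aliases[a_raw], aliases[b_raw]
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    degree = {node: len(adj) for node, adj in neighbours.items()}
    # sorted() makes tie-breaking deterministic (lowest ID wins).
    top = max(sorted(degree), key=degree.get)
    return top, degree


# Made-up records: (name string, name string).
records = [
    ("Thomas Wyatt", "Correspondent A"),
    ("Thomas Wyatt", "Correspondent B"),
    ("T. Wyatt", "Correspondent C"),
    ("T. Wyatt", "Correspondent D"),
    ("Correspondent A", "Correspondent B"),
    ("Correspondent A", "Correspondent C"),
]

baseline = {
    "Thomas Wyatt": "P003",
    "T. Wyatt": "P003",  # debatable merge: treated as the same person
    "Correspondent A": "P010",
    "Correspondent B": "P011",
    "Correspondent C": "P012",
    "Correspondent D": "P013",
}
variant = dict(baseline, **{"T. Wyatt": "P005"})  # same data, merge undone

top_base, _ = top_by_degree(records, baseline)
top_var, _ = top_by_degree(records, variant)
print("baseline top node:", top_base)
print("variant top node: ", top_var)
print("finding survives: ", top_base == top_var)
```

In this toy data the top-ranked node changes when the merge is undone, so any claim about "the most connected person" would depend entirely on that one identity decision and should be reported as such.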

Key Takeaways

Treat missing edges as missing data, not as real absence of ties.
Check whether centrality merely tracks source volume before interpreting rankings.
Resolve entities to stable IDs before building the graph, never using raw names as identity.
Define and write down an explicit, mechanical network boundary.
Handle isolates and missing ties deliberately, and report coverage instead of imputing.
Never read meaning into layout position; anchor claims to computed metrics.
Document assumptions and run one sensitivity test to check robustness.

Frequently Asked Questions

What is the single biggest pitfall in historical network analysis?

Treating absence of evidence as evidence of absence — a missing edge usually means a lost or never-created source, not a relationship that did not exist. This survivorship bias distorts almost every downstream metric.

How does survivorship bias affect network metrics?

Better-documented people gain artificially high degree and centrality, so your most "important" nodes may simply be the best-recorded ones. Always check whether centrality correlates with source volume before interpreting it.

Why is entity resolution a network pitfall?

If the same person appears under several name spellings, one real node splits into many, fragmenting the network; if different people merge, ties get fabricated. Resolve identities before building the graph.

Should I include isolated nodes?

Decide deliberately and document it. Isolates can be real social facts or just artefacts of how you drew the boundary, and silently dropping them changes density and component counts.

How do I avoid over-interpreting a network diagram?

Remember that layout position carries no inherent meaning — proximity in a force-directed plot is not historical closeness. Anchor every claim to a computed metric, not to where nodes happen to land.

How can I make my network analysis defensible?

Document your boundary rule, entity-resolution decisions, weight definition and source coverage, and run a sensitivity check by varying one assumption. Reproducibility, not a pretty graph, is what makes it hold up.
