Network Analysis of Sources

When NetworkX gives you wrong results on historical data, the cause is almost always identity and loading — not the algorithms. The three errors that bite historians most often are node-label collisions (two people merged into one), silently overwritten edge weights, and KeyError on missing attributes. Fix those at load time and most "weird centrality" mysteries disappear. Here is how to diagnose and repair each.

Why does NetworkX merge people I meant to keep separate?

In NetworkX the node label is the identity. If you load names as labels, "Iohannes Smyth" and "John Smith" become two nodes even when they are the same man, while two unrelated men both written "Iohannes" collapse into one. This is the single biggest source of corrupted historical networks.

The fix is to never use names as identity. Assign a stable unique ID and carry the name as an attribute:

python
import networkx as nx
G = nx.Graph()
G.add_node("P0317", label="Iohannes Smyth", role="merchant")
G.add_node("P0318", label="Iohannes Smyth", role="cleric")  # different man, safe

Build the ID map during record linkage, before the graph ever exists.
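As a minimal sketch of that workflow, the snippet below assumes record linkage has already produced a list of mentions, each tagged with a person ID. The `linked_mentions` rows, the IDs, and the `variants` attribute are made up for illustration; only the ID is used as the node label.

```python
import networkx as nx

# Hypothetical output of record linkage: every raw mention already carries
# the person ID the historian assigned to it (rows and IDs are invented).
linked_mentions = [
    {"person_id": "P0317", "name": "Iohannes Smyth", "role": "merchant"},
    {"person_id": "P0317", "name": "John Smith", "role": "merchant"},
    {"person_id": "P0318", "name": "Iohannes Smyth", "role": "cleric"},
]

G = nx.Graph()
for mention in linked_mentions:
    pid = mention["person_id"]
    if pid not in G:
        # The first mention supplies the display label and role.
        G.add_node(pid, label=mention["name"], role=mention["role"], variants=[])
    # Every spelling is kept as data, never as identity.
    G.nodes[pid]["variants"].append(mention["name"])

print(G.number_of_nodes())
print(G.nodes["P0317"]["variants"])
```

The two spellings of the merchant end up on one node, and the cleric with the identical spelling stays separate.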

Why are my edge weights being silently overwritten?

A plain Graph keeps at most one edge per pair. Call add_edge("A","B") twice and the second call updates the existing edge instead of adding another, so ten letters between two correspondents collapse into a single tie, and any weight you pass is replaced by the last value rather than summed. Diagnose it by checking G.number_of_edges() against your raw row count: a big gap means collapsed duplicates.
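The check can be sketched on a toy edge list. The `rows` below are invented (one tuple per letter), and the comparison against distinct pairs is an extra step added here to show that the gap comes from duplicates rather than from dropped data.

```python
import networkx as nx

# Made-up raw rows: one (sender, recipient) tuple per letter.
rows = [("P0317", "P0402")] * 10 + [("P0317", "P0555")] * 2

G = nx.Graph()
for source, target in rows:
    G.add_edge(source, target)  # repeat calls never add a parallel edge

raw_rows = len(rows)
# frozenset ignores direction, matching how an undirected Graph sees a pair
distinct_pairs = len({frozenset(pair) for pair in rows})

print("rows:", raw_rows)
print("distinct pairs:", distinct_pairs)
print("edges:", G.number_of_edges())
```

When the edge count matches the distinct pairs but sits far below the row count, the missing rows are repeated ties that need to become weights.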

Aggregate first, then load:

python
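import pandas as pd
# order each pair so A->B and B->A count as the same undirected tie;
# otherwise the second direction overwrites the first on load
pairs = df[["source", "target"]].apply(sorted, axis=1, result_type="expand")
pairs.columns = ["source", "target"]
agg = pairs.groupby(["source", "target"]).size().reset_index(name="weight")
G = nx.from_pandas_edgelist(agg, "source", "target", edge_attr="weight")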

If every individual letter must survive (e.g. for dated analysis), use a MultiGraph instead, which permits parallel edges.
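A minimal sketch of the MultiGraph route, assuming each letter carries a date string. The `letters` rows and the `date` attribute name are illustrative. The second half shows one way to collapse the parallel edges into a weighted Graph once the dated analysis is done.

```python
import networkx as nx

# Made-up letters: (person, person, date). Direction is ignored here
# because MultiGraph is undirected.
letters = [
    ("P0317", "P0402", "1612-03-01"),
    ("P0317", "P0402", "1612-04-15"),
    ("P0402", "P0317", "1612-05-02"),
]

M = nx.MultiGraph()
for a, b, date in letters:
    M.add_edge(a, b, date=date)  # each call adds a new parallel edge

# Every letter survives as its own edge with its own attributes.
dates = sorted(data["date"] for _, _, data in M.edges(data=True))

# Collapse to a weighted simple graph when a metric needs one.
W = nx.Graph()
for u, v in M.edges():
    if W.has_edge(u, v):
        W[u][v]["weight"] += 1
    else:
        W.add_edge(u, v, weight=1)

print(M.number_of_edges(), W["P0317"]["P0402"]["weight"])
```

Keeping the MultiGraph as the archive and deriving weighted graphs from it means the dated detail is never lost to an early aggregation.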

How do I fix a KeyError on node attributes?

A KeyError when reading G.nodes[n]["place"] means that node never received the attribute, typically because it was created implicitly by add_edge or came from an edge list with no matching row in your node table. Two robust patterns:

python
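# 1. fill the default only where the attribute is missing
#    (passing a bare value would overwrite every node, known places included)
missing = {n: "unknown" for n in G if "place" not in G.nodes[n]}
nx.set_node_attributes(G, missing, "place")
# 2. or read defensively
place = G.nodes[n].get("place", "unknown")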

Set defaults immediately after construction so downstream code never has to guess.
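Before filling defaults it can help to see which nodes are missing what. The audit below is a sketch: the `required` list is a hypothetical set of attributes, and the three nodes are invented to show both an implicitly created node and an explicit one added without a place.

```python
import networkx as nx

G = nx.Graph()
G.add_node("P0317", place="Bristol", role="merchant")
G.add_edge("P0317", "P0402")        # P0402 is created implicitly, with no attributes
G.add_node("P0555", role="cleric")  # explicit node, but no place recorded

required = ["place", "role"]  # hypothetical attributes your analysis reads

# Map each required attribute to the nodes that lack it.
gaps = {
    attr: [n for n, data in G.nodes(data=True) if attr not in data]
    for attr in required
}

for attr, nodes in gaps.items():
    print(f"{attr}: {len(nodes)} missing", nodes)
```

A long list under one attribute usually points to a join problem between the edge list and the node table rather than to genuinely unknown values.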

Why is betweenness centrality so slow?

Exact betweenness centrality runs in roughly O(VE) time and becomes painful above a few thousand nodes. Symptoms: a script that hangs for minutes on a graph that draws instantly. Two fixes:

python
# restrict to the largest connected component first
giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()

# then sample k source nodes for an approximation (k must not exceed the node count)
bc = nx.betweenness_centrality(giant, k=min(500, len(giant)), seed=42)

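Sampling k source nodes cuts the work roughly in proportion to k/V, so k=500 on a 10,000-node graph is about twenty times faster. The ranking of the most central nodes is usually stable under sampling, but the scores are estimates, so check them against an exact run on a smaller graph before you cite them.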
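One way to judge whether sampling is good enough for your question is to compare the top-ranked nodes from an exact and a sampled run on a graph small enough to compute both. This is only a sketch: the random graph stands in for real data, and `k=60`, the top-10 cutoff, and the `top` helper are arbitrary choices made for illustration.

```python
import networkx as nx

# Stand-in graph; replace with your own once it is loaded and cleaned.
G = nx.gnm_random_graph(300, 900, seed=1)
giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()

exact = nx.betweenness_centrality(giant)
approx = nx.betweenness_centrality(giant, k=60, seed=42)

def top(scores, n=10):
    """Return the n highest-scoring nodes as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

overlap = len(top(exact) & top(approx))
print(f"top-10 overlap between exact and sampled: {overlap}/10")
```

If the overlap is low, raise k or fall back to the exact computation on the giant component.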

Diagnosing phantom nodes after loading

Phantom nodes (empty strings, stray whitespace, NaN) are the second-most-common load defect. Quick triage:

python
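print("nodes:", G.number_of_nodes())
# n != n is True only for NaN; the string checks catch empty and padded labels
print("suspect:", [n for n in G.nodes if n != n or str(n) != str(n).strip() or not str(n)])
print("isolates:", list(nx.isolates(G))[:10])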

Clean source and target columns — strip whitespace, drop empty IDs, normalise case — before from_pandas_edgelist, never after.
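The cleaning pass might look like the sketch below. The `raw` frame is invented, and upper-casing is just one possible case rule for IDs. Missing values are dropped before the cast to string, because casting first would turn them into literal text that looks like a real ID.

```python
import networkx as nx
import pandas as pd

# Made-up messy edge list: padded IDs, mixed case, an empty and a missing source.
raw = pd.DataFrame({
    "source": ["P0317", " p0317 ", "P0402", "", None],
    "target": ["P0402", "P0402", "P0555 ", "P0317", "P0402"],
})

# 1. drop rows with a missing endpoint before any string conversion
clean = raw.dropna(subset=["source", "target"]).copy()

# 2. strip whitespace and normalise case on both ID columns
for col in ["source", "target"]:
    clean[col] = clean[col].astype(str).str.strip().str.upper()

# 3. drop rows whose ID is empty after stripping
clean = clean[(clean["source"] != "") & (clean["target"] != "")]

G = nx.from_pandas_edgelist(clean, "source", "target")
print(sorted(G.nodes))
```

Only the three real people remain as nodes; the padded and lower-case variants fold into their clean IDs.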

Directed letters vs undirected co-presence

Historians often want both a directional view (who wrote to whom) and a symmetric view (who appears together). You cannot mix these in one graph. Build two:

View | Object | Question it answers
Letters sent | DiGraph | flow, reciprocity, in/out degree
Co-presence | Graph | who clusters with whom

Compute metrics on each and compare; never force directionality onto a co-presence tie or vice versa.
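A small sketch of the two-graph setup, using invented data: `letters` holds (sender, recipient) pairs and `witness_lists` holds the people named together in each document. Counting shared documents as an edge weight is an assumption made here for illustration, not the only way to model co-presence.

```python
from itertools import combinations

import networkx as nx

# Made-up inputs.
letters = [("P0317", "P0402"), ("P0402", "P0317"), ("P0317", "P0555")]
witness_lists = [["P0317", "P0402", "P0555"], ["P0402", "P0555"]]

# Directed view: who wrote to whom.
D = nx.DiGraph()
D.add_edges_from(letters)

# Undirected view: every pair named in the same document is tied,
# weighted by the number of documents they share.
C = nx.Graph()
for people in witness_lists:
    for a, b in combinations(sorted(set(people)), 2):
        if C.has_edge(a, b):
            C[a][b]["weight"] += 1
        else:
            C.add_edge(a, b, weight=1)

print("out-degree:", dict(D.out_degree()))
print("reciprocity:", nx.reciprocity(D))
print("shared documents P0402/P0555:", C["P0402"]["P0555"]["weight"])
```

Because both graphs use the same person IDs as labels, results from one can be looked up against the other without any further matching.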

Key Takeaways

  • Use stable unique IDs as node labels and store names as attributes to prevent merges and splits.
  • Aggregate duplicate ties into weights before loading, or use a MultiGraph when every instance matters.
  • Set attribute defaults right after construction to avoid KeyError downstream.
  • Speed up betweenness with the k sampling parameter and always run on the largest component.
  • Clean and trim source/target columns before from_pandas_edgelist to kill phantom nodes.
  • Keep directed and undirected relationships in separate graph objects.

Frequently Asked Questions

Why does NetworkX merge people I meant to keep separate?

NetworkX uses node labels as identity, so two different people sharing a name spelling become one node. Give every person a stable unique ID and store the display name as a node attribute instead.

Why are my edge weights being silently overwritten?

Calling add_edge twice on the same pair updates the existing edge in a plain Graph instead of adding a second one. Aggregate duplicate ties into a weight before loading, or use a MultiGraph if every instance must survive.

How do I fix a KeyError when reading node attributes?

A KeyError means the attribute was never set on that node, often because the node was created implicitly by add_edge or added without it. Fill defaults for the missing nodes with nx.set_node_attributes before you read, or guard with G.nodes[n].get('attr', default).

Why is betweenness centrality so slow on my graph?

Exact betweenness is roughly O(VE) and crawls on graphs above a few thousand nodes. Use the k parameter to sample source nodes for an approximation, and always run on the largest connected component.

How do I load a historical edge list correctly?

Use nx.from_pandas_edgelist with explicit source, target and edge_attr arguments after cleaning IDs. Loading directly from a messy CSV is the most common source of phantom nodes.

Can NetworkX handle directed correspondence and undirected co-presence together?

Not in one object. Build a DiGraph for directional flows like letters sent, and a separate Graph for symmetric co-presence, then compare metrics rather than mixing both in a single graph.