Analysing networks in R reliably comes down to three habits: build the graph from an explicit, traceable edge list; fix every random seed so results reproduce; and report metrics alongside the coverage and bias of the underlying sources. The igraph package does the heavy computation, tidygraph and ggraph give you tidy verbs and reproducible plots, and a written checklist keeps a whole collection consistent. Below is the workflow I apply to correspondence, kinship and citation networks drawn from archival material.
How do I structure the data before touching igraph?
Decide your node and edge tables before any analysis. A node table has one row per entity with a stable ID; an edge table has from, to, a weight, and a source column that points back to the archival record.
```r
library(igraph)
library(tidygraph)

edges <- readr::read_csv("edges.csv") # from, to, weight, source_doc
nodes <- readr::read_csv("nodes.csv") # id, label, role, first_attested
g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)
```

The source_doc column is the part most people skip. It is what lets you answer "which letter created this tie?" six months later, and it makes every edge defensible when a reviewer pushes back.
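Before building the graph, it can help to confirm that the two tables agree with each other. The following is a minimal sketch, not a full validator: it uses small in-memory data frames in place of the CSV files (every row is invented for illustration) and checks three assumptions, namely that node IDs are unique, that every edge endpoint appears in the node table, and that every edge carries a source_doc.

```r
library(igraph)

# Toy stand-ins for nodes.csv and edges.csv; all values are illustrative.
nodes <- data.frame(
  id = c("p1", "p2", "p3"),
  label = c("Person 1", "Person 2", "Person 3"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("p1", "p2"),
  to = c("p2", "p3"),
  weight = c(2, 1),
  source_doc = c("doc_001", "doc_002"),
  stringsAsFactors = FALSE
)

# Each check is TRUE when the tables are consistent.
ids_unique <- anyDuplicated(nodes$id) == 0
endpoints_known <- all(c(edges$from, edges$to) %in% nodes$id)
sources_present <- !anyNA(edges$source_doc) && all(nzchar(edges$source_doc))

# Stop early with an error if any assumption fails.
stopifnot(ids_unique, endpoints_known, sources_present)

g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)
```

Failing early here is cheaper than discovering an orphaned ID or an untraceable tie after the metrics have been reported.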
Which centrality measure should I actually use?
Pick the measure that matches the historical question, not the one with the prettiest distribution. Degree answers "who is documented as connected to many people"; betweenness answers "who sits on paths between groups"; eigenvector answers "who is connected to well-connected people".
| Measure | Question it answers | Cost | Caveat in archives |
|---|---|---|---|
| degree() | Most direct ties | Trivial | Inflated for over-documented figures |
| betweenness() | Brokers between clusters | O(VE) | Unstable on incomplete graphs |
| closeness() | Reach across the network | Needs connected graph | Undefined on disconnected components |
| eigen_centrality() | Embedded in influential cores | Eigen-solve | Sensitive to weighting choices |
Run them on the largest connected component, not the raw graph, or closeness silently breaks.
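One way to restrict the measures to the largest connected component is sketched below. The toy graph and the variable names (giant, scores, outside) are assumptions made for illustration; only the igraph calls come from the package itself.

```r
library(igraph)

# Toy undirected graph: a triangle with a tail (A-B-C-D) plus a separate pair.
g <- graph_from_literal(A - B, B - C, C - A, C - D, E - F)

# Find the connected components and keep the largest one.
comp <- components(g, mode = "weak")
giant_ids <- which(comp$membership == which.max(comp$csize))
giant <- induced_subgraph(g, giant_ids)

# Centralities computed on the giant component only.
scores <- data.frame(
  name = V(giant)$name,
  degree = unname(degree(giant)),
  betweenness = unname(betweenness(giant)),
  closeness = unname(closeness(giant)),
  eigenvector = unname(eigen_centrality(giant)$vector)
)

# Nodes outside the giant component, to be reported separately.
outside <- setdiff(V(g)$name, V(giant)$name)
```

Keeping outside as its own object makes it straightforward to list the excluded actors in a methods note instead of dropping them silently.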
Why are my results not reproducible?
Any step that samples — community detection, force-directed layout, random walks — uses R's RNG. Set the seed once and commit it.
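```r
set.seed(1789)

# Collapse to an undirected graph first, summing the weights of merged edges,
# so the weight vector matches the graph that Louvain actually sees.
ug <- as.undirected(g, mode = "collapse",
                    edge.attr.comb = list(weight = "sum", "ignore"))
comm <- cluster_louvain(ug, weights = E(ug)$weight)
V(g)$community <- membership(comm)
```

Deterministic metrics (degree, components) are stable regardless, but cluster_louvain(), cluster_label_prop() and layout_with_fr() all vary between runs without a fixed seed. (cluster_walktrap(), despite its name, is deterministic.) Record the igraph version too: packageVersion("igraph").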
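A quick way to confirm that a stochastic step is under control is to run it twice with the same seed and compare the results. The sketch below does this on a toy graph; the detect() helper is hypothetical and exists only for this example.

```r
library(igraph)

# Toy weighted undirected graph: two triangles joined by one weak tie.
edges <- data.frame(
  from = c("a", "b", "c", "d", "e", "f", "c"),
  to = c("b", "c", "a", "e", "f", "d", "d"),
  weight = c(3, 3, 3, 3, 3, 3, 1)
)
ug <- graph_from_data_frame(edges, directed = FALSE)

# Hypothetical helper: reseed R's RNG, then run one stochastic step.
detect <- function(graph, seed) {
  set.seed(seed)
  as.integer(membership(cluster_louvain(graph, weights = E(graph)$weight)))
}

run1 <- detect(ug, 1789)
run2 <- detect(ug, 1789)

# TRUE when the two seeded runs agree exactly.
same_result <- identical(run1, run2)

# Version string to store next to the results.
igraph_version <- as.character(packageVersion("igraph"))
```

If same_result is ever FALSE, something between the seed and the stochastic call is consuming random numbers, and the seed should be set closer to that call.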
How do I handle the survival-bias problem?
Treat the network as a sample, never a census. Surviving letters over-represent the literate, the wealthy and the institutionally connected. Report coverage explicitly: how many actors, what date range, and what fraction of expected sources survive. A node with zero edges may be a hermit or simply someone whose papers were lost — and your metrics cannot tell the difference.
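The coverage figures can be gathered in a few lines so they travel with the graph. This is a sketch under stated assumptions: the data are invented, and expected_sources is a hypothetical estimate that would have to come from outside the graph, for example a register or finding aid.

```r
library(igraph)

# Toy tables; p4 has no surviving ties.
nodes <- data.frame(
  id = c("p1", "p2", "p3", "p4"),
  first_attested = c(1761, 1765, 1770, 1772)
)
edges <- data.frame(
  from = c("p1", "p1", "p2"),
  to = c("p2", "p3", "p3"),
  source_doc = c("doc_001", "doc_002", "doc_002")
)
g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)

# Assumed estimate of how many documents once existed (hypothetical figure).
expected_sources <- 10

n_sources <- length(unique(E(g)$source_doc))
coverage <- list(
  n_actors = vcount(g),
  n_isolates = sum(degree(g) == 0),          # hermits or lost papers
  date_range = range(V(g)$first_attested),
  n_sources = n_sources,
  source_coverage = n_sources / expected_sources
)
```

The resulting list can be printed into a caption or methods note so that every centrality figure is read against the share of sources that survive.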
A reusable quality checklist
Run this before reporting any figure from a graph:
- Edge list carries a source_doc back-reference for every tie.
- Node merges are logged in a versioned reconciliation table.
- set.seed() is set and committed; igraph version recorded.
- Metrics computed on the giant component, with disconnected nodes reported separately.
- Weights used wherever multiple sources link the same pair.
- Coverage and survival bias stated in the caption or methods note.
- Plot exported with the seed and layout function named in the script.
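For the last item, one possible approach is to keep the seed, the layout function's name and the igraph version in the same object as the coordinates. The seeded_layout() helper below is hypothetical and only sketches the idea on a toy ring graph.

```r
library(igraph)

g <- make_ring(10) # toy graph standing in for the real network

# Hypothetical helper: compute a seeded layout and keep its provenance with it.
seeded_layout <- function(graph, seed, layout_name = "layout_with_fr") {
  layout_fn <- get(layout_name, envir = asNamespace("igraph"))
  set.seed(seed)
  list(
    coords = layout_fn(graph),
    seed = seed,
    layout = layout_name,
    igraph_version = as.character(packageVersion("igraph"))
  )
}

lay <- seeded_layout(g, seed = 1789)
again <- seeded_layout(g, seed = 1789)

# TRUE when the same seed reproduces the same coordinates.
layout_reproduces <- isTRUE(all.equal(lay$coords, again$coords))
```

Saving lay next to the exported figure (for instance with saveRDS()) lets a later reader regenerate the exact plot.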
Key Takeaways
- Build graphs from an explicit edge list that traces back to archival records.
- Use igraph for computation, tidygraph/ggraph for tidy manipulation and plotting.
- Match the centrality measure to the historical question, not the data shape.
- Fix set.seed() so community detection and layouts reproduce exactly.
- Compute closeness and friends on the giant component to avoid undefined values.
- Always report coverage and survival bias — centrality reflects documentation.
- Keep weights when multiple sources link the same pair of actors.
Frequently Asked Questions
Should I use igraph or tidygraph for historical network analysis?
Use igraph for the maths (centrality, components, community detection) and tidygraph plus ggraph when you want dplyr-style verbs and reproducible plots. They share the same underlying object, so you can convert freely with as_tbl_graph() and as.igraph().
How do I record which records produced each edge?
Keep an edge attribute that points back to the source — a document ID, folio reference or letter UID. Never collapse two people into one node without logging the merge in a versioned reconciliation table.
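A reconciliation table can be as simple as a data frame mapping each alias to a canonical ID. The sketch below is an illustration only: the column names (alias_id, canonical_id, reason) and the canonicalise() helper are assumptions, not an established schema.

```r
# Hypothetical reconciliation table: each row records one logged merge.
merges <- data.frame(
  alias_id = c("p2b", "p3x"),
  canonical_id = c("p2", "p3"),
  reason = c("spelling variant", "married name"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("p1", "p2b", "p3x"),
  to = c("p2", "p3", "p1"),
  source_doc = c("doc_001", "doc_002", "doc_003"),
  stringsAsFactors = FALSE
)

# Map IDs to their canonical form; IDs without a logged merge pass through.
canonicalise <- function(ids, merges) {
  hit <- match(ids, merges$alias_id)
  ifelse(is.na(hit), ids, merges$canonical_id[hit])
}

# Keep the IDs as found in the source, then overwrite with canonical ones.
edges$from_raw <- edges$from
edges$to_raw <- edges$to
edges$from <- canonicalise(edges$from, merges)
edges$to <- canonicalise(edges$to, merges)
```

Because the raw IDs stay in the edge table and the merges live in their own version-controlled file, any merge can be reviewed or reversed later.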
Why do my centrality scores change every time I run the script?
Stochastic steps (community detection, force-directed layouts, random walks) need a fixed seed. Call set.seed() once at the top and version-control it; deterministic metrics like degree never change but cluster_louvain() will.
How big a network can igraph handle on a laptop?
igraph is C-backed and comfortably handles hundreds of thousands of edges. Most archival correspondence or kinship networks are well under 50,000 edges, so memory is rarely the bottleneck — interpretation is.
What is the single most common mistake in humanities network analysis?
Treating a sparse, biased sample of surviving sources as a complete network. Centrality reflects who is well-documented, not necessarily who was important, so always report coverage and survival bias.
Do I need to weight edges?
If multiple letters or transactions link the same pair, store the count as a weight rather than collapsing to a single unweighted edge. Many igraph functions accept a weights argument and ignoring it discards real signal.
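As a rough sketch of that aggregation, the base R code below turns one row per letter into one weighted edge per ordered pair while keeping the document IDs. The letters_df table and its values are invented for illustration.

```r
library(igraph)

# Toy input: one row per surviving letter.
letters_df <- data.frame(
  from = c("A", "A", "A", "B", "C"),
  to = c("B", "B", "B", "C", "A"),
  source_doc = c("L1", "L2", "L3", "L4", "L5"),
  stringsAsFactors = FALSE
)

# Count letters per ordered pair to get the weight.
weight_df <- aggregate(source_doc ~ from + to, data = letters_df, FUN = length)
names(weight_df)[3] <- "weight"

# Keep the back-references by concatenating document IDs per pair.
docs_df <- aggregate(source_doc ~ from + to, data = letters_df,
                     FUN = paste, collapse = ";")

edges <- merge(weight_df, docs_df, by = c("from", "to"))
g <- graph_from_data_frame(edges, directed = TRUE)

# Weighted out-degree: total letters sent by each actor.
sent <- strength(g, mode = "out", weights = E(g)$weight)
```

Concatenating source_doc keeps each aggregated edge traceable to the individual letters behind it.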