R for the Humanities

Analysing networks in R reliably comes down to three habits: build the graph from an explicit, traceable edge list; fix every random seed so results reproduce; and report metrics alongside the coverage and bias of the underlying sources. The igraph package does the heavy computation, tidygraph and ggraph give you tidy verbs and reproducible plots, and a written checklist keeps a whole collection consistent. Below is the workflow I apply to correspondence, kinship and citation networks drawn from archival material.

How do I structure the data before touching igraph?

Decide your node and edge tables before any analysis. A node table has one row per entity with a stable ID; an edge table has from, to, a weight, and a source column that points back to the archival record.

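```r
library(igraph)
library(tidygraph)

# The first two columns of `edges` must be the endpoints; the rest become edge attributes.
edges <- readr::read_csv("edges.csv")   # from, to, weight, source_doc
# The first column of `nodes` must hold the IDs used in edges$from and edges$to.
nodes <- readr::read_csv("nodes.csv")   # id, label, role, first_attested

g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)
```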

The source_doc column is the part most people skip. It is what lets you answer "which letter created this tie?" six months later, and it makes every edge defensible when a reviewer pushes back.
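As a rough illustration of that lookup, the sketch below builds a three-edge graph from invented data and pulls the records behind one tie. The helper name `sources_for_tie` and all IDs are hypothetical; only `graph_from_data_frame()` and `as_data_frame()` come from igraph.

```r
library(igraph)

# Toy data, invented for illustration; real tables would come from edges.csv / nodes.csv.
edges <- data.frame(
  from       = c("p1", "p1", "p2"),
  to         = c("p2", "p3", "p3"),
  weight     = c(2, 1, 1),
  source_doc = c("MS-001", "MS-002", "MS-003"),
  stringsAsFactors = FALSE
)
nodes <- data.frame(
  id    = c("p1", "p2", "p3"),
  label = c("Actor A", "Actor B", "Actor C"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)

# Hypothetical helper: return the source_doc of every edge running from `a` to `b`.
sources_for_tie <- function(graph, a, b) {
  el <- igraph::as_data_frame(graph, what = "edges")
  el$source_doc[el$from == a & el$to == b]
}

sources_for_tie(g, "p1", "p3")
```

Because the back-reference travels with the edge as an attribute, it survives subsetting and export without a separate join.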

Which centrality measure should I actually use?

Pick the measure that matches the historical question, not the one with the prettiest distribution. Degree answers "who is documented as connected to many people"; betweenness answers "who sits on paths between groups"; eigenvector answers "who is connected to well-connected people".

| Measure | Question it answers | Cost | Caveat in archives |
| --- | --- | --- | --- |
| degree() | Most direct ties | Trivial | Inflated for over-documented figures |
| betweenness() | Brokers between clusters | O(VE) | Unstable on incomplete graphs |
| closeness() | Reach across the network | Needs connected graph | Undefined on disconnected components |
| eigen_centrality() | Embedded in influential cores | Eigen-solve | Sensitive to weighting choices |

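Run them on the largest connected component, not the raw graph. On a disconnected graph, closeness is worked out only over the vertices each node can reach (recent igraph versions return NaN for isolates), so scores from different components stop being comparable, often without any warning.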
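One way to follow that advice is sketched below on an invented undirected graph with two components. It uses `components()` and `induced_subgraph()` from igraph; for a directed graph you would probably want `components(g, mode = "weak")`, which is an assumption about your data rather than a rule.

```r
library(igraph)

# Toy undirected graph (invented): a four-node chain plus a separate pair.
g <- graph_from_literal(A - B - C - D, E - F)

# Keep only the largest connected component.
comp  <- components(g)
giant <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))

# All four measures computed on the same connected subgraph.
scores <- data.frame(
  id          = V(giant)$name,
  degree      = degree(giant),
  betweenness = betweenness(giant),
  closeness   = closeness(giant),
  eigenvector = eigen_centrality(giant)$vector
)

# Report what was left out rather than dropping it silently.
excluded <- setdiff(V(g)$name, V(giant)$name)
```

Keeping `excluded` alongside `scores` makes it easy to state in a methods note which actors fell outside the analysed component.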

Why are my results not reproducible?

Any step that samples — community detection, force-directed layout, random walks — uses R's RNG. Set the seed once and commit it.

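```r
set.seed(1789)
# Collapse to an undirected graph first, summing the weights of merged edges,
# then read the weights from the collapsed graph (its edge count can differ from g).
ug <- as.undirected(g, mode = "collapse",
                    edge.attr.comb = list(weight = "sum", "ignore"))
comm <- cluster_louvain(ug, weights = E(ug)$weight)
V(g)$community <- membership(comm)
```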

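Deterministic metrics (degree, components) are stable regardless, but cluster_louvain(), cluster_label_prop() and layout_with_fr() all vary from run to run without a fixed seed. cluster_walktrap(), despite its name, is deterministic. Record the igraph version too, with packageVersion("igraph"), because defaults and algorithms can change between releases.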
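A quick way to convince yourself that the seed is doing its job is to run the stochastic step twice and compare. The sketch below does that on an invented graph; `run_louvain` and `run_info` are hypothetical names, not part of igraph.

```r
library(igraph)

# Toy graph (invented): two five-node cliques joined by a single bridge edge.
g <- disjoint_union(make_full_graph(5), make_full_graph(5))
g <- add_edges(g, c(1, 6))

# Hypothetical wrapper: fix the RNG immediately before the stochastic call.
run_louvain <- function(graph, seed) {
  set.seed(seed)
  membership(cluster_louvain(graph))
}

m1 <- run_louvain(g, 1789)
m2 <- run_louvain(g, 1789)
identical(m1, m2)   # same seed, same partition

# Store the seed and package version next to the results.
run_info <- list(seed = 1789, igraph = as.character(packageVersion("igraph")))
```

Writing `run_info` out with the results (for instance into the same RDS or JSON file) means the seed and version cannot drift apart from the numbers they produced.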

How do I handle the survival-bias problem?

Treat the network as a sample, never a census. Surviving letters over-represent the literate, the wealthy and the institutionally connected. Report coverage explicitly: how many actors, what date range, and what fraction of expected sources survive. A node with zero edges may be a hermit or simply someone whose papers were lost — and your metrics cannot tell the difference.
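The coverage figures can be collected in one place so they are reported every time. The following is only a sketch on invented data; in particular `expected_sources` is a made-up number standing in for whatever independent estimate you have of how many sources once existed.

```r
library(igraph)

# Toy data, invented for illustration only.
nodes <- data.frame(
  id             = c("p1", "p2", "p3"),
  first_attested = c(1780, 1785, 1791)
)
edges <- data.frame(
  from       = c("p1", "p1"),
  to         = c("p2", "p2"),
  weight     = c(1, 1),
  source_doc = c("MS-001", "MS-002")
)
g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)

# Assumption: an independent estimate of the original number of sources.
# The figure below is hypothetical.
expected_sources <- 40

surviving <- length(unique(E(g)$source_doc))
coverage <- list(
  actors            = vcount(g),
  ties              = ecount(g),
  attested_range    = range(V(g)$first_attested),
  surviving_sources = surviving,
  share_surviving   = surviving / expected_sources,
  isolates          = V(g)$name[degree(g) == 0]   # lost papers or genuine hermits
)
str(coverage)
```

The `isolates` entry lists the actors whose position the metrics cannot speak to, which is exactly the group the paragraph above warns about.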

A reusable quality checklist

Run this before reporting any figure from a graph:

  • Edge list carries a source_doc back-reference for every tie.
  • Node merges are logged in a versioned reconciliation table.
  • set.seed() is set and committed; igraph version recorded.
  • Metrics computed on the giant component, with disconnected nodes reported separately.
  • Weights used wherever multiple sources link the same pair.
  • Coverage and survival bias stated in the caption or methods note.
  • Plot exported with the seed and layout function named in the script.
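Some of these items can be tested mechanically before a figure goes out. The sketch below covers three of them; `check_graph` is a hypothetical helper, the data are invented, and the attribute names `source_doc` and `weight` are assumed to match the edge table described earlier.

```r
library(igraph)

# Hypothetical helper: TRUE means the check passed.
check_graph <- function(graph) {
  has_attr <- function(name) name %in% edge_attr_names(graph)
  src <- if (has_attr("source_doc")) as.character(E(graph)$source_doc) else NULL
  w   <- if (has_attr("weight")) E(graph)$weight else NULL
  c(
    has_source_doc   = !is.null(src) && !anyNA(src) && all(nzchar(src)),
    has_weights      = !is.null(w) && !anyNA(w),
    no_parallel_ties = !any_multiple(graph)  # repeated pairs should be folded into weights
  )
}

# Toy edge list (invented) with one missing back-reference.
edges <- data.frame(
  from       = c("p1", "p2"),
  to         = c("p2", "p3"),
  weight     = c(2, 1),
  source_doc = c("MS-001", NA)
)
g <- graph_from_data_frame(edges, directed = TRUE)

check_graph(g)
```

The remaining items (logged merges, stated coverage, named layout function) concern the script and the write-up rather than the graph object, so they still need a human pass.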

Key Takeaways

  • Build graphs from an explicit edge list that traces back to archival records.
  • Use igraph for computation, tidygraph/ggraph for tidy manipulation and plotting.
  • Match the centrality measure to the historical question, not the data shape.
  • Fix set.seed() so community detection and layouts reproduce exactly.
  • Compute closeness and other path-based measures on the giant component so that scores are defined and comparable.
  • Always report coverage and survival bias — centrality reflects documentation.
  • Keep weights when multiple sources link the same pair of actors.

Frequently Asked Questions

Should I use igraph or tidygraph for historical network analysis?

Use igraph for the maths (centrality, components, community detection) and tidygraph plus ggraph when you want dplyr-style verbs and reproducible plots. They share the same underlying object, so you can convert freely with as_tbl_graph() and as.igraph().
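A minimal round trip might look like the sketch below, which assumes igraph, tidygraph and dplyr are installed and uses an invented five-node ring as stand-in data.

```r
library(igraph)
library(tidygraph)
library(dplyr)

# Toy graph (invented): a ring of five nodes.
g <- make_ring(5)

# igraph -> tidygraph: add a node column with a dplyr-style verb.
tg <- as_tbl_graph(g) %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree())

# tidygraph -> igraph: the new column comes back as a vertex attribute.
g2 <- as.igraph(tg)
V(g2)$degree
```

Since a `tbl_graph` also inherits from the igraph class, igraph functions can usually be called on `tg` directly; the explicit conversion mainly matters when a function checks the class strictly.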

How do I record which records produced each edge?

Keep an edge attribute that points back to the source — a document ID, folio reference or letter UID. Never collapse two people into one node without logging the merge in a versioned reconciliation table.
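One possible shape for such a reconciliation table is sketched below in base R. The column names (`alias_id`, `canonical_id`, `reason`), the helper `canonicalise` and all IDs are invented for illustration; the point is only that merges are applied from a table kept under version control rather than edited into the data by hand.

```r
# Toy edge list (invented): "p1b" is a variant record of "p1".
edges <- data.frame(
  from       = c("p1", "p1b", "p2"),
  to         = c("p2", "p3", "p3"),
  source_doc = c("MS-001", "MS-002", "MS-003"),
  stringsAsFactors = FALSE
)

# Reconciliation log: one row per merge decision, kept in version control.
merges <- data.frame(
  alias_id     = "p1b",
  canonical_id = "p1",
  reason       = "variant spelling of the same name",
  stringsAsFactors = FALSE
)

# Hypothetical helper: map alias IDs to canonical IDs, leave others unchanged.
canonicalise <- function(ids, merges) {
  hit <- match(ids, merges$alias_id)
  ifelse(is.na(hit), ids, merges$canonical_id[hit])
}

# Keep the raw IDs so every merge stays reversible.
edges$from_raw <- edges$from
edges$to_raw   <- edges$to
edges$from     <- canonicalise(edges$from, merges)
edges$to       <- canonicalise(edges$to, merges)
edges
```

Retaining `from_raw` and `to_raw` means a contested merge can be undone by editing one row of `merges` and rerunning the script.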

Why do my centrality scores change every time I run the script?

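Centrality measures such as degree are deterministic, so changing output almost always traces back to a stochastic step feeding the analysis: community detection, a force-directed layout or a random walk. Call set.seed() once at the top of the script and version-control it, so that cluster_louvain() and similar functions return the same result on every run.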

How big a network can igraph handle on a laptop?

igraph is C-backed and comfortably handles hundreds of thousands of edges. Most archival correspondence or kinship networks are well under 50,000 edges, so memory is rarely the bottleneck — interpretation is.

What is the single most common mistake in humanities network analysis?

Treating a sparse, biased sample of surviving sources as a complete network. Centrality reflects who is well-documented, not necessarily who was important, so always report coverage and survival bias.

Do I need to weight edges?

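If multiple letters or transactions link the same pair, store the count as a weight rather than collapsing to a single unweighted edge. Many igraph functions accept a weights argument, and ignoring it discards real signal. Be aware that path-based measures such as betweenness() and closeness() read weights as distances, so a letter count usually needs inverting (for example 1 / weight) before it is passed to them.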
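As a sketch of how the counting could be done, the example below folds an invented letter-level table into one weighted edge per ordered pair while keeping every contributing record. The table name `letters_tbl` and the IDs are hypothetical.

```r
library(igraph)

# Toy letter-level table (invented): one row per surviving letter.
letters_tbl <- data.frame(
  from       = c("p1", "p1", "p1", "p2"),
  to         = c("p2", "p2", "p3", "p3"),
  source_doc = c("MS-001", "MS-002", "MS-003", "MS-004"),
  stringsAsFactors = FALSE
)

# Weight = number of letters per ordered pair.
weight <- aggregate(source_doc ~ from + to, data = letters_tbl, FUN = length)
names(weight)[3] <- "weight"

# Keep all contributing records as one delimited back-reference.
docs <- aggregate(source_doc ~ from + to, data = letters_tbl,
                  FUN = function(x) paste(x, collapse = "; "))

edges <- merge(weight, docs, by = c("from", "to"))
g <- graph_from_data_frame(edges, directed = TRUE)

# strength() sums edge weights, so it counts letters sent rather than correspondents.
strength(g, mode = "out")
degree(g, mode = "out")
```

Comparing `strength()` with `degree()` shows what the weight adds: two actors with the same number of correspondents can differ sharply in the volume of surviving letters.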