Best Practices for Analysing Networks in R

Analysing networks in R reliably comes down to three habits: build the graph from an explicit, traceable edge list; fix every random seed so results reproduce; and report metrics alongside the coverage and bias of the underlying sources. The igraph package does the heavy computation, tidygraph and ggraph give you tidy verbs and reproducible plots, and a written checklist keeps a whole collection consistent. Below is the workflow I apply to correspondence, kinship and citation networks drawn from archival material.

How do I structure the data before touching igraph?

Decide your node and edge tables before any analysis. A node table has one row per entity with a stable ID; an edge table has from, to, a weight, and a source column that points back to the archival record.

library(igraph)
library(tidygraph)

edges <- readr::read_csv("edges.csv")   # from, to, weight, source_doc
nodes <- readr::read_csv("nodes.csv")   # id, label, role, first_attested

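# The first two edge columns are read as endpoints and the first node column as
# the vertex ID; all remaining columns become edge or vertex attributes.
g <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)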

The source_doc column is the part most people skip. It is what lets you answer "which letter created this tie?" six months later, and it makes every edge defensible when a reviewer pushes back.
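A minimal sketch of the checks I would run on the two tables before building the graph. The toy rows and the names untraceable, orphan_ids and duplicate_ids are invented for illustration; only the column names come from the tables above.

```r
# Toy tables standing in for edges.csv and nodes.csv (invented data)
nodes <- data.frame(id = c("p1", "p2", "p3"),
                    label = c("Anna", "Bram", "Clara"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("p1", "p2", "p3"),
                    to = c("p2", "p3", "p9"),
                    weight = c(2, 1, 1),
                    source_doc = c("L-0012", NA, "L-0040"),
                    stringsAsFactors = FALSE)

# Edges with no back-reference to an archival record
untraceable <- edges[is.na(edges$source_doc) | edges$source_doc == "", ]

# Endpoints that do not appear in the node table
orphan_ids <- setdiff(union(edges$from, edges$to), nodes$id)

# Duplicate node IDs would make vertex matching ambiguous
duplicate_ids <- nodes$id[duplicated(nodes$id)]
```

Here untraceable holds the second edge and orphan_ids holds "p9". Catching orphans early is worthwhile because graph_from_data_frame() stops with an error when an edge endpoint is missing from the vertices table.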

Which centrality measure should I actually use?

Pick the measure that matches the historical question, not the one with the prettiest distribution. Degree answers "who is documented as connected to many people"; betweenness answers "who sits on paths between groups"; eigenvector answers "who is connected to well-connected people".

Measure	Question it answers	Cost	Caveat in archives
`degree()`	Most direct ties	Trivial	Inflated for over-documented figures
`betweenness()`	Brokers between clusters	`O(VE)`	Unstable on incomplete graphs
`closeness()`	Reach across the network	`O(VE)`	Needs a connected graph; scores are not comparable across components
`eigen_centrality()`	Embedded in influential cores	Eigen-solve	Sensitive to weighting choices

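Run them on the largest connected component, not the raw graph. On a disconnected graph, closeness is either undefined or computed separately within each component (the behaviour depends on the igraph version), so scores from different components cannot be compared.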
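A sketch of that step on a toy graph, assuming an undirected network. The graph and the names g_toy, giant and outside are invented; for a directed graph, components() uses weak connectivity by default.

```r
library(igraph)

# Toy undirected graph: one 4-node path and one separate pair (invented data)
g_toy <- graph_from_data_frame(
  data.frame(from = c("a", "b", "c", "x"),
             to   = c("b", "c", "d", "y")),
  directed = FALSE
)

# Find the largest connected component and keep only its vertices
comp <- components(g_toy)
giant_id <- which.max(comp$csize)
giant <- induced_subgraph(g_toy, which(comp$membership == giant_id))

# All four measures on the same, connected vertex set
scores <- data.frame(
  name        = V(giant)$name,
  degree      = degree(giant),
  betweenness = betweenness(giant),
  closeness   = closeness(giant),
  eigen       = eigen_centrality(giant)$vector
)

# Vertices outside the giant component, to be reported separately
outside <- setdiff(V(g_toy)$name, V(giant)$name)
```

Keeping outside as its own vector makes it easy to state in the methods note how many actors the scores do not cover.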

Why are my results not reproducible?

Any step that samples — community detection, force-directed layout, random walks — uses R's RNG. Set the seed once and commit it.

set.seed(1789)
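# Collapse to an undirected graph first and read the weights from THAT graph:
# collapsing merges edges, so E(g)$weight may no longer match in length.
ug <- as.undirected(g, mode = "collapse",
                    edge.attr.comb = list(weight = "sum", "ignore"))
comm <- cluster_louvain(ug, weights = E(ug)$weight)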
V(g)$community <- membership(comm)

Deterministic metrics (degree, components) are stable regardless, but cluster_louvain(), cluster_label_prop() and layout_with_fr() all wander without a fixed seed. Despite its name, cluster_walktrap() is deterministic. Record the igraph version too: packageVersion("igraph").
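One way to confirm that a stochastic step really is pinned down is to run it twice with the same seed and compare. This sketch uses igraph's built-in Zachary karate club graph as a stand-in; run_louvain and run_info are hypothetical helper names.

```r
library(igraph)

# Small built-in example graph, standing in for the archival network
g_demo <- make_graph("Zachary")

run_louvain <- function(graph, seed) {
  set.seed(seed)                      # reset the RNG before the stochastic step
  membership(cluster_louvain(graph))
}

first  <- run_louvain(g_demo, 1789)
second <- run_louvain(g_demo, 1789)

# TRUE when both runs assign every vertex to the same community
same_partition <- identical(as.integer(first), as.integer(second))

# Environment details worth saving next to the results
run_info <- list(seed = 1789,
                 igraph_version = as.character(packageVersion("igraph")),
                 r_version = R.version.string)
```

Writing run_info to a small text file alongside the output is usually enough to let someone else rerun the analysis under the same conditions.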

How do I handle the survival-bias problem?

Treat the network as a sample, never a census. Surviving letters over-represent the literate, the wealthy and the institutionally connected. Report coverage explicitly: how many actors, what date range, and what fraction of expected sources survive. A node with zero edges may be a hermit or simply someone whose papers were lost — and your metrics cannot tell the difference.
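A sketch of a coverage summary that can travel with every figure. All numbers are invented: expected_letters is a hypothetical count taken from a finding aid, and the code assumes one edge row per surviving letter.

```r
library(igraph)

# Invented tables; first_attested is a year
nodes_cov <- data.frame(id = c("p1", "p2", "p3", "p4"),
                        first_attested = c(1781, 1785, 1790, 1779))
edges_cov <- data.frame(from = c("p1", "p2"), to = c("p2", "p3"))
g_cov <- graph_from_data_frame(edges_cov, vertices = nodes_cov, directed = TRUE)

# Hypothetical figure from a catalogue, not something the graph can tell you
expected_letters  <- 40
surviving_letters <- nrow(edges_cov)   # assumes one row per letter

coverage <- list(
  n_actors        = vcount(g_cov),
  n_isolates      = sum(degree(g_cov, mode = "all") == 0),
  date_range      = range(V(g_cov)$first_attested),
  share_surviving = surviving_letters / expected_letters
)
```

The isolate count belongs in the report precisely because the graph cannot say whether those actors were unconnected or merely undocumented.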

A reusable quality checklist

Run this before reporting any figure from a graph:

Edge list carries a source_doc back-reference for every tie.
Node merges are logged in a versioned reconciliation table.
set.seed() is set and committed; igraph version recorded.
Metrics computed on the giant component, with disconnected nodes reported separately.
Weights used wherever multiple sources link the same pair.
Coverage and survival bias stated in the caption or methods note.
Plot exported with the seed and layout function named in the script.

Key Takeaways

Build graphs from an explicit edge list that traces back to archival records.
Use igraph for computation, tidygraph/ggraph for tidy manipulation and plotting.
Match the centrality measure to the historical question, not the data shape.
Fix set.seed() so community detection and layouts reproduce exactly.
Compute closeness and friends on the giant component to avoid undefined values.
Always report coverage and survival bias — centrality reflects documentation.
Keep weights when multiple sources link the same pair of actors.

Frequently Asked Questions

Should I use igraph or tidygraph for historical network analysis?

Use igraph for the maths (centrality, components, community detection) and tidygraph plus ggraph when you want dplyr-style verbs and reproducible plots. A tbl_graph is an igraph object with an extra class, so you can convert freely with as_tbl_graph() and as.igraph().
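A short round trip, as a sketch. It assumes both packages are installed; the ring graph and the column name deg are invented for illustration.

```r
library(igraph)
library(tidygraph)

g_ig <- make_ring(5)                   # plain igraph object

tg <- as_tbl_graph(g_ig) %>%           # wrap it for dplyr-style verbs
  activate(nodes) %>%
  mutate(deg = centrality_degree())    # stored as a node column

back <- as.igraph(tg)                  # back to a plain igraph object
```

The deg column added on the tidygraph side comes back as an ordinary vertex attribute, readable as V(back)$deg.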

How do I record which records produced each edge?

Keep an edge attribute that points back to the source — a document ID, folio reference or letter UID. Never collapse two people into one node without logging the merge in a versioned reconciliation table.
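A sketch of applying such a merge log, so that the raw edge list is never edited by hand. The table layout (old_id, new_id, reason) and the helper remap are assumptions, not a fixed convention.

```r
# Invented edge list in which "p7" and "p2" turned out to be the same person
edges_raw <- data.frame(from = c("p1", "p7", "p3"),
                        to   = c("p2", "p3", "p7"),
                        source_doc = c("L-001", "L-002", "L-003"),
                        stringsAsFactors = FALSE)

# Hypothetical reconciliation table, kept under version control
merges <- data.frame(old_id = "p7", new_id = "p2",
                     reason = "same signature, see L-002",
                     stringsAsFactors = FALSE)

# Map IDs through the merge log; IDs without an entry pass through unchanged
remap <- function(ids, log) {
  hit <- match(ids, log$old_id)
  ifelse(is.na(hit), ids, log$new_id[hit])
}

edges_clean <- edges_raw
edges_clean$from <- remap(edges_raw$from, merges)
edges_clean$to   <- remap(edges_raw$to, merges)
```

Because the merge lives in a table rather than in the data, reverting a doubtful identification means deleting one row and rerunning the script.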

Why do my communities and layouts change every time I run the script?

Stochastic steps (community detection, force-directed layouts, random walks) need a fixed seed. Call set.seed() once at the top and version-control it. Deterministic metrics such as degree never change between runs, but the output of cluster_louvain() will.

How big a network can igraph handle on a laptop?

igraph is C-backed and comfortably handles hundreds of thousands of edges. Most archival correspondence or kinship networks are well under 50,000 edges, so memory is rarely the bottleneck — interpretation is.

What is the single most common mistake in humanities network analysis?

Treating a sparse, biased sample of surviving sources as a complete network. Centrality reflects who is well-documented, not necessarily who was important, so always report coverage and survival bias.

Do I need to weight edges?

If multiple letters or transactions link the same pair, store the count as a weight rather than collapsing to a single unweighted edge. Many igraph functions accept a weights argument and ignoring it discards real signal.
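A sketch of turning one-row-per-letter data into a weighted edge list while keeping the back-references, using base R only. The data and the names letters_df, weighted and docs are invented.

```r
# One row per letter (invented data); the same pair can appear several times
letters_df <- data.frame(
  from = c("p1", "p1", "p2", "p1"),
  to   = c("p2", "p2", "p3", "p2"),
  source_doc = c("L-001", "L-007", "L-010", "L-015"),
  stringsAsFactors = FALSE
)

# Count letters per ordered pair to get the weight
letters_df$n <- 1
weighted <- aggregate(n ~ from + to, data = letters_df, FUN = sum)
names(weighted)[names(weighted) == "n"] <- "weight"

# Keep the back-references: concatenate the document IDs for each pair
docs <- aggregate(source_doc ~ from + to, data = letters_df,
                  FUN = paste, collapse = ";")
weighted <- merge(weighted, docs, by = c("from", "to"))
```

One caution when passing these weights on: igraph's path-based measures such as betweenness() and closeness() treat weights as distances, so a letter count may need inverting first, whereas cluster_louvain() treats larger weights as stronger ties.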
