To model a correspondence network, treat each letter as a directed edge from sender to recipient, record one row per letter with a stable ID and date, then aggregate repeated exchanges into weighted edges. The hard work is not the graph software — it is disambiguating people and places so that "J. Locke" and "John Locke" become one node. Get that right and the network almost builds itself.
This guide walks the full workflow a historian uses to turn a box of letters or a catalogue export into a defensible, reproducible network model.
How should I structure the raw data?
Start from a flat letters table — one row per surviving letter — not from a graph. This keeps your evidence auditable. A practical schema:
```csv
letter_id,sender_id,recipient_id,date,place_of_writing,language,confidence,archive
L0001,locke_john,clarke_edward,1690-03-12,Oates,en,high,Bodleian
L0002,locke_john,newton_isaac,1690-09-08,London,en,medium,Bodleian
L0003,unknown,locke_john,1691,,la,low,BL
```
Every analytical view (directed graph, weighted graph, temporal slices) is generated from this table, so the table is your single source of truth.
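Because everything downstream depends on this table, it can help to check it mechanically before building any graph. The following is a minimal sketch using only the standard library. The `validate` helper, the required-field list, the accepted date shapes (YYYY, YYYY-MM or YYYY-MM-DD) and the deliberately broken fourth row are all assumptions made for illustration, not part of any catalogue standard.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical rules: these fields must be filled, and dates must look like
# YYYY, YYYY-MM or YYYY-MM-DD (blank dates are allowed for undated letters).
REQUIRED = ("letter_id", "sender_id", "recipient_id")
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

# The three sample rows from the schema plus one invented, faulty row.
SAMPLE = """\
letter_id,sender_id,recipient_id,date,place_of_writing,language,confidence,archive
L0001,locke_john,clarke_edward,1690-03-12,Oates,en,high,Bodleian
L0002,locke_john,newton_isaac,1690-09-08,London,en,medium,Bodleian
L0003,unknown,locke_john,1691,,la,low,BL
L0003,locke_john,,12 March 1692,Oates,en,low,BL
"""


def validate(rows):
    """Return a list of (letter_id, problem) pairs for the given rows."""
    problems = []
    # Report each duplicated ID once.
    counts = Counter(row["letter_id"] for row in rows)
    for letter_id, n in counts.items():
        if n > 1:
            problems.append((letter_id, "duplicate letter_id"))
    # Check every row for blank required fields and odd date strings.
    for row in rows:
        for field in REQUIRED:
            if not row[field].strip():
                problems.append((row["letter_id"], f"missing {field}"))
        if row["date"] and not DATE_PATTERN.match(row["date"]):
            problems.append((row["letter_id"], "unparseable date"))
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = validate(rows)
for letter_id, problem in problems:
    print(letter_id, problem)
```

Year-only dates such as 1691 pass the check on purpose: partial dates are evidence too, and rejecting them would push you towards guessing a day and month.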
How do I turn letters into edges?
A correspondence edge is sender -> recipient. Because the same pair often exchange dozens of letters, you aggregate. In pandas:
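```python
import pandas as pd

# Load the per-letter table (one row per surviving letter).
letters = pd.read_csv("letters.csv")

# Count letters per ordered (sender, recipient) pair. groupby drops rows
# with a blank key, so give unknown parties an explicit ID such as "unknown".
edges = (
    letters
    .groupby(["sender_id", "recipient_id"])
    .size()
    .reset_index(name="weight")
)

# Gephi recognises Source, Target and Weight as edge-table column headers.
edges.columns = ["Source", "Target", "Weight"]
edges.to_csv("edges.csv", index=False)
```
The Weight column now records exchange intensity, which you map to edge thickness in Gephi. Keep the per-letter table for any temporal work.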
Why is entity disambiguation the real bottleneck?
Names in catalogues are inconsistent: spelling variants, Latinised forms, titles, and shared names. Resolve them before building edges. Use OpenRefine's clustering (key collision with the fingerprint method first, then nearest neighbour for looser matches) to merge variants, then reconcile against an authority file:
- VIAF for canonical name forms and life dates.
- Wikidata Q-numbers for stable, language-independent IDs.
- A local people.csv mapping every catalogue string to one node_id.
Store the authority ID on the node so a reviewer can verify each merge. A single mis-merge of two people called "Thomas Smith" silently fabricates a connection that never existed.
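To see what key-collision clustering does to name strings, here is a minimal sketch loosely modelled on the idea behind OpenRefine's fingerprint method. It is not OpenRefine's own code, and the `fingerprint` function and the sample catalogue strings are illustrative assumptions.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Reduce a name string to a normalised key for grouping variants."""
    # Trim, lowercase, and strip accents down to plain ASCII.
    text = unicodedata.normalize("NFKD", name.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace punctuation with spaces, then sort and de-duplicate tokens.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


# Hypothetical catalogue strings that might all refer to one person.
catalogue_strings = [
    "John Locke",
    "Locke, John",
    "john  locke.",
    "J. Locke",
    "Isaac Newton",
]

# Strings that share a key are candidates for merging into one node.
clusters = defaultdict(list)
for raw in catalogue_strings:
    clusters[fingerprint(raw)].append(raw)

for key, variants in clusters.items():
    print(key, "->", variants)
```

Note that "J. Locke" lands in its own cluster: string normalisation alone cannot decide that an initial stands for John. That judgement belongs in people.csv, backed by an authority ID.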
How do I model awkward letters?
Real correspondence breaks tidy rules. Handle the common cases explicitly:
| Case | Modelling choice |
|---|---|
| Circular letter to 5 people | Five edges sharing one letter_id |
| Unknown sender | Edge from an unknown node, flagged low confidence |
| Letter via intermediary | Edge sender→recipient; note carrier in metadata |
| Enclosure within a letter | Treat as one letter unless catalogued separately |
| Undated letter | Keep in graph, exclude from temporal slices |
Making these decisions once, in writing, is what makes your network reproducible.
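The circular-letter rule from the table can be sketched in pandas. The example assumes, purely for illustration, that the raw catalogue export lists several recipients in one cell separated by semicolons. The letter IDs and the `limborch_philipp` node ID are hypothetical.

```python
import pandas as pd

# Hypothetical raw export: L0101 is a circular letter to three people.
raw = pd.DataFrame({
    "letter_id": ["L0101", "L0102"],
    "sender_id": ["locke_john", "locke_john"],
    "recipient_id": [
        "clarke_edward;newton_isaac;limborch_philipp",
        "clarke_edward",
    ],
})

# Split the recipient cell into a list, then emit one row per recipient.
# Every resulting row keeps the original letter_id.
letters = (
    raw
    .assign(recipient_id=raw["recipient_id"].str.split(";"))
    .explode("recipient_id")
    .reset_index(drop=True)
)

print(letters)
```

Because the rows share a letter_id, you can still count distinct letters with `letters["letter_id"].nunique()` even though the edge count has grown.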
How do I add the time dimension?
Correspondence is inherently temporal. Add a date to each edge and build slices — for example, one network per decade — so you can watch a circle grow or fracture. In NetworkX:
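```python
import networkx as nx

# letters is the per-letter DataFrame loaded earlier.
# MultiDiGraph keeps one edge per letter; a plain DiGraph would collapse
# repeated letters between the same pair and keep only one date.
G = nx.from_pandas_edgelist(
    letters, "sender_id", "recipient_id",
    edge_attr=["letter_id", "date"], create_using=nx.MultiDiGraph)
```
Then filter edges by date range to generate each slice. This reveals, for instance, how a scholar's network contracted after exile.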
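One way to build decade slices is sketched below. It uses a MultiDiGraph so that every letter keeps its own edge and date, reads the decade from the first four characters of the date string, and leaves undated letters out of every slice. The helper names `decade_of` and `slice_by_decade` are illustrative, and rows L0004 and L0005 are invented for the example.

```python
import networkx as nx

# (letter_id, sender, recipient, date); an empty date marks an undated letter.
letters = [
    ("L0001", "locke_john", "clarke_edward", "1690-03-12"),
    ("L0002", "locke_john", "newton_isaac", "1690-09-08"),
    ("L0003", "unknown", "locke_john", "1691"),
    ("L0004", "locke_john", "clarke_edward", "1701-05-02"),  # invented row
    ("L0005", "clarke_edward", "locke_john", ""),            # invented, undated
]

# One edge per letter, keyed by letter_id so parallel edges stay distinct.
G = nx.MultiDiGraph()
for letter_id, sender, recipient, date in letters:
    G.add_edge(sender, recipient, key=letter_id, date=date)


def decade_of(date):
    """Return the decade (e.g. 1690) of a date string, or None if undated."""
    year = date[:4]
    return int(year) // 10 * 10 if year.isdigit() else None


def slice_by_decade(graph, decade):
    """Return a copy of the graph holding only letters from one decade."""
    keep = [
        (u, v, k)
        for u, v, k, date in graph.edges(keys=True, data="date")
        if decade_of(date) == decade
    ]
    return graph.edge_subgraph(keep).copy()


slice_1690s = slice_by_decade(G, 1690)
slice_1700s = slice_by_decade(G, 1700)
print(slice_1690s.number_of_edges(), slice_1700s.number_of_edges())
```

The undated letter stays in `G` for aggregate measures but appears in neither slice, which matches the rule of never assigning a guessed date.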
What checks should I run before trusting the network?
- Self-loops — a sender equal to recipient usually signals a data error.
- Degree outliers — a node with implausibly high degree often hides un-merged duplicates.
- Orphan recipients — recipients who never appear as senders may be a collection bias, not a finding.
- Date coverage — chart letters per year; gaps reflect survival, not silence.
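These checks can be roughed out in a few lines of pandas. The table below is a small invented sample (rows L0004 and L0005 are hypothetical, and L0004 is a deliberate self-loop). What counts as an "implausibly high" degree is left to your judgement of the collection.

```python
import pandas as pd

letters = pd.DataFrame({
    "letter_id": ["L0001", "L0002", "L0003", "L0004", "L0005"],
    "sender_id": ["locke_john", "locke_john", "unknown",
                  "locke_john", "clarke_edward"],
    "recipient_id": ["clarke_edward", "newton_isaac", "locke_john",
                     "locke_john", "locke_john"],
    "date": ["1690-03-12", "1690-09-08", "1691", "1692-01-01", None],
})

# Self-loops: sender equals recipient, usually a data-entry error.
self_loops = letters[letters["sender_id"] == letters["recipient_id"]]

# Degree by letter count: inspect the top of this list for un-merged duplicates.
degree = pd.concat([letters["sender_id"], letters["recipient_id"]]).value_counts()

# Orphan recipients: people who receive letters but never send any.
orphans = set(letters["recipient_id"]) - set(letters["sender_id"])

# Date coverage: letters per year, ignoring undated rows.
years = pd.to_numeric(letters["date"].str[:4], errors="coerce").dropna().astype(int)
per_year = years.value_counts().sort_index()

print(self_loops["letter_id"].tolist())
print(degree.head())
print(orphans)
print(per_year)
```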
What does a finished model look like?
The deliverable is three artefacts: the per-letter table, a reconciled people.csv, and the derived edges.csv, plus a short README documenting every modelling decision above. Anyone can then reproduce your Gephi figure from the raw letters — which is exactly the standard Mapping the Republic of Letters set for the field.
Key Takeaways
- Keep a per-letter table as your source of truth; derive every graph from it.
- Model letters as directed edges and aggregate repeats into weighted edges.
- Disambiguate people against VIAF or Wikidata before building edges, not after.
- Handle circular letters, unknown senders and undated items with explicit rules.
- Add dates to edges so you can build temporal slices of the network.
- Audit self-loops, degree outliers and date coverage before drawing conclusions.
- Ship the letters table, reconciled people list, edges and a README together.
Frequently Asked Questions
What counts as an edge in a correspondence network?
One letter sent from a writer to a recipient is the natural directed edge. When the same pair exchange many letters you aggregate them into a single weighted edge whose weight is the letter count, preserving direction from sender to receiver.
How do I handle letters with multiple recipients or unknown senders?
Model a circular letter to several recipients as one edge per recipient sharing the same letter ID. For unknown senders or recipients, use an explicit 'unknown' node rather than dropping the letter, so your totals stay honest and the gap is visible.
Should correspondence edges be directed or undirected?
Use directed edges because epistolary exchange has a clear sender and receiver, and direction lets you separate prolific writers from popular recipients. Collapse to undirected only when your question is purely about contact, not flow.
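A small standard-library sketch shows what direction buys you. The letter pairs are an invented sample: letters sent and letters received are counted separately, then the pairs are collapsed into undirected contacts.

```python
from collections import Counter

# (sender, recipient) for each letter in a small invented sample.
letters = [
    ("locke_john", "clarke_edward"),
    ("locke_john", "newton_isaac"),
    ("unknown", "locke_john"),
    ("locke_john", "clarke_edward"),
]

# Directed view: who writes a lot versus who receives a lot.
letters_sent = Counter(sender for sender, _ in letters)
letters_received = Counter(recipient for _, recipient in letters)

# Undirected view: each pair counts once, whoever wrote to whom.
contacts = {frozenset(pair) for pair in letters}

print(letters_sent.most_common(1))
print(letters_received.most_common(1))
print(len(contacts))
```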
What metadata should I record per letter before building the graph?
Capture at minimum a letter ID, sender, recipient, date and source archive. Adding place of writing, language and a confidence flag lets you filter, build temporal slices and audit uncertain attributions later.
Which tools suit correspondence networks?
OpenRefine or pandas for cleaning and reconciliation, and Gephi or Cytoscape for visualisation. For structure, follow the per-letter conventions popularised by projects such as Mapping the Republic of Letters. NetworkX handles temporal and statistical questions in Python.
How do I deal with undated letters?
Keep undated letters in the edge list but flag them, and exclude them from any time-sliced analysis while retaining them in the aggregate graph. Never silently assign a guessed date, as it corrupts temporal centrality and growth curves.