Research Data Curation

To anonymise sensitive historical data, identify every direct and indirect identifier, decide field-by-field whether to remove, generalise, or pseudonymise it, and record each decision in a log kept separately from any reversible key. Anonymisation is only meaningful when re-identification is no longer reasonably likely given the data you publish plus the data already public — so the standard is contextual, not a fixed recipe. The hardest part is rarely names; it is the combination of birthplace, occupation and date that makes one row unique.

When do historical records actually need anonymising?

Data-protection law (the UK GDPR, the EU GDPR) protects identifiable living individuals. A baptism register from 1812 falls outside it; a 1980s patient-admission ledger or a recent oral-history interview does not. Use a simple test: could any data subject plausibly still be alive, and is anyone identifiable? If yes to both, treat the dataset as personal data and anonymise before open deposit. When in doubt, default to caution — funders and repositories increasingly expect a documented decision either way.
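The two-question test can be sketched as a small helper. The 100-year cut-off and the function names are assumptions for illustration (a rule of thumb, not a legal threshold), so adjust them to your repository's policy.

```python
from datetime import date

# Assumed cut-off: treat anyone born within the last 100 years as possibly alive.
# This is an illustrative rule of thumb, not a legal threshold.
ASSUMED_MAX_LIFESPAN = 100

def could_be_alive(birth_year, today=None):
    """Return True if a person born in birth_year could plausibly still be living."""
    current_year = (today or date.today()).year
    return current_year - birth_year < ASSUMED_MAX_LIFESPAN

def treat_as_personal_data(birth_years, identifiable, today=None):
    """Apply both questions: is anyone possibly alive, and is anyone identifiable?"""
    anyone_alive = any(could_be_alive(y, today) for y in birth_years)
    return anyone_alive and identifiable

as_of = date(2024, 1, 1)
print(treat_as_personal_data([1790, 1812], identifiable=True, today=as_of))  # False
print(treat_as_personal_data([1931, 1958], identifiable=True, today=as_of))  # True
```

Where birth years are unknown, a conservative estimate (for example, the event year minus a minimum plausible age) keeps the check on the cautious side.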

Direct vs indirect identifiers

Direct identifiers (name, address, NHS number) name a person outright. Quasi-identifiers (postcode, occupation, date of birth, rare diagnosis) identify only in combination. Latanya Sweeney's classic result estimated that 87% of the US population in the 1990 census could be uniquely identified by five-digit ZIP code, sex and full date of birth. Removing names alone is therefore almost never enough.

| Identifier type | Example | Typical treatment |
| --- | --- | --- |
| Direct | Full name | Remove or pseudonymise |
| Strong quasi | Exact date of birth | Generalise to year / band |
| Geographic | Full postcode | Truncate to district |
| Categorical (rare) | Unusual occupation | Group into broader class |
| Free text | Interview transcript | Manual review or restrict access |

A working step-by-step

text
1. Inventory every column; tag each as direct / quasi / sensitive / safe.
2. Decide a policy per tag (remove, generalise, pseudonymise, retain).
3. Apply transformations in a SCRIPT, not by hand, so it is repeatable.
4. Measure residual risk (k-anonymity on the quasi-identifier set).
5. Review free-text and image fields separately.
6. Log every decision; store any key offline and access-controlled.
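
Steps 1 and 2 can be captured as data rather than prose. This is a minimal sketch, assuming hypothetical column names and a made-up default policy per tag; the point is that an untagged column should stop the pipeline rather than slip through to release.

```python
# Hypothetical column inventory: column name -> tag (step 1)
INVENTORY = {
    "full_name": "direct",
    "dob": "quasi",
    "postcode_full": "quasi",
    "occupation": "quasi",
    "diagnosis": "sensitive",
    "ward": "safe",
}

# Assumed default action per tag (step 2); a real project may override per column
POLICY = {
    "direct": "remove",
    "quasi": "generalise",
    "sensitive": "retain",  # retained here, but still subject to the risk check in step 4
    "safe": "retain",
}

def plan(columns, inventory=INVENTORY, policy=POLICY):
    """Return {column: action}, refusing to proceed if any column is untagged."""
    untagged = [c for c in columns if c not in inventory]
    if untagged:
        raise ValueError(f"Untagged columns, classify before release: {untagged}")
    return {c: policy[inventory[c]] for c in columns}

print(plan(["full_name", "dob", "ward"]))
```

Calling `plan` with a column missing from the inventory raises `ValueError`, which forces the classification decision to be made explicitly.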

A minimal generalisation in Python with pandas:

python
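import pandas as pd

df = pd.read_csv("admissions.csv", dtype={"dob": str})  # expects ISO dates, e.g. 1931-06-15
# Generalise date of birth to a 5-year band such as "1930-1934"
band_start = df["dob"].str[:4].astype(int) // 5 * 5
df["birth_band"] = band_start.astype(str) + "-" + (band_start + 4).astype(str)
# Coarsen the postcode BEFORE dropping it: keep the area letters plus the first digit ("SW1A 1AA" -> "SW1")
df["postcode_area"] = df["postcode_full"].str.upper().str.extract(r"^([A-Z]{1,2}\d)", expand=False)
# Drop the direct identifier and the raw columns the generalised fields came from
df = df.drop(columns=["dob", "full_name", "postcode_full"])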

How do I check the result is actually anonymous?

Measure k-anonymity over your quasi-identifier set: group by those columns and find the smallest group size. Any group of size 1 is a re-identification risk.

python
quasi = ["birth_band", "sex", "postcode_area", "occupation_group"]
# dropna=False keeps rows with a missing quasi-identifier in the count
sizes = df.groupby(quasi, dropna=False).size()
print("Smallest equivalence class (k):", sizes.min())
print("Unique (k=1) rows:", (sizes == 1).sum())

If k is below 5, generalise further or suppress the offending rows. Tools like ARX or pycanon automate this and add l-diversity checks for sensitive attributes.
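
Suppression and a basic diversity check can be sketched with pandas group operations. The helpers `suppress_small_groups` and `min_distinct_sensitive` are hypothetical names, not part of ARX or pycanon, and the toy rows are invented; the second helper implements only the simplest "distinct" form of l-diversity.

```python
import pandas as pd

def suppress_small_groups(df, quasi, k=5):
    """Drop rows whose quasi-identifier combination occurs fewer than k times."""
    group_size = df.groupby(quasi)[quasi[0]].transform("size")
    return df[group_size >= k]

def min_distinct_sensitive(df, quasi, sensitive):
    """Distinct l-diversity: fewest distinct sensitive values found in any group."""
    return int(df.groupby(quasi)[sensitive].nunique().min())

# Toy data: five rows share one combination, one row is unique
df = pd.DataFrame({
    "birth_band": ["1980-1984"] * 5 + ["1985-1989"],
    "sex": ["F"] * 5 + ["M"],
    "diagnosis": ["asthma", "asthma", "flu", "flu", "asthma", "rare_condition"],
})
quasi = ["birth_band", "sex"]

safe = suppress_small_groups(df, quasi, k=5)
print(len(df), "->", len(safe), "rows")                        # 6 -> 5 rows
print("l =", min_distinct_sensitive(safe, quasi, "diagnosis"))  # l = 2
```

Suppression is a blunt instrument: if it removes a large share of rows, further generalisation of the quasi-identifiers is usually the better first move.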

What about free text and images?

Transcripts, letters and photographs leak identity through context, not just explicit names. Automated NER (spaCy, Stanza) catches person and place mentions but misses unique events ("the only midwife in the village"). Treat high-risk free text as a candidate for restricted access rather than open release — a mediated deposit with a data-access agreement is often more honest than aggressive redaction that destroys research value.
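
A short standard-library sketch shows why automated redaction falls short. It redacts names from a known list (a stand-in for NER output; the `redact_known_names` helper and the sample sentence are invented for illustration) and leaves the contextual clue untouched.

```python
import re

def redact_known_names(text, names, placeholder="[REDACTED]"):
    """Replace each known name with a placeholder, matching whole words, case-insensitively."""
    for name in sorted(names, key=len, reverse=True):  # longest first, so full names win
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, placeholder, text, flags=re.IGNORECASE)
    return text

sample = "Mary Adams was the only midwife in the village, and Mary delivered my brother."
redacted = redact_known_names(sample, ["Mary Adams", "Mary"])
print(redacted)
# The names are gone, but "the only midwife in the village" still points to one person.
```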

Pseudonymisation is not anonymisation

Replacing "Mary Adams" with SUBJ_0481 is pseudonymisation: reversible if you hold the key, and still personal data under the GDPR. It is a legitimate tool for linking records across files, but the key must live offline, encrypted, with documented access. Never embed the mapping in the same repository as the data.
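
A sketch of that separation, assuming a hypothetical `SUBJ_` numbering scheme: the function returns the pseudonyms and the reversible key as two separate objects, so the caller has to decide where each one is stored.

```python
def pseudonymise(names, prefix="SUBJ_", width=4):
    """Return (pseudonyms, key): one stable ID per distinct name, plus the reversible mapping."""
    key = {}
    pseudonyms = []
    for name in names:
        if name not in key:
            key[name] = f"{prefix}{len(key) + 1:0{width}d}"
        pseudonyms.append(key[name])
    return pseudonyms, key

names = ["Mary Adams", "John Price", "Mary Adams"]
ids, key = pseudonymise(names)
print(ids)  # ['SUBJ_0001', 'SUBJ_0002', 'SUBJ_0001']
# `key` is the reversible mapping: write it to offline, encrypted storage,
# never alongside the published data.
```

Sequential IDs reveal the order in which names first appear. If that order is itself informative, shuffle the input first or use random tokens instead.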

Key Takeaways

  • The legal trigger is identifiable living people; most pre-20th-century data is out of scope, but document the decision.
  • Names are the easy part — quasi-identifier combinations drive re-identification.
  • Generalise dates and places rather than deleting them to preserve analytical value.
  • Verify with a k-anonymity check; aim for k ≥ 5 on tabular data.
  • Apply transformations via a script so the process is reproducible and auditable.
  • Free text and images often need access control, not redaction.
  • Keep a separate anonymisation log; store any pseudonym key offline.

Frequently Asked Questions

Is anonymisation legally required for historical records?

It depends on whether records concern living people. The UK GDPR and most data-protection regimes apply only to identifiable living individuals, so 19th-century census data is usually out of scope, but recent oral histories or medical registers are not.

What is the difference between anonymisation and pseudonymisation?

Anonymisation removes or transforms identifiers irreversibly, so that re-identification is no longer reasonably likely. Pseudonymisation replaces identifiers with codes whose mapping key you keep separately, so it is reversible and still counts as personal data under the GDPR.

How do I anonymise dates of birth without breaking my analysis?

Generalise rather than delete: convert exact dates to a year or a five-year band, or compute age at an event instead of storing the birth date. This preserves cohort analysis while removing a strong quasi-identifier.
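
Computing age at an event might look like this in pandas. The column names `dob` and `admitted` are assumptions, and the two rows are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "dob": ["1931-06-15", "1958-11-02"],
    "admitted": ["1984-06-14", "1984-11-02"],
})
dob = pd.to_datetime(df["dob"])
event = pd.to_datetime(df["admitted"])

# Age in completed years: subtract one if the birthday had not yet occurred that year
before_birthday = (event.dt.month < dob.dt.month) | (
    (event.dt.month == dob.dt.month) & (event.dt.day < dob.dt.day)
)
df["age_at_admission"] = event.dt.year - dob.dt.year - before_birthday.astype(int)

# Keep the derived age and only the year of the event; drop both exact dates
df["admission_year"] = event.dt.year
df = df.drop(columns=["dob", "admitted"])
print(df)
```

Age plus event year still narrows the birth year to a two-year window, so treat the pair as a quasi-identifier when measuring residual risk.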

What is k-anonymity and do I need it?

k-anonymity means every combination of quasi-identifier values that occurs in your dataset is shared by at least k rows, so for any k of 2 or more no row is unique. For tabular historical data a k of 5 is a common floor; small parish datasets may need aggressive generalisation to reach it.

Can free text be safely anonymised?

Only with manual review or trained NER, and even then with caution. Letters and testimonies leak identity through context, dialect and unique events, so flag free-text fields as high-risk and consider access-controlled deposit instead.

How should I document what I anonymised?

Keep an anonymisation log recording each field, the transformation applied, and the rationale, separate from any pseudonym key. This makes the dataset defensible and lets a future curator understand residual risk.
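
One way to keep such a log is a plain CSV written by the same script that applies the transformations. The field names and the `write_log` helper below are assumptions for illustration, not a repository standard.

```python
import csv
import io

LOG_FIELDS = ["field", "transformation", "rationale"]

def write_log(entries, handle):
    """Write anonymisation decisions as CSV rows with a header."""
    writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(entries)

entries = [
    {"field": "full_name", "transformation": "removed",
     "rationale": "direct identifier"},
    {"field": "dob", "transformation": "generalised to 5-year band",
     "rationale": "strong quasi-identifier"},
]

buffer = io.StringIO()  # stands in for a file kept apart from any pseudonym key
write_log(entries, buffer)
print(buffer.getvalue())
```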