Balancing anonymisation and utility means removing enough identifying detail to protect people while keeping enough to answer your research question — and accepting that you cannot maximise both at once. The practical answer is to generalise rather than delete, target the rare records that actually leak identity, and offer two tiers: a coarse open dataset and a detailed restricted one. This guide runs the workflow end to end with a worked example.
Why can't I just have both?
Privacy and utility pull in opposite directions on the same fields. The birth year, occupation and parish that let you trace social mobility are exactly the combination that re-identifies a person. Strip them and the data is safe but mute; keep them and it is rich but exposing. So you do not "solve" the tension — you tune it to the minimum protection that prevents harm while preserving the patterns you need.
What actually identifies people — and it is not just names
Removing names and stopping there is the beginner's mistake. Re-identification works through quasi-identifiers: fields that are harmless alone but combine to single someone out. Latanya Sweeney's well-known study estimated that full date of birth, sex and five-digit ZIP code alone uniquely identify about 87% of the US population. In historical data the equivalents are birth year + occupation + parish, or admission date + ward + age.
```text
name removed?           -> necessary, not sufficient
unique combo of fields? -> the real risk
e.g. "1847, blacksmith, Eyam" may match exactly one person
```

Audit your quasi-identifiers before you touch anything else.
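One way to run that audit is to count, for every combination of candidate fields, what share of records is unique on it. The sketch below assumes a pandas DataFrame with one row per person; the function name `unique_share_by_combo` and the toy records are invented for illustration.

```python
from itertools import combinations

import pandas as pd


def unique_share_by_combo(df: pd.DataFrame, fields: list[str]) -> pd.Series:
    """Share of records that are unique on each combination of candidate fields."""
    shares = {}
    for r in range(1, len(fields) + 1):
        for combo in combinations(fields, r):
            # Number of records sharing each distinct profile for this combination.
            sizes = df.groupby(list(combo)).size()
            # Every group of size 1 is exactly one uniquely exposed record.
            shares[" + ".join(combo)] = float((sizes == 1).sum() / len(df))
    return pd.Series(shares).sort_values(ascending=False)


# Toy records, invented for illustration only.
records = pd.DataFrame({
    "birth_year": [1847, 1847, 1850, 1850],
    "occupation": ["blacksmith", "miner", "miner", "miner"],
    "parish": ["Eyam", "Eyam", "Eyam", "Hope"],
})
audit = unique_share_by_combo(records, ["birth_year", "occupation", "parish"])
print(audit)
```

In this toy set no record is unique on birth year alone, yet every record is unique once all three fields are combined, which is the pattern the audit is meant to surface.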
How do I measure the risk? A k-anonymity check
k-anonymity asks: across your quasi-identifiers, how many people share each record's exact profile? If any profile is unique (k = 1), that person is exposed.
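```python
import pandas as pd

# Assumes df is a DataFrame already loaded, with one row per person.
qi = ["birth_year", "occupation", "parish"]

# Number of records sharing each distinct quasi-identifier profile.
group_sizes = df.groupby(qi).size()

# Attach to every record the size of the group it belongs to (its k).
df["k"] = df.set_index(qi).index.map(group_sizes)

print("smallest k:", df["k"].min())
print("records with k < 5:", (df["k"] < 5).sum())
```

Any record with a low k is a candidate for generalisation or suppression. Set a k threshold that matches the sensitivity of the data: 5 is a common floor, and special-category data warrants a higher one.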
Which technique protects most utility?
Prefer the gentlest tool that hits your k target. Ranked from least to most destructive:
| Technique | What it does | Utility cost |
|---|---|---|
| Generalisation | decade not year, county not parish | low — patterns survive |
| Suppression | blank only the rare outliers | low if few records |
| Pseudonymisation | replace IDs, keep a secure key | none for analysis; still personal data |
| Perturbation | add small random noise | medium; breaks exact joins |
| Aggregation | release counts, not individuals | high; loses record-level study |
Generalise broadly first, then suppress the handful of stubborn outliers, rather than blunt-force deleting whole columns.
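The pseudonymisation row can be illustrated with keyed hashing. This is one common approach, not necessarily the one the table has in mind (a separately stored lookup table is another). The sketch uses Python's standard `hmac` module; the function name `pseudonymise`, the identifier strings and the key are placeholders.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a stable code for an identifier using keyed hashing (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated to 16 hex characters for readability


# Placeholder key for illustration; a real key would be generated randomly
# and stored apart from the released data.
key = b"example-key-not-for-real-use"

code_a = pseudonymise("Register 12, entry 304", key)
code_b = pseudonymise("Register 12, entry 304", key)
code_c = pseudonymise("Register 12, entry 305", key)

print(code_a == code_b)  # same identifier and key give the same code
print(code_a == code_c)  # different identifiers give different codes
```

Because anyone holding the key can recompute the codes and re-link records, the output remains personal data for as long as the key exists.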
A worked generalisation pass
Coarsen the quasi-identifiers just enough to lift everyone above your k floor:
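```python
# Generalise to reduce uniqueness.
# parish_to_county is a lookup dict you supply. occupation_to_group is a
# hypothetical lookup added here, because the original does not show how
# occupation_group is derived.
df["birth_decade"] = (df["birth_year"] // 10) * 10
df["region"] = df["parish"].map(parish_to_county)
df["occupation_group"] = df["occupation"].map(occupation_to_group)

# Recompute k on the coarser profile.
qi2 = ["birth_decade", "occupation_group", "region"]
sizes = df.groupby(qi2).size()
df["k2"] = df.set_index(qi2).index.map(sizes)

# Suppress the residue that is still below the k floor.
df.loc[df["k2"] < 5, "occupation_group"] = "withheld"
```

Re-run the k check after suppression, because the withheld records now form profiles of their own, and iterate until the minimum clears your threshold. Each pass trades a slice of granularity for safety, so stop as soon as you are safe, not later.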
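The "stop the moment you are safe" rule can be sketched as a loop over a ladder of bin widths, returning the narrowest one that clears the floor. The helper names `min_k` and `coarsen_birth_year`, the ladder of widths and the toy records are assumptions made for this example.

```python
import pandas as pd


def min_k(df: pd.DataFrame, qi: list[str]) -> int:
    """Smallest group size across all quasi-identifier profiles."""
    return int(df.groupby(qi).size().min())


def coarsen_birth_year(df, other_qi, widths=(1, 5, 10, 25, 50), k_min=5):
    """Try successively wider birth-year bins; return the first that reaches k_min."""
    for width in widths:
        trial = df.copy()
        trial["birth_bin"] = (trial["birth_year"] // width) * width
        k = min_k(trial, ["birth_bin", *other_qi])
        if k >= k_min:
            return trial, width, k
    return None, None, None  # no width on the ladder was enough


# Toy records, invented for illustration only.
people = pd.DataFrame({
    "birth_year": [1841, 1843, 1846, 1848],
    "occupation_group": ["metal trades"] * 4,
    "region": ["Derbyshire"] * 4,
})
safe, width, k = coarsen_birth_year(people, ["occupation_group", "region"], k_min=3)
print(width, k)
```

With a floor of 3, single years and 5-year bins both leave groups that are too small, so the loop settles on decades and goes no coarser.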
Should I publish one dataset or two?
Two, almost always. The two-tier model resolves most of the tension:
- Open tier: generalised, k-anonymous, freely downloadable for teaching and broad reuse.
- Restricted tier: full detail under a data-access agreement, for vetted researchers with a stated purpose.
This way the public utility and the protective layer coexist instead of fighting. Document which fields differ between tiers so no one mistakes the coarse version for the complete one.
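A minimal sketch of the split, assuming the generalised columns already sit alongside the detailed ones in a single prepared DataFrame. The mapping `OPEN_TIER_FIELDS`, the function `make_tiers` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical mapping from each detailed field to its open-tier replacement.
OPEN_TIER_FIELDS = {
    "birth_year": "birth_decade",
    "parish": "region",
}


def make_tiers(df: pd.DataFrame):
    """Split a prepared frame into an open tier, a restricted tier, and a diff table."""
    detailed = list(OPEN_TIER_FIELDS)
    generalised = list(OPEN_TIER_FIELDS.values())
    open_tier = df.drop(columns=detailed)           # coarse fields only
    restricted_tier = df.drop(columns=generalised)  # full-detail fields only
    differences = pd.DataFrame(
        {"restricted_field": detailed, "open_field": generalised}
    )
    return open_tier, restricted_tier, differences


# Sample rows, invented for illustration only.
prepared = pd.DataFrame({
    "birth_year": [1847, 1852],
    "birth_decade": [1840, 1850],
    "parish": ["Eyam", "Hope"],
    "region": ["Derbyshire", "Derbyshire"],
    "occupation_group": ["metal trades", "mining"],
})
open_tier, restricted_tier, differences = make_tiers(prepared)
print(differences)
```

The `differences` table is the piece to publish with both tiers, so that reusers can see which fields were coarsened. The open tier should still pass the k check before release.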
How do I document the trade-off I made?
Record the decisions or the dataset is unreproducible. In the data dictionary note, for each field: original granularity, released granularity, and why. State your k threshold, the suppression rule, and whether the open tier is anonymised or merely pseudonymised (which is legally still personal data). A short methods note saying "birth year generalised to decade; profiles below k=5 suppressed; full data available under agreement" tells reusers exactly what they can and cannot conclude.
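These decisions can also be kept in machine-readable form next to the data. The structure below is only an illustration: the key names, the reasons given and the `methods_note` helper are assumptions, not an established schema.

```python
import json

# Illustrative entries; key names and wording are assumptions, not a standard.
release_notes = {
    "k_threshold": 5,
    "suppression_rule": "profiles below k=5 have occupation_group set to 'withheld'",
    "open_tier_status": "anonymised",
    "fields": [
        {"field": "birth_year", "original": "year", "released": "decade",
         "reason": "year + occupation + parish was unique for some records"},
        {"field": "parish", "original": "parish", "released": "county",
         "reason": "small parishes produced groups below the k threshold"},
    ],
}


def methods_note(notes: dict) -> str:
    """Build a one-line methods note from the recorded decisions."""
    changes = "; ".join(
        f"{f['field']} generalised from {f['original']} to {f['released']}"
        for f in notes["fields"]
    )
    return f"{changes}; profiles below k={notes['k_threshold']} suppressed."


note = methods_note(release_notes)
print(note)
print(json.dumps(release_notes, indent=2))
```

Generating the methods note from the same record that drives the release keeps the published description from drifting away from what was actually done.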
Key Takeaways
- Anonymisation and utility trade off on the same fields — tune the balance, do not maximise either.
- Re-identification runs through quasi-identifiers, not just names; audit field combinations first.
- Use a k-anonymity check to find records that uniquely expose someone; a k floor of 5 is a common baseline.
- Generalise (decade, county) before you suppress, and suppress before you delete whole columns.
- Pseudonymisation keeps a re-linkage key and is still personal data; true anonymisation is irreversible.
- Publish a coarse open tier plus a detailed restricted tier, and document every field's granularity.
Frequently Asked Questions
Why is there a trade-off between anonymisation and utility?
Every detail you remove or blur to protect identity also removes analytical value. Stripping ages, places, and dates protects people but destroys the very patterns historians study, so you tune the balance rather than maximising either end.
What is a quasi-identifier?
A field that is not unique on its own but can identify someone when combined with others — like birth year, occupation, and parish together. Quasi-identifiers, not just names, are where re-identification happens.
What does k-anonymity mean in plain terms?
It means every record looks identical to at least k-1 others on its quasi-identifiers, so no individual stands out. A k of 5 means each combination is shared by at least five people.
Is pseudonymisation the same as anonymisation?
No. Pseudonymisation replaces identifiers with codes but keeps a re-linkage key, so the data is still personal data legally. True anonymisation is irreversible and falls outside data-protection law.
How do I keep utility high while protecting privacy?
Generalise rather than delete (decade not year, county not parish), suppress only the rare outliers, and release detailed data under restricted access while publishing a coarser open version.