Born-Digital Archives

Detect sensitive information in born-digital material whenever you intend to provide access to correspondence, administrative records, or personal devices, and the volume is too large to read by hand. Skip or lighten it only for already-public, fully-open, or trivially small material. The real question is not whether sensitive data exists, but whether automated detection earns its cost on this collection, and how much human review the results will demand. This article is about making that call deliberately rather than scanning everything reflexively or nothing at all.

When does automated detection actually pay off?

Automated detection pays off when three things line up: the material is unvetted, you will open it to researchers, and there is too much of it to read. A 40 GB laptop image or a 60-mailbox email export cannot be reviewed line by line, so a pattern scan that flags candidate identifiers turns an impossible job into a triage exercise. Conversely, a 30-file deposit of already-published reports needs a glance, not a pipeline.

What signals tell you a scan is needed?

Look for these markers in the accession before deciding:

  • Personal devices, email, or social media exports.
  • Records about living individuals or named third parties.
  • Financial, medical, legal or HR content.
  • Free text, as opposed to structured or already-redacted data.
  • A donor agreement that restricts personal data.

Any two of these together, and an automated first pass is the responsible default.
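The two-marker rule above can be sketched as a small checklist. This is only an illustration: the class name, the field names and the default threshold of two are assumptions that mirror the list, not an established standard or part of any tool.

```python
from dataclasses import dataclass


@dataclass
class AccessionSignals:
    """Hypothetical checklist mirroring the five markers listed above."""
    personal_devices_or_email: bool = False
    living_or_third_parties: bool = False
    financial_medical_legal_hr: bool = False
    free_text: bool = False
    donor_restricts_personal_data: bool = False


def needs_automated_pass(signals: AccessionSignals, threshold: int = 2) -> bool:
    """Return True when at least `threshold` markers are present."""
    present = sum(vars(signals).values())  # booleans sum as 0 or 1
    return present >= threshold


# A laptop image full of free-text email trips two markers.
laptop = AccessionSignals(personal_devices_or_email=True, free_text=True)
# A deposit of already-published reports trips none.
reports = AccessionSignals()

print(needs_automated_pass(laptop))
print(needs_automated_pass(reports))
```

Treat the result as a prompt for discussion rather than a gate: a single marker, such as a restrictive donor agreement, may justify a scan on its own.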

What do the tools detect, and how well?

Pattern-based scanners excel at structured identifiers and struggle with context. bulk_extractor sweeps an entire disk image for credit-card numbers, government IDs, emails and telephone numbers, even in unallocated space. Microsoft Presidio adds named-entity recognition for names, locations and organisations in text.

bash
# Sweep a disk image for structured identifiers
bulk_extractor -o be_out laptop.E01
# The actionable reports:
#   ccn.txt  pii.txt  telephone.txt  email.txt

python
# Context-aware scan of extracted text with Presidio
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
# Read the message once so the same string is analysed and sliced
with open("message.txt", encoding="utf-8") as fh:
    text = fh.read()
results = analyzer.analyze(text=text, language="en")
for r in results:
    # Each result carries an entity type, a confidence score and character offsets
    print(r.entity_type, r.score, text[r.start:r.end])

Neither is authoritative. Expect false positives on number-like strings and misses on unusual formats; both produce a worklist, not a verdict.
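To see why number-like strings generate false positives, consider a minimal sketch that pairs a loose digit pattern with the Luhn checksum used by payment card numbers. The regular expression and the function names are illustrative assumptions, not a description of how either tool works internally.

```python
import re

# Illustrative pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def card_candidates(text: str) -> list[tuple[str, bool]]:
    """Return each digit run that looks card-like, with its checksum result."""
    return [(m.group(), luhn_ok(m.group())) for m in CANDIDATE.finditer(text)]


sample = "Invoice 1234567890123456 paid with card 4111 1111 1111 1111."
for value, passes in card_candidates(sample):
    print(value, "passes Luhn" if passes else "fails Luhn")
```

Here the invoice number fails the checksum and can be deprioritised, while the second string passes. Passing is still not proof: 4111 1111 1111 1111 is a widely used test number, which is exactly the kind of value a reviewer would confirm and then suppress.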

When should you NOT lean on detection?

Situation                               Recommended posture
--------------------------------------  ------------------------------
Already public material                 Minimal or no scan
Fully open by donor agreement           Light spot-check only
Tiny, readable deposit                  Manual review beats tooling
Highly structured, pre-redacted data    Targeted scan, not full sweep
Large, mixed, personal material         Full automated triage

Detection is a cost as well as a control. Running heavy scanners on clearly-open material burns staff time adjudicating false positives for no protective benefit.

How do you keep false positives from drowning you?

Treat scanner output as a worklist to be tuned, not a list to be read whole. Sort flags by pattern type and frequency, suppress known-good values with a stop list, and sample-review the remainder. Crucially, record each suppression decision so the tuning carries forward. A frequency count is a quick way to see which values dominate a report:

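bash
# Collapse a noisy report to unique candidates, most frequent first.
# Feature files are tab-separated (offset, feature, context) with '#' comment
# lines, so count the feature column rather than whole lines.
grep -v '^#' be_out/pii.txt | cut -f2 | sort | uniq -c | sort -rn | head -50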

A reusable stop list of organisational phone formats or test card numbers can cut review volume dramatically across a series of similar accessions.
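A stop list can be applied with a few lines of Python. The sketch below assumes tab-separated report lines in the order offset, feature, context, with comment lines starting with `#`; the function names and the sample values are hypothetical, and the excerpt is made up for illustration rather than real scanner output.

```python
from collections import Counter
from typing import Iterable


def features_from_report(lines: Iterable[str]) -> list[str]:
    """Extract the feature column, skipping blank and '#' comment lines."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) >= 2:  # assumed layout: offset, feature, context
            features.append(parts[1])
    return features


def review_queue(lines: Iterable[str], stop_list: set[str], top: int = 50):
    """Count features not on the stop list, most frequent first."""
    counts = Counter(f for f in features_from_report(lines) if f not in stop_list)
    return counts.most_common(top)


# Hypothetical excerpt for illustration only.
sample_report = [
    "# hypothetical excerpt, not real scanner output",
    "1024\t555-0100\tcall 555-0100 for",
    "2048\t555-0100\tfax 555-0100 or",
    "4096\t555-0199\tmobile 555-0199 now",
]
stop_list = {"555-0100"}  # e.g. a switchboard number already judged non-sensitive

print(review_queue(sample_report, stop_list))
```

To run it against a real report, pass an open file handle as `lines`. Keeping the stop list in a plain text file under version control, with a note on why each value was added, lets the next accession inherit both the tuning and its rationale.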

Who makes the final call?

The archivist does, every time. Detection narrows thousands of items to a reviewable few hundred, but the decision to restrict, redact or release weighs the flag against the donor agreement, data-protection law and the public interest. Document the reasoning, because an access decision you cannot explain is one you cannot defend.
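One lightweight way to document that reasoning is a structured decision log, one record per reviewed flag. The field names, the allowed decision values and the identifiers below are assumptions for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_DECISIONS = {"restrict", "redact", "release"}


@dataclass
class AccessDecision:
    """Hypothetical record of one access decision."""
    item_id: str      # local identifier for the file or message
    flag_type: str    # what the scanner flagged, e.g. "telephone"
    decision: str     # one of ALLOWED_DECISIONS
    basis: str        # the reasoning: agreement clause, legal ground, public interest
    reviewer: str
    decided_on: str   # ISO 8601 date


def to_log_line(record: AccessDecision) -> str:
    """Serialise a decision as one JSON line, refusing undocumented ones."""
    if record.decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {record.decision!r}")
    if not record.basis.strip():
        raise ValueError("a decision needs a recorded basis")
    return json.dumps(asdict(record), sort_keys=True)


record = AccessDecision(
    item_id="ACC-0001/msg-0042",  # made-up identifier
    flag_type="pii",
    decision="redact",
    basis="Names a living third party alongside a medical detail",
    reviewer="archivist-on-duty",
    decided_on="2024-01-15",
)
print(to_log_line(record))
```

Appending these lines to a file kept with the accession record gives a later reviewer both the decision and the reasoning behind it.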

Key Takeaways

  • Detect when material is unvetted, personal, and too large to read manually.
  • Skip or lighten detection for public, fully-open, or trivially small deposits.
  • Pattern scanners catch structured identifiers; NER adds names and context.
  • Every tool produces a worklist with false positives, never a verdict.
  • Tune with stop lists and reuse the tuning across similar collections.
  • A human archivist makes and documents every access decision.

Frequently Asked Questions

When is automated sensitive-data detection worth running?

Run it whenever a collection contains correspondence, administrative records or personal devices and you plan to provide access. The larger and less curated the material, the more an automated first pass saves over manual review.

When can I skip detection entirely?

You can reasonably skip it for material that is already public, fully open by donor agreement, or so small that one archivist can read every item faster than configuring a tool. Even then, a quick scan is cheap insurance.

What counts as sensitive information here?

Sensitive information here means personal identifiers such as government ID numbers, financial and health data, credentials, and information about living individuals or third parties who did not consent. Context matters: a name alone may be fine, while a name plus a medical detail may not be.

Does automated detection make the access decision for me?

No. Tools like bulk_extractor and Presidio surface candidates with false positives and misses. A human archivist judges each flag against the donor agreement and data-protection law before restricting or redacting.

What does detection cost in practice?

Mostly staff time reviewing flags, plus compute for large images. The expensive part is human adjudication of false positives, so tuning the scanners to your material pays for itself quickly.

How do I handle the false positives a scanner produces?

Treat the output as a worklist, sort by pattern type and frequency, suppress known-good patterns with stop lists, and sample-review the rest. Record your decisions so the next collection inherits the tuning.