Best Practices for Processing Email Archives

To process email archives well, normalise the source store into a stable format such as MBOX or EAXS, deduplicate by message identity, extract and identify attachments, then triage for sensitive content before you arrange and describe the result. The aim is a consistent, documented and defensible outcome that you can apply identically across an entire collection. Email is deceptively hard because a single account mixes formats, duplicates, threads and personal data, so a written checklist matters more here than almost anywhere else in born-digital work.

Why is email harder than ordinary file collections?

A mailbox is a database, not a folder of documents. Messages reference each other through threads, the same email appears many times via CC and forwarding, attachments hide inside MIME encoding, and the whole thing often arrives as a proprietary PST or a vendor export. Process it like a file dump and you get duplicates, broken attachments and undiscovered personal data. Treat it as a structured corpus and you can deduplicate, link and review systematically.

What is the standard processing pipeline?

Work in a fixed order so every account is handled the same way:

1. Capture the original store as a bit-level master and hash it.
2. Convert to a preservation format (MBOX or EAXS).
3. Deduplicate by Message-ID or content hash.
4. Extract attachments and run format identification and a virus scan.
5. Triage for sensitive content and apply access decisions.
6. Arrange, describe and package with a manifest.

Tools such as ePADD (Stanford) and DArcMail (Smithsonian Institution Archives) cover steps two through five with auditable logs, which is what keeps the work defensible.
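The first step, hashing the master, can be as small as a checksum sidecar written next to the original store. The following is a minimal sketch, assuming `sha256sum` from GNU coreutils is available; the function name `hash_master` and the `.sha256` sidecar naming are illustrative choices, not part of ePADD or DArcMail.

```shell
# Hypothetical helper: write a SHA-256 sidecar next to the original store,
# then re-read the file and check it against that sidecar.
hash_master() {
    store_dir=$(dirname "$1")
    store_name=$(basename "$1")
    (
        # Work inside the directory so the sidecar records a relative name
        # and stays valid if the directory is moved.
        cd "$store_dir" || exit 1
        sha256sum "./$store_name" > "./$store_name.sha256" &&
            sha256sum -c "./$store_name.sha256" > /dev/null
    )
}
```

Calling `hash_master path/to/store.pst` (path hypothetical) returns non-zero if the checksum cannot be written or verified, so it can gate the rest of the pipeline.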

How do you convert a PST without losing structure?

Convert on ingest, but keep the original. A reliable open route uses readpst to produce MBOX, after which you import into ePADD or DArcMail for structured handling:

bash

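# readpst expects the output directory to exist already
mkdir -p ./mbox_out
# Convert an Outlook store to one MBOX per folder, recursively
readpst -r -o ./mbox_out -w Smith_Outlook.pst
# Quick sanity count: -r writes a file named "mbox" inside each folder directory
find ./mbox_out -type f -name mbox -exec grep -c "^From " {} +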

Record the tool and version in your processing note. If a conversion drops calendar items or contacts, the bit-level master lets you go back.

How do you deduplicate and thread the messages?

Collapse duplicates by hashing normalised headers plus body, or trust the Message-ID when it is present and unique. Threading then reconstructs conversations from the In-Reply-To and References headers. This matters for review because replies quote earlier messages, so sensitive text in one message usually recurs further down its thread and a restriction has to cover every copy.
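Before trusting a tool's duplicate count, it can help to see how many Message-IDs actually repeat in a converted file. This is a report-only sketch, assuming an mbox in which body lines beginning with "From " have been escaped and Message-ID headers are not folded across lines; `duplicate_message_ids` is a hypothetical name, and nothing is deleted.

```shell
# Hypothetical helper: print each Message-ID that occurs more than once in an
# mbox file, preceded by its count. Only header blocks are inspected, so IDs
# quoted inside message bodies are ignored.
duplicate_message_ids() {
    awk '
        /^From / { in_header = 1; next }        # mbox separator starts a message
        in_header && /^\r?$/ { in_header = 0 }  # blank line ends the header block
        in_header && tolower($0) ~ /^message-id:/ {
            id = $0
            sub(/^[^:]*:[ \t]*/, "", id)        # drop the header name
            sub(/[ \t\r]+$/, "", id)            # drop trailing whitespace and CR
            if (id != "") seen[id]++
        }
        END { for (id in seen) if (seen[id] > 1) print seen[id] "\t" id }
    ' "$1" | sort -k 2
}
```

An empty report means every Message-ID is unique. A repeated ID is only a candidate duplicate, so compare the normalised content before collapsing anything.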

Concern	Open tool	What it gives you
Conversion	readpst, libpff	MBOX / EML masters
Dedup + threading	ePADD, DArcMail	Collapsed, threaded corpus
Sensitive triage	ePADD lexicons, bulk_extractor	Flagged messages and terms
Packaging	bagit.py	Validated bag with manifest
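To make the threading step concrete, the sketch below lists each message with the parent named in its In-Reply-To header, which is the raw material a threading tool works from. It assumes unfolded headers and an In-Reply-To value that holds a single Message-ID; `thread_edges` is a hypothetical helper, and ePADD or DArcMail may thread differently (for example by also using References).

```shell
# Hypothetical helper: print "message-id <TAB> parent-id" for every message in
# an mbox file, with "-" as the parent when there is no In-Reply-To header.
thread_edges() {
    awk '
        function clean(s) {
            sub(/^[^:]*:[ \t]*/, "", s)   # drop the header name
            sub(/[ \t\r]+$/, "", s)       # drop trailing whitespace and CR
            return s
        }
        function emit() {
            if (id != "") print id "\t" (parent != "" ? parent : "-")
            id = ""; parent = ""
        }
        /^From / { emit(); in_header = 1; next }
        in_header && /^\r?$/ { in_header = 0 }
        in_header && tolower($0) ~ /^message-id:/  { id = clean($0) }
        in_header && tolower($0) ~ /^in-reply-to:/ { parent = clean($0) }
        END { emit() }
    ' "$1"
}
```

Messages whose parent is "-" are thread roots. A parent ID that never appears in the first column points to a message missing from the collection.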

How do you find sensitive information at scale?

You cannot read every message, so triage instead. Run a pattern scan for identifiers and a named-entity pass to surface people, places and organisations, then review by facet:

bash

# Surface candidate identifiers (cards, SSNs, emails) across the converted store
# (-R scans every file under a directory; be_out must not exist yet)
bulk_extractor -o be_out -R ./mbox_out
# Review the most actionable feature reports first
ls be_out/*.txt   # pii.txt, ccn.txt, email.txt, telephone.txt

In ePADD, build a lexicon of donor-specific terms and browse by entity, date and sender so a human makes every restriction decision while the machine does the searching.
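A quick tally of findings per feature file helps decide which report to open first. This is a rough sketch, assuming bulk_extractor's tab-separated feature-file layout (offset, feature, context) with header lines that begin with "#"; `feature_counts` is a hypothetical helper, and it skips files ending in `_histogram.txt` on the assumption that those are summaries rather than individual findings.

```shell
# Hypothetical helper: count findings in each feature file of a bulk_extractor
# output directory and list the busiest files first.
feature_counts() {
    for f in "$1"/*.txt; do
        [ -f "$f" ] || continue
        case "$f" in *_histogram.txt) continue ;; esac
        # A finding is a tab-separated line that is not a "#" header comment.
        n=$(awk -F '\t' 'NF >= 2 && !/^#/ { n++ } END { print n + 0 }' "$f")
        printf '%s\t%s\n' "$n" "$(basename "$f")"
    done | sort -rn
}
```

The counts show only where the scanner matched a pattern. False positives are common, so they prioritise human review rather than replace it.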

What should the processing checklist contain?

Original store hashed and stored as a master.
Conversion tool and version recorded.
Duplicate and thread counts before and after.
Attachment count, formats identified, virus scan clean.
Sensitive-term and entity review completed and signed.
Access and restriction decisions logged against the donor agreement.
Final bag validated.

A signed checklist per account is what makes a 200-mailbox collection consistent rather than 200 ad-hoc judgement calls.
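The final fixity item on the checklist can be illustrated without any special tooling. The sketch below assumes the packaged account keeps its payload under `data/` with a `manifest-sha256.txt` beside it, mirroring the layout of a BagIt bag; `write_manifest` and `verify_manifest` are hypothetical helpers. It does not create a complete bag or detect files added after the manifest was written, so it is not a substitute for validating with bagit.py.

```shell
# Hypothetical helpers: write and re-check a SHA-256 manifest for one
# packaged account directory.
write_manifest() {
    (
        cd "$1" || exit 1
        # One "checksum  path" line per payload file, sorted by path.
        find data -type f -exec sha256sum {} + | sort -k 2 > manifest-sha256.txt
    )
}

verify_manifest() {
    (
        cd "$1" || exit 1
        # Non-zero exit if any listed file is missing or has changed.
        sha256sum -c manifest-sha256.txt > /dev/null 2>&1
    )
}
```

Recording the exit status of `verify_manifest` on the checklist gives each account the same pass or fail evidence.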

Key Takeaways

Treat email as a structured corpus, not a folder of files.
Normalise to MBOX or EAXS on ingest and keep the original store.
Deduplicate and thread before review so quoted text is not missed.
Extract attachments to identify, scan and preserve them as discrete objects.
Triage sensitive content by facet; let a human set every restriction.
Drive the whole process from a signed per-account checklist.

Frequently Asked Questions

What format should I normalise email into for preservation?

Use EML or MBOX as the preservation master and, increasingly, EAXS (the XML schema behind DArcMail and ePADD) for structured access. PST and other proprietary stores should be converted on ingest because they are fragile and tool-dependent.

Do I keep attachments separate or inline?

Keep them linked to their parent message but extract a copy so you can run format identification and virus scanning on each attachment independently. EAXS preserves the relationship while exposing attachments as discrete files.

How do I deduplicate a messy inbox export?

Hash each message's normalised headers and body, or use the Message-ID, to collapse the duplicates that arise from CC, forwarding and overlapping backups. ePADD and DArcMail both perform this during import.

How do I find sensitive content across thousands of messages?

Run a pattern scan for personal identifiers and a named-entity pass, then sample-review by sender, date range and flagged term rather than reading every message. ePADD's browse and lexicon features are built for exactly this triage.

Can I preserve the original PST file as well?

Yes, keep the original store as a bit-level master alongside the normalised version. The normalised copy is for access and longevity; the original protects you if a conversion ever proves incomplete.

Who decides what gets redacted or restricted?

Appraisal and access decisions belong with the archivist and, where relevant, the donor agreement and data-protection rules, not the conversion tool. The tooling surfaces candidates; a human sets the policy.
