How to redact born-digital documents

To redact a born-digital document properly, you must remove the data, not just cover it: work on a copy, use a tool with a true redaction function that deletes the content from the text layer and strips hidden metadata, flatten and re-derive the file, then verify by extracting the text and searching for the sensitive strings. A black rectangle drawn over a PDF is the single most common — and most damaging — mistake, because the words underneath remain fully recoverable. This guide gives you a defensible, repeatable process.

Why a black box is not redaction

When you draw a filled rectangle over text in most PDF or image editors, you add a visual layer on top of unchanged content. The characters still exist in the file's text stream, so anyone can copy-paste them, run pdftotext, or simply delete the rectangle. Real redaction destroys the underlying data. Treat the visual mark as a request to redact; the tool must then act on it by removing the content.

Step 1 — work on a copy, never the master

Keep the unredacted preservation master dark and access-controlled. Derive a working copy and redact that. This keeps the process one-way for the reader (they only ever see the redacted version) while remaining reversible for you, because the original survives under restriction.

```bash
cp masters/case-0042.pdf work/case-0042-redact.pdf
```
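A slightly fuller version of the same step is sketched below. It also records the master's checksum, so you can later show that the original was never altered. The function name `make_working_copy`, the `-redact.pdf` suffix and the `.master.sha256` sidecar file are illustrative assumptions, not an established convention.

```shell
# Sketch: derive a working copy and record the master's checksum.
# make_working_copy and the file naming are hypothetical, chosen for this example.
make_working_copy() {
    master=$1
    workdir=$2
    base=$(basename "$master" .pdf)
    copy="$workdir/$base-redact.pdf"

    mkdir -p "$workdir" || return 1
    # Record the master's checksum before anything else touches the case.
    sha256sum "$master" > "$workdir/$base.master.sha256" || return 1
    # Copy the master; all redaction work happens on the copy only.
    cp "$master" "$copy" || return 1
    chmod u+w "$copy"
    # Print the path of the working copy for the next step.
    printf '%s\n' "$copy"
}

# Usage:
#   make_working_copy masters/case-0042.pdf work
#   sha256sum -c work/case-0042.master.sha256   # run from the same directory
```

Re-running the checksum check after the redacted file is released is a cheap way to confirm the master is still byte-identical.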

Step 2 — apply true redaction to the body

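For PDFs, use a tool that deletes the marked content rather than overlaying it: Acrobat's Redact tool, or the open-source pdf-redact-tools, which converts each page to an image so that no text layer survives. qpdf is not a redaction tool: its --flatten-annotations option merges annotations into the page content and leaves any text beneath them intact, so use it after true redaction, never instead of it. The principle: mark, apply (delete), then flatten so nothing original remains beneath.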

```bash
# After true redaction has been applied, flatten any leftover annotations
# into the page content (flattening alone does not remove text)
qpdf --flatten-annotations=all work/case-0042-redact.pdf work/case-0042-flat.pdf

# Confirm no sensitive text survived in the text layer
pdftotext work/case-0042-flat.pdf - | grep -i "national insurance"
```

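If that grep prints a match, the redaction failed and you must redo it from the working copy. No output (exit status 1) means the phrase was not found in the extracted text.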
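A single grep covers one phrase, while a real case usually has many. The sketch below assumes you keep the sensitive strings in a hypothetical `terms.txt`, one per line with no blank lines, and pipe the extracted text into a helper named `find_leaks` (an illustrative name). Terms are matched literally and case-insensitively.

```shell
# find_leaks: read extracted text on stdin and report every line that
# contains a term from the list file given as $1.
# Returns 0 if nothing was found, 1 on a leak, 2 if grep itself failed.
find_leaks() {
    terms_file=$1
    # -F literal strings, -i ignore case, -n show line numbers, -f read terms
    grep -n -i -F -f "$terms_file"
    status=$?
    case $status in
        0) echo "LEAK: sensitive text is still extractable" >&2; return 1 ;;
        1) return 0 ;;
        *) echo "grep failed; treat the file as unverified" >&2; return 2 ;;
    esac
}

# Usage:
#   pdftotext work/case-0042-flat.pdf - | find_leaks terms.txt
```

The check is line-based, so a term that the extractor splits across two lines will not match; short, distinctive fragments in the term list reduce that risk.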

How do you handle metadata and hidden data?

Many real-world leaks are not in the visible text: they hide in document properties, tracked changes, comments, author fields and image EXIF. Strip them explicitly.

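```bash
# Remove all metadata from the redacted PDF and any images
exiftool -all= -overwrite_original work/case-0042-flat.pdf
exiftool -all= -overwrite_original work/images/*.jpg

# ExifTool's PDF edits are reversible until the file is rewritten, so re-derive it
qpdf --linearize work/case-0042-flat.pdf work/case-0042-release.pdf
```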

For Office documents, also accept or reject every tracked change and remove the hidden content that the Document Inspector reports before converting to a delivery format.
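A .docx file is a zip archive of XML parts, so you can confirm that tracked changes are gone without opening Word. The sketch below is a rough check, not a full inspector: it looks for the WordprocessingML revision elements `w:ins`, `w:del`, `w:moveFrom` and `w:moveTo` in the main document part. The helper name `has_tracked_changes` and the file name `report.docx` are hypothetical.

```shell
# has_tracked_changes: read WordprocessingML on stdin.
# Exit 0 if tracked insertions, deletions or moves are still present,
# exit 1 if none were found.
has_tracked_changes() {
    # The trailing [ />] keeps <w:delText> and <w:instrText> from matching.
    grep -q -E '<w:(ins|del|moveFrom|moveTo)[ />]'
}

# Usage:
#   unzip -p report.docx word/document.xml | has_tracked_changes \
#       && echo "tracked changes still present"
#
#   # List the parts that carry comments and document properties:
#   unzip -l report.docx | grep -E 'comments\.xml|docProps/'
```

Headers, footers and footnotes live in separate parts of the archive, so a thorough check would repeat the test for each of them.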

What are the common pitfalls?

| Pitfall | What goes wrong | Fix |
| --- | --- | --- |
| Black box over text | Text layer still intact | Use true redaction + flatten |
| Forgotten metadata | EXIF and document properties survive | `exiftool -all=`, then `qpdf --linearize` for PDFs |
| Tracked changes left in | Word stores the deleted text | Accept/remove, then convert |
| Redacting the master | No recoverable original | Redact a copy |
| Image of text not OCR-checked | Hidden OCR text layer survives | Re-OCR or strip the layer |
| Filename reveals data | A name such as `Smith_HIV_2003.pdf` discloses the subject | Rename the derivative |
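For the filename pitfall, one option is to name the released derivative after a fragment of its own checksum, so the name carries no information about the subject. This is only a sketch of that idea; `neutral_name`, the `redacted-` prefix and the 12-character length are arbitrary choices made for the example.

```shell
# neutral_name: print an opaque file name for a redacted derivative,
# built from the first 12 hex digits of its SHA-256 checksum.
neutral_name() {
    file=$1
    prefix=${2:-redacted}
    digest=$(sha256sum "$file" | cut -c1-12)
    # An empty digest means sha256sum could not read the file.
    [ -n "$digest" ] || return 1
    printf '%s-%s.pdf\n' "$prefix" "$digest"
}

# Usage:
#   cp work/case-0042-flat.pdf "release/$(neutral_name work/case-0042-flat.pdf)"
```

Keep the mapping between the opaque name and the case identifier in your restricted records, not alongside the released file.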

How do you verify it actually worked?

Never trust the visual result. Run three checks on the redacted file: extract the text (pdftotext) and search for every sensitive string; run a sensitivity scan such as bulk_extractor to catch patterns you missed; and copy-paste from the rendered page to confirm nothing is selectable underneath a mark. Only when all three are clean is the file safe to release. Document that you ran them — verification is part of the record.
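The bulk_extractor check can be scripted along the following lines. bulk_extractor writes one feature file per pattern type (for example `email.txt`) into the directory given with `-o`. The sketch assumes that header lines in those files start with `#`, possibly preceded by a byte-order mark, and that every other non-empty line is a hit. The helper name `list_hits` and the report path are hypothetical.

```shell
# list_hits: print each feature file in a bulk_extractor report directory
# that contains at least one hit, followed by its hit count.
# Returns 0 if every feature file is clean, 1 otherwise.
list_hits() {
    report_dir=$1
    found=0
    for f in "$report_dir"/*.txt; do
        [ -f "$f" ] || continue
        # Count lines that are neither comments nor blank; strip a leading BOM.
        n=$(LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") }
                          /^#/ || /^$/ { next }
                          { c++ }
                          END { print c + 0 }' "$f")
        if [ "$n" -gt 0 ]; then
            printf '%s\t%s\n' "$(basename "$f")" "$n"
            found=1
        fi
    done
    [ "$found" -eq 0 ]
}

# Usage:
#   bulk_extractor -o work/be-report work/case-0042-flat.pdf
#   list_hits work/be-report || echo "review the feature files listed above"
```

Saving the output of this step with the date and the file's checksum gives you the verification record the process calls for.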

Can you automate redaction at scale?

Partly, and carefully. Pattern matchers (regular expressions or tools like bulk_extractor) can auto-flag emails, payment-card numbers and ID formats across thousands of files, which is invaluable triage. But a human must review context before applying — automated matching both over-redacts (a date that looks like an ID) and under-redacts (a name with no pattern). Use automation to find candidates and a person to decide.
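As a rough illustration of the triage step, the sketch below flags strings that look like email addresses or 13 to 16 digit card numbers in text files already extracted from the working copies. Both patterns are deliberately loose assumptions that will over-match; the output is a candidate list for a person to review, not a redaction decision. `flag_candidates` is a hypothetical helper name, and the `-H` and `-o` options assume GNU or BSD grep.

```shell
# flag_candidates: scan the given text files and print file:line:match for
# every string that looks like an email address or a 13-16 digit number
# (digits may be separated by single spaces or hyphens).
flag_candidates() {
    grep -H -n -o -E \
        -e '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' \
        -e '([0-9][ -]?){12,15}[0-9]' \
        "$@"
}

# Usage:
#   pdftotext work/case-0042-redact.pdf work/case-0042.txt
#   flag_candidates work/*.txt > work/candidates.txt   # a person reviews this list
```

Names, addresses and other context-dependent details have no fixed pattern, so they never appear in this list and still need a human read-through.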

Key Takeaways

Redaction removes data; a black box only hides it and is fully reversible by anyone.
Always redact a copy and keep the unredacted master dark and access-controlled.
Use a true redaction function, then flatten and re-derive so nothing original survives.
Strip metadata, tracked changes, comments and EXIF: many leaks hide there, not in the body.
Verify by extracting text and searching, scanning with bulk_extractor, and testing copy-paste.
Automate the finding of sensitive data, but keep a human in the loop for the decision.

Frequently Asked Questions

Why isn't drawing a black box over text real redaction?

A black box drawn over a PDF or image only hides the text visually; the underlying characters stay in the file and can be copied, searched, or revealed by removing the layer. True redaction removes the data itself, not just its appearance.

How do I properly redact a PDF?

Use a tool with a true redaction function, such as Adobe Acrobat's Redact or pdf-redact-tools, that deletes the marked content and strips it from the text layer, then flatten and re-derive the file (for example with qpdf) so no original data remains underneath. qpdf on its own does not redact anything.

What about metadata and hidden data?

Redaction must include embedded metadata, tracked changes, comments, document properties, and EXIF in images. Strip them with ExifTool or a metadata cleaner; many leaks come from hidden fields, not the visible body text.

Should I redact the master or a copy?

Always a copy. Keep the unredacted master dark and restricted, and publish only the redacted derivative. Redaction is one-way for the reader but reversible for you, because you retain the original under access control.

How do I verify a redaction actually worked?

Extract the text from the redacted file (for example with pdftotext) and search for the sensitive strings; run a sensitivity scan such as bulk_extractor; and copy-paste from the rendered file. If anything reappears, the redaction failed.

Can redaction be automated at scale?

Partly. Pattern-based tools can auto-flag emails, card numbers and IDs, but a human must review context before applying redactions, because automated matching both misses and over-redacts.
