To redact a born-digital document properly, you must remove the data, not just cover it: work on a copy, use a tool with a true redaction function that deletes the content from the text layer and strips hidden metadata, flatten and re-derive the file, then verify by extracting the text and searching for the sensitive strings. A black rectangle drawn over a PDF is the single most common — and most damaging — mistake, because the words underneath remain fully recoverable. This guide gives you a defensible, repeatable process.
Why a black box is not redaction
When you draw a filled rectangle over text in most PDF or image editors, you add a visual layer on top of unchanged content. The characters still exist in the file's text stream, so anyone can copy-paste them, run pdftotext, or simply delete the rectangle. Real redaction destroys the underlying data. Treat the visual mark as a request to redact; the tool must then act on it by removing the content.
Step 1 — work on a copy, never the master
Keep the unredacted preservation master dark and access-controlled. Derive a working copy and redact that. This keeps the process one-way for the reader (they only ever see the redacted version) while remaining reversible for you, because the original survives under restriction.
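Before deriving the copy, it can help to lock the master and record its checksum, so you can later show that the original was never altered. The following is a minimal sketch, not a full access-control setup: the function names `lock_master` and `verify_master` and the `.sha256` sidecar file are illustrative assumptions, and it relies on GNU coreutils `sha256sum`.

```bash
# Record a checksum next to the master, then make both files owner-read-only.
# Assumption: the checksum lives in a sidecar file named <master>.sha256.
lock_master() {
  local master="$1"
  [ -f "$master" ] || return 1
  sha256sum "$master" > "$master.sha256" || return 1
  chmod 400 "$master" "$master.sha256"
}

# Succeeds only if the master still matches the recorded checksum.
verify_master() {
  sha256sum --check --quiet "$1.sha256"
}
```

For example, `lock_master masters/case-0042.pdf` would run once when the master is accessioned, and `verify_master masters/case-0042.pdf` at any later audit. File permissions are only one layer; restricting who can reach the `masters/` directory at all is a separate decision.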
```bash
cp masters/case-0042.pdf work/case-0042-redact.pdf
```
Step 2 — apply true redaction to the body
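For PDFs, use a tool whose redaction function deletes the marked content instead of overlaying it, such as Acrobat's Redact tool or the open-source pdf-redact-tools, which rebuilds each page as an image. The principle: mark, apply (delete), then flatten so nothing original remains beneath. A qpdf flatten step is not redaction on its own: it merges annotations into the page content but leaves any text beneath them in place, so run it only after the content has been removed.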
```bash
# Flatten and re-derive so any residual interactive content is dropped
qpdf --flatten-annotations=all work/case-0042-redact.pdf work/case-0042-flat.pdf
# Confirm no sensitive text survived in the text layer
pdftotext work/case-0042-flat.pdf - | grep -i "national insurance"
```
If that grep returns a match, the redaction failed and you must redo it.
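A single grep covers one string, but a real file usually has several. As a hedged sketch, the same check can be run against a list of terms. The function name `leaked_terms` and the one-term-per-line file `terms.txt` are assumptions for illustration. Whitespace is collapsed first, because extracted text can break a phrase across lines.

```bash
# Read extracted text on stdin and report every term from the list that still appears.
# Returns 1 if any term is found, 0 if the text is clean.
leaked_terms() {
  local terms_file="$1" text term found=0
  # Collapse runs of whitespace so a phrase split across lines still matches
  text=$(tr -s '[:space:]' ' ')
  while IFS= read -r term || [ -n "$term" ]; do
    [ -z "$term" ] && continue
    # -F: literal match, -i: ignore case, --: the term may start with a dash
    if printf '%s\n' "$text" | grep -q -i -F -- "$term"; then
      printf 'LEAK: %s\n' "$term"
      found=1
    fi
  done < "$terms_file"
  return "$found"
}
```

Usage would look like `pdftotext work/case-0042-flat.pdf - | leaked_terms terms.txt`. Keep the terms file under the same restriction as the master, since it lists exactly what you are trying to hide.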
How do you handle metadata and hidden data?
Many real-world leaks are not in the visible text: they hide in document properties, tracked changes, comments, author fields and image EXIF. Strip them explicitly.
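```bash
# Remove all metadata from the redacted PDF and any images
exiftool -all= -overwrite_original work/case-0042-flat.pdf
exiftool -all= -overwrite_original work/images/*.jpg
# ExifTool's PDF edits are reversible until the file is rewritten, so linearize with qpdf
qpdf --linearize work/case-0042-flat.pdf work/case-0042-release.pdf
```
For Office documents, also remove tracked changes and the hidden content that the Document Inspector reports before converting to a delivery format.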
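To check that a Word file really has no tracked changes left before you convert it, one rough approach is to count the revision markers in its XML. This sketch assumes a `.docx` (a ZIP archive whose body is `word/document.xml`) and looks only for the WordprocessingML elements `w:ins`, `w:del`, `w:moveFrom` and `w:moveTo`. The function name `count_tracked_changes` is illustrative, and comments stored in `word/comments.xml` need a separate look.

```bash
# Read document.xml on stdin and print how many tracked-change markers it contains.
count_tracked_changes() {
  # grep -o prints one match per line, so wc -l counts matches even when the
  # XML is a single long line. The trailing class stops <w:delText> and
  # <w:insideH> from being counted as revisions.
  grep -o -E '<w:(ins|del|moveFrom|moveTo)[ >/]' | wc -l | tr -d ' '
}
```

With `unzip` available, `unzip -p report.docx word/document.xml | count_tracked_changes` should print `0` for a clean file (`report.docx` is a placeholder name). Anything higher means revisions are still stored in the document.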
What are the common pitfalls?
| Pitfall | Why it leaks | Fix |
|---|---|---|
| Black box over text | Text layer intact | Use true redaction + flatten |
| Forgotten metadata | EXIF / properties | exiftool -all= |
| Tracked changes left in | Word stores deletions | Accept/remove, then convert |
| Redacting the master | No recoverable original | Redact a copy |
| Image of text not OCR-checked | Hidden OCR layer | Re-OCR or strip the layer |
| Filename reveals data | Smith_HIV_2003.pdf | Rename the derivative |
How do you verify it actually worked?
Never trust the visual result. Run three checks on the redacted file: extract the text (pdftotext) and search for every sensitive string; run a sensitivity scan such as bulk_extractor to catch patterns you missed; and copy-paste from the rendered page to confirm nothing is selectable underneath a mark. Only when all three are clean is the file safe to release. Document that you ran them — verification is part of the record.
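Because verification is part of the record, it is worth logging each check in a consistent form. The sketch below appends one tab-separated line per check: UTC timestamp, file, SHA-256 of the file as checked, check name, and result. The function name `record_check`, the field layout and the default log name `verification.log` are assumptions, and it relies on GNU coreutils `sha256sum`.

```bash
# Append one line describing a verification check to a log file.
# Usage: record_check FILE CHECK_NAME RESULT [LOG]
record_check() {
  local file="$1" check="$2" result="$3" log="${4:-verification.log}"
  local sum
  [ -f "$file" ] || return 1
  # The checksum ties the log entry to the exact bytes that were checked
  sum=$(sha256sum "$file" | cut -d' ' -f1)
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$file" "$sum" "$check" "$result" >> "$log"
}
```

After each of the three checks you might call, for example, `record_check work/case-0042-flat.pdf pdftotext-grep PASS`. If the file is changed after it was checked, its checksum no longer matches the log, which signals that the checks must be rerun.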
Can you automate redaction at scale?
Partly, and carefully. Pattern matchers (regular expressions or tools like bulk_extractor) can auto-flag emails, payment-card numbers and ID formats across thousands of files, which is invaluable triage. But a human must review context before applying — automated matching both over-redacts (a date that looks like an ID) and under-redacts (a name with no pattern). Use automation to find candidates and a person to decide.
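As an illustration of the triage half only, the sketch below flags candidates in extracted text and leaves the decision to a reviewer. The function name `flag_candidates` and all three patterns are simplified assumptions: the card pattern covers only 16 digits in groups of four, and the ID pattern matches only the general shape of a UK National Insurance number.

```bash
# Print candidate sensitive strings found in text read from stdin.
# The patterns are deliberately simple and are not validators.
flag_candidates() {
  local text
  text=$(cat)
  # Email addresses
  printf '%s\n' "$text" \
    | grep -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
    | awk '{ print "EMAIL: " $0 }'
  # 16-digit card numbers, optionally separated by spaces or dashes
  printf '%s\n' "$text" \
    | grep -o -w -E '([0-9]{4}[ -]?){3}[0-9]{4}' \
    | awk '{ print "CARD: " $0 }'
  # National Insurance number shape: two letters, six digits, suffix A to D
  printf '%s\n' "$text" \
    | grep -o -w -E '[A-Za-z]{2} ?([0-9]{2} ?){3}[A-Da-d]' \
    | awk '{ print "NINO: " $0 }'
  # grep exits non-zero when nothing matches; that is not an error here
  return 0
}
```

A run such as `pdftotext work/case-0042-flat.pdf - | flag_candidates > candidates.txt` produces a list for a person to review. An empty list shows only that these three patterns found nothing, not that the file is clean.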
Key Takeaways
- Redaction removes data; a black box only hides it and is fully reversible by anyone.
- Always redact a copy and keep the unredacted master dark and access-controlled.
- Use a true redaction function, then flatten and re-derive so nothing original survives.
- Strip metadata, tracked changes, comments and EXIF: many leaks hide there, not in the body.
- Verify by extracting text and searching, scanning with bulk_extractor, and testing copy-paste.
- Automate the finding of sensitive data, but keep a human in the loop for the decision.
Frequently Asked Questions
Why isn't drawing a black box over text real redaction?
A black box drawn over a PDF or image only hides the text visually; the underlying characters stay in the file and can be copied, searched, or revealed by removing the layer. True redaction removes the data itself, not just its appearance.
How do I properly redact a PDF?
Use a tool with a true redaction function, such as Adobe Acrobat's Redact or pdf-redact-tools, that deletes the marked content and strips it from the text layer. Then flatten and re-derive the file (for example with qpdf) so no original data remains underneath. Flattening with qpdf alone does not remove text.
What about metadata and hidden data?
Redaction must include embedded metadata, tracked changes, comments, document properties, and EXIF in images. Strip them with ExifTool or a metadata cleaner; many leaks come from hidden fields, not the visible body text.
Should I redact the master or a copy?
Always a copy. Keep the unredacted master dark and restricted, and publish only the redacted derivative. Redaction is one-way for the reader but reversible for you, because you retain the original under access control.
How do I verify a redaction actually worked?
Extract the text from the redacted file (for example with pdftotext) and search for the sensitive strings; run a sensitivity scan such as bulk_extractor; and copy-paste from the rendered file. If anything reappears, the redaction failed.
Can redaction be automated at scale?
Partly. Pattern-based tools can auto-flag emails, card numbers and IDs, but a human must review context before applying redactions, because automated matching both misses and over-redacts.