A pre-deposit curation checklist is a fixed set of checks every dataset must pass before it goes into a repository: completeness, documentation, open formats, declared rights, sensitive-data handling and verified integrity. Running the same checklist on every deposit is what keeps quality consistent and your decisions defensible months later. The goal is simple — nobody, including future-you, should have to guess whether a deposit was checked.
Why a written checklist beats good intentions
Curation done from memory drifts: one dataset gets a data dictionary, the next does not, and you cannot say why. A written checklist makes the process auditable and repeatable across a whole project. It also separates mechanical checks (automatable) from judgement checks (human), so you spend attention where it matters. Treat it as a gate, not a suggestion.
The core checklist
```text
COMPLETENESS
[ ] All expected files present; no placeholders or temp files
[ ] File and folder names follow a documented convention
DOCUMENTATION
[ ] README explains contents, structure and how to use
[ ] Data dictionary defines every field and value domain
[ ] Methods / provenance recorded
FORMATS
[ ] Open preservation formats (CSV, UTF-8, PDF/A, TIFF)
[ ] Originals kept if conversion is lossy
RIGHTS & ETHICS
[ ] Licence declared (CC BY / CC0 / ODbL)
[ ] Sensitive/personal data anonymised or access-controlled
INTEGRITY
[ ] SHA-256 checksums generated and stored
[ ] Files validated against their stated format
```
How do I check file integrity?
Generate a checksum manifest so corruption in transfer is detectable. SHA-256 is the sensible default:
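```bash
# Linux/macOS (on macOS without sha256sum, use `shasum -a 256` instead)
# Exclude the manifest itself, which the shell creates before find runs
find . -type f ! -name manifest.sha256 -exec sha256sum {} \; > manifest.sha256
# verify later
sha256sum -c manifest.sha256
```
```powershell
# Windows PowerShell (skip the manifest so it is not hashed into itself)
Get-ChildItem -Recurse -File |
  Where-Object Name -ne 'manifest.csv' |
  Get-FileHash -Algorithm SHA256 |
  Export-Csv manifest.csv -NoTypeInformation
```
Store the manifest with the data so the repository can re-verify it on ingest.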
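If the same project is curated on both Linux and Windows, one cross-platform script may be easier to keep consistent than two shell variants. The following is a minimal sketch using only the Python standard library. The function names and the `manifest.sha256` file name are assumptions chosen for illustration, and the output mirrors the two-space `hash  path` layout that `sha256sum` writes.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "manifest.sha256"  # assumed name, matching the shell example


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path) -> Path:
    """Write one '<hash>  <relative path>' line per file under root."""
    manifest = root / MANIFEST_NAME
    # Skip the manifest itself so it is never hashed into its own listing.
    files = sorted(p for p in root.rglob("*") if p.is_file() and p != manifest)
    lines = [f"{sha256_of(p)}  {p.relative_to(root).as_posix()}" for p in files]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(root: Path) -> list[str]:
    """Return relative paths that are missing or whose hash has changed."""
    failures = []
    for line in (root / MANIFEST_NAME).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

Calling `write_manifest(Path("deposit"))` before transfer and `verify_manifest(Path("deposit"))` afterwards should return an empty list when nothing was altered. Files added after the manifest was written are not detected by this sketch.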
How do I confirm files are really the format they claim?
A file named .csv may be tab-delimited Latin-1; a .tiff may be truncated. Identify and validate formats rather than trusting extensions:
- DROID (from The National Archives) identifies formats by signature and emits a report.
- JHOVE validates and confirms well-formedness for PDF, TIFF, XML and more.
```bash
# JHOVE: validate a TIFF
jhove -m TIFF-hul -h text images/plate_01.tiff
```
A "Well-Formed and valid" result is your green light; anything else gets fixed before deposit.
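For the `.csv` case mentioned above (a tab-delimited or Latin-1 file hiding behind the extension), a quick pre-check can run before the heavier tools. This is a rough sketch, not a substitute for DROID or JHOVE: `check_csv` is a hypothetical helper, and it assumes the project convention is comma-delimited UTF-8.

```python
import csv
from pathlib import Path


def check_csv(path: Path, sample_chars: int = 64 * 1024) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        text = path.read_bytes().decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8 (first bad byte at offset {err.start})"]

    problems = []
    try:
        # Only consider common delimiters; the Sniffer guesses from a sample.
        dialect = csv.Sniffer().sniff(text[:sample_chars], delimiters=",;\t|")
    except csv.Error:
        problems.append("could not detect a delimiter")
    else:
        if dialect.delimiter != ",":
            problems.append(f"delimiter is {dialect.delimiter!r}, not a comma")
    return problems
```

`csv.Sniffer` is heuristic: single-column files will be reported as having no detectable delimiter, so treat a reported problem as a prompt to open the file, not as a verdict.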
Mechanical vs judgement checks
| Check | Type | Tool |
|---|---|---|
| Checksums | Mechanical | sha256sum / Get-FileHash |
| Format identification | Mechanical | DROID |
| Format validation | Mechanical | JHOVE |
| Encoding (UTF-8) | Mechanical | file, chardet |
| README quality | Judgement | Human review |
| Sensitive-data handling | Judgement | Human + policy |
| Licence appropriateness | Judgement | Human |
Automate the mechanical rows; never automate the judgement rows away.
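One way to hold that line in a script is to model both kinds of check in the same gate: mechanical checks carry a function, judgement checks carry only a named sign-off that a person has to supply. The sketch below is illustrative; `Check`, `gate`, `no_temp_files` and the temp-file patterns are all hypothetical names and values, not part of any tool mentioned here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# Assumed leftovers to look for; adjust to your own project.
TEMP_SUFFIXES = (".tmp", ".bak", "~")
TEMP_NAMES = {".DS_Store", "Thumbs.db"}


def no_temp_files(root: Path) -> bool:
    """Mechanical check: True when no temp or placeholder files remain."""
    for path in root.rglob("*"):
        if path.is_file() and (
            path.name in TEMP_NAMES or path.name.endswith(TEMP_SUFFIXES)
        ):
            return False
    return True


@dataclass
class Check:
    name: str
    run: Optional[Callable[[Path], bool]] = None  # set for mechanical checks
    signed_off_by: Optional[str] = None  # filled in by a human reviewer


def gate(root: Path, checks: list[Check]) -> list[str]:
    """Return the names of checks that block the deposit (empty = pass)."""
    blocking = []
    for check in checks:
        if check.run is not None:
            passed = check.run(root)
        else:
            # Judgement checks pass only when a reviewer has put a name to them.
            passed = bool(check.signed_off_by)
        if not passed:
            blocking.append(check.name)
    return blocking
```

A list such as `[Check("No temp files", run=no_temp_files), Check("README quality")]` stays blocked on "README quality" until someone sets `signed_off_by`, which keeps the human step visible instead of silently skipped.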
Who should run it, and when?
Best practice is a second person, because the creator stops seeing their own assumptions. If you must self-review, write the checklist down, leave the dataset for a few days, then run it cold against the document rather than from memory. Record the outcome — a dated, signed-off checklist attached to the deposit is the artefact that makes the work defensible.
Turning the checklist into a habit
Keep the checklist in the project repository as CURATION_CHECKLIST.md and copy it, ticked and dated, into each deposit folder. Over a project this builds a consistent paper trail: anyone auditing your data sees exactly what was checked, by whom, and when. Consistency, not heroics, is what makes curation trustworthy at scale.
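The copy-and-date step is easy to forget, so it may be worth scripting. The sketch below assumes a reviewer has already ticked a working copy of the checklist using `[x]` and `[ ]` markers. The `stamp_checklist` function, the header fields and the dated file name are illustrative choices, not an established convention.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def stamp_checklist(filled_in: Path, deposit_dir: Path, reviewer: str,
                    on: Optional[date] = None) -> Path:
    """Copy a ticked checklist into the deposit folder with a sign-off header."""
    on = on or date.today()
    body = filled_in.read_text(encoding="utf-8")

    # Refuse to sign off while any item is still an empty box.
    unticked = [line for line in body.splitlines() if "[ ]" in line]
    if unticked:
        raise ValueError(f"{len(unticked)} checklist item(s) are still unticked")

    header = f"Reviewed by: {reviewer}\nDate: {on.isoformat()}\n\n"
    target = deposit_dir / f"CURATION_CHECKLIST_{on.isoformat()}.md"
    target.write_text(header + body, encoding="utf-8")
    return target
```

Raising on unticked items is a deliberate design choice in this sketch: it makes the sign-off file impossible to produce for a checklist that was never completed.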
Key Takeaways
- Use one written checklist for every deposit so quality stays consistent and auditable.
- The five highest-value checks: format opens, documentation exists, licence declared, sensitive data handled, checksums recorded.
- Generate SHA-256 checksums and store the manifest with the data.
- Identify and validate formats with DROID and JHOVE rather than trusting extensions.
- Convert to open preservation formats but keep lossy-conversion originals.
- Automate mechanical checks; keep human judgement for documentation and ethics.
- Have a second person run it, and attach a dated sign-off to the deposit.
Frequently Asked Questions
What is a pre-deposit curation checklist?
It is a standard list of checks a dataset must pass before it is deposited in a repository, covering completeness, documentation, formats, rights and integrity. Running the same checklist on every deposit keeps quality consistent and decisions defensible.
What are the most important checks before depositing data?
Verify that files open in their stated format, that a README and data dictionary exist, that a licence is declared, that sensitive data is handled, and that checksums are recorded. These five catch the majority of deposit problems.
Should I run curation checks manually or automate them?
Automate the mechanical checks such as fixity, format identification and encoding, and reserve human judgement for documentation quality and ethical review. Tools like JHOVE, DROID and a checksum script handle the mechanical layer reliably.
How do I check file integrity before deposit?
Generate a checksum, typically SHA-256, for every file and store the manifest with the dataset. The repository can then verify the same checksums on ingest, proving nothing was corrupted in transfer.
What formats should I convert to before depositing?
Convert proprietary or fragile formats to open, well-supported preservation formats: CSV for tables, plain UTF-8 or PDF/A for text, TIFF for images. Keep the original alongside if conversion risks losing information.
Who should run the checklist?
Ideally someone other than the data creator, because a second pair of eyes catches assumptions the creator no longer sees. If that is not possible, leave a gap of a few days before self-reviewing against the written checklist.