Track data provenance: A Practical Guide

To track data provenance, you record an unbroken chain from each original source to your final dataset: where the data came from, every transformation you applied, the tool and version that did it, when, and who was responsible. Done well, a reader can take your raw archival scans and reproduce your analysis dataset step by step — and audit every cleaning decision along the way. The good news for working historians is that you need discipline more than software: Git, checksums, a transformation log and a data dictionary cover most of it.

What exactly counts as provenance?

Provenance is more than a source citation. It is the lineage of the data. The W3C PROV model captures it with three pieces:

Entity — a thing: letters_raw.csv, letters_clean.csv, a scanned register.
Activity — a process that used or produced entities: "normalise dates", "deduplicate".
Agent — who or what was responsible: you, a script, an institution.

Every transformation links an input entity, an activity and an agent to an output entity. String those links together and you have a provenance graph that answers "how did this value get here?"

How do I capture provenance without specialist tools?

Build it into the workflow you already have. Four cheap habits do the heavy lifting:

Never edit raw data in place. Keep data/raw/ read-only and write outputs to data/processed/.
Make every transformation a logged script, not a manual spreadsheet edit.
Checksum each file so you can prove it has not changed.
Record each step in a transformation log.

bash

# fingerprint the raw inputs the moment you receive them
sha256sum data/raw/*.csv > data/raw/MANIFEST.sha256
# later, prove nothing drifted
sha256sum -c data/raw/MANIFEST.sha256

What does a usable transformation log look like?

A plain append-only file beside the data, one row per step, is enough to reconstruct lineage:

text

step  date        input              activity                output             tool/version   agent
1     2024-09-10  raw/letters.csv    strip BOM, fix encoding processed/utf8.csv  python 3.11    E.Reed
2     2024-09-11  utf8.csv           normalise sender names  named.csv          openrefine 3.8 E.Reed
3     2024-09-12  named.csv          drop undated rows (n=37) analysis.csv       pandas 2.2     E.Reed

Two things make this powerful: it records counts of what was removed (37 undated rows) and the tool version, so a later reader knows both the scale of a decision and how to reproduce it.

How do I make provenance machine-readable with PROV?

When a project warrants it, serialise the same information as PROV so it can be queried and merged. In PROV-N:

text

entity(raw_letters)
entity(analysis_csv)
activity(clean, 2024-09-12, -)
agent(ereed)
wasGeneratedBy(analysis_csv, clean, -)
used(clean, raw_letters, -)
wasAssociatedWith(clean, ereed, -)

Python's prov library or simply embedding PROV-O in your metadata lets a repository or a reviewer traverse the lineage automatically — useful for linked-data and FAIR workflows.

Why does this matter so much for historical sources?

Historical data is interpretive at every join. When you decide that "Eliz.", "Elizabeth" and "Elisabeth" are one person, or that an undated letter is excluded, you are shaping the result. Provenance turns those silent editorial choices into an auditable record. A reviewer who disagrees with your name-matching can see exactly where it happened and test an alternative — which is the difference between a defensible argument and an unfalsifiable one.

Key Takeaways

Provenance is the full lineage — source, every transformation, tool/version, time and responsible agent — not just a citation.
The W3C PROV model frames it as Entities, Activities and Agents linked by relations.
Keep raw data read-only, make every transformation a logged script, and write outputs separately.
Use SHA-256 checksums to prove files are unchanged since you documented them.
A simple append-only transformation log that records counts and tool versions is often enough.
Serialise to PROV when you need machine-readable, queryable lineage for FAIR or linked-data work.

Frequently Asked Questions

What is data provenance in research?

Data provenance is the documented record of where each piece of data came from, what was done to it, and by whom — a traceable chain from the original source to your final dataset. It lets a reader reconstruct and trust how your numbers were produced.

How is provenance different from a citation?

A citation names the source. Provenance records the whole lineage: the source, every transformation applied, the tools and versions used, timestamps and the person responsible. Citation is one link in a provenance chain.

What is the W3C PROV model?

PROV is a W3C standard that describes provenance with three core ideas — Entities (things, like a dataset), Activities (what processed them), and Agents (who or what was responsible) — plus relations between them. It gives provenance a shared vocabulary.

Do I need special software to track provenance?

No. Disciplined Git history, a data dictionary, checksums and a logged transformation script get you most of the way. Dedicated tools help at scale, but the habit matters more than the tool.

How do I prove a dataset hasn't changed since I documented it?

Record a cryptographic checksum (such as SHA-256) of each file. Re-computing the hash later and comparing it proves the bytes are identical to what your provenance record describes.

Why does provenance matter for historical and archival data?

Because interpretation depends on knowing how messy sources were cleaned, normalised and joined. Undocumented decisions — which spelling won, which records were dropped — silently shape conclusions, and provenance makes them auditable.

What exactly counts as provenance? ​

How do I capture provenance without specialist tools? ​

What does a usable transformation log look like? ​

How do I make provenance machine-readable with PROV? ​

Why does this matter so much for historical sources? ​

Key Takeaways ​

Frequently Asked Questions ​

What is data provenance in research? ​

How is provenance different from a citation? ​

What is the W3C PROV model? ​

Do I need special software to track provenance? ​

How do I prove a dataset hasn't changed since I documented it? ​

Why does provenance matter for historical and archival data? ​

Related reading ​