Research Data Curation

You should version a humanities dataset whenever it will keep changing, when others depend on a stable reference to a specific state, or when reproducibility requires you to point at "the data as it was on this date". You should not invest in versioning a genuinely frozen, one-off dataset that will never change after deposit. The decision hinges on three signals: does the data evolve, do others cite it, and is the file small and text-based or large and binary? Match the tool to those answers rather than versioning everything reflexively.

When does versioning actually pay off?

Versioning earns its overhead in four situations:

  • Living datasets — a gazetteer or prosopography you keep adding to for years.
  • Collaborative work — several people editing the same TEI files or spreadsheet.
  • Cited data — a published dataset others reference and must be able to pin.
  • Reproducible analysis — code that must run against an exact data state.

If none of these apply — a small extraction frozen at deposit and never touched again — formal versioning is ceremony. A clear v1.0 label and a deposit DOI are enough.

Git, DOI, or DVC — which versioning model fits?

These solve different problems and are often combined.

| Model | Best for | Weak at |
|---|---|---|
| Git | Small text data (CSV, TEI, JSON), line-level diffs, collaboration | Large binaries, non-text formats |
| Git-LFS / DVC | Large or binary files with lightweight pointers | Adds setup and a remote store |
| Repository version DOIs (Zenodo) | Citable public releases, the scholarly record | Fine-grained working history |

A common, healthy pattern: edit in Git while working, then publish a Zenodo version DOI at each milestone so each public state is citable.
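The choice between these models can be written down as a small decision helper. This is only a sketch of the rules of thumb in this guide, assuming each signal can be answered with a plain yes or no; the function name `recommend` and its argument order are made up for illustration.

```bash
#!/usr/bin/env bash
set -eu

# recommend EVOLVES CITED LARGE_BINARY   (each argument is "yes" or "no")
# Prints a suggested approach based on the three signals.
recommend() {
  local evolves="$1" cited="$2" large_binary="$3" plan

  if [ "$evolves" = no ]; then
    # Frozen data: one labelled release is enough.
    plan="single v1.0 label + deposit DOI"
  elif [ "$large_binary" = yes ]; then
    plan="DVC or Git-LFS for working history"
  else
    plan="plain Git for working history"
  fi

  # Evolving data that others cite also needs a citable public state.
  if [ "$evolves" = yes ] && [ "$cited" = yes ]; then
    plan="$plan + version DOI at each milestone"
  fi

  printf '%s\n' "$plan"
}

recommend yes yes no    # living gazetteer in CSV that others cite
recommend yes no yes    # growing, uncited collection of page scans
recommend no yes no     # small extraction frozen at deposit
```

Real projects rarely reduce to three yes/no answers, so treat the output as a starting point for discussion rather than a verdict.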

How do I version text-based data with Git?

For CSV, TEI or JSON under a few hundred megabytes, plain Git is ideal because diffs are meaningful line by line.

bash
git init
git add baptisms.csv data_dictionary.md CHANGELOG.md
git commit -m "v1.0: initial 4,012 baptism records, 1813-1837"
git tag v1.0              # name the first release so it stays retrievable

# later, after corrections
git add baptisms.csv CHANGELOG.md
git commit -m "v1.1: fix 12 mis-transcribed dates in 1829 register"
git tag v1.1

The tag gives you a named, retrievable state. The CHANGELOG turns an opaque commit history into something a reuser can read.
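To see what "retrievable" means in practice, the sketch below builds a throwaway repository with made-up rows, tags two states, and then reads the file back as it was at each tag. The temporary directory, the demo identity and the two sample rows are all placeholders, not real data.

```bash
#!/usr/bin/env bash
set -eu

# Throwaway repository with made-up data, so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"            # placeholder identity for the demo commits
git config user.email "demo@example.org"

printf 'id,date\n1,1829-03-04\n' > baptisms.csv
git add baptisms.csv
git commit -q -m "v1.0: initial records"
git tag v1.0

printf 'id,date\n1,1829-03-14\n' > baptisms.csv
git commit -q -am "v1.1: fix date"
git tag v1.1

git tag --list                              # names of all tagged states

# Read the file exactly as it was at each tag, without changing the working copy.
old_row=$(git show v1.0:baptisms.csv | tail -n 1)
new_row=$(git show v1.1:baptisms.csv | tail -n 1)
echo "v1.0: $old_row"
echo "v1.1: $new_row"

git diff v1.0 v1.1 -- baptisms.csv          # line-level changes between the two releases
```

`git show TAG:FILE` is usually the gentlest way to inspect an old state, because it leaves your current files alone.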

What about large or binary datasets?

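Committing 40 GB of TIFFs into Git bloats the repository for good in practice: Git keeps every version of every blob, so the files stay in history even after you delete them, and removing them means rewriting history for every collaborator. Use a tool designed for large files.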

bash
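# DVC keeps a small .dvc pointer in Git; the data lives in remote storage
dvc init                  # run inside an existing Git repository
dvc remote add -d storage /mnt/archive/dvc-store   # placeholder path; point this at your own remote
dvc add scans/            # creates scans.dvc, a tiny pointer file
git add scans.dvc .gitignore .dvc/config
git commit -m "Track scans with DVC"
dvc push                  # uploads the actual bytes to your remote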

Now Git tracks the pointer, while the bytes live in S3, a network drive or institutional storage. The history stays light and you can still check out any version.
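The pointer idea itself is simple enough to imitate with standard tools. The sketch below is an illustration of the concept only, not DVC's actual file format or commands: the helper names `track` and `restore`, the `.pointer` suffix and the use of SHA-256 are all assumptions made for the example, and it relies on `sha256sum` being available.

```bash
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
store="$work/remote-store"      # stands in for S3 or a network drive
mkdir -p "$store"

# track FILE: copy the bytes into the store under their hash, leave a pointer behind
track() {
  local file="$1" hash
  hash=$(sha256sum "$file" | cut -d ' ' -f 1)
  cp "$file" "$store/$hash"
  printf '%s\n' "$hash" > "$file.pointer"
}

# restore FILE: fetch the bytes named by the pointer
restore() {
  local file="$1"
  cp "$store/$(cat "$file.pointer")" "$file"
}

printf 'scan bytes, first state\n' > "$work/page001.tif"
track "$work/page001.tif"
cp "$work/page001.tif.pointer" "$work/pointer.v1"   # what Git would remember for v1

printf 'scan bytes, second state\n' > "$work/page001.tif"
track "$work/page001.tif"

# Going back to v1 means restoring the old pointer, then fetching its bytes.
cp "$work/pointer.v1" "$work/page001.tif.pointer"
restore "$work/page001.tif"
restored=$(cat "$work/page001.tif")
echo "$restored"
```

With DVC itself, the equivalent step is to check out the older `scans.dvc` with Git and then run `dvc checkout` to bring the matching bytes back into the working directory.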

How should version numbers and changelogs work?

A version number that nobody can interpret is just noise. Adopt a semantic-style scheme and keep a CHANGELOG.

text
MAJOR.MINOR.PATCH
MAJOR  breaking schema change (renamed/removed column)
MINOR  added rows or columns, backward compatible
PATCH  corrections that do not change the schema

## v2.0.0 - 2025-04-10
- BREAKING: renamed `dob` to `birth_date`
## v1.1.0 - 2025-02-18
- Added `parish` column; +312 rows from D2052/3
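The bump rules above can be sketched as a small shell function. The name `bump_version` and the `major`/`minor`/`patch` keywords are made up for this example; it assumes versions are always written as three dot-separated integers with an optional leading `v`.

```bash
#!/usr/bin/env bash
set -eu

# bump_version CURRENT KIND -> prints the next version
# KIND: major (breaking schema change), minor (added rows or columns), patch (corrections)
bump_version() {
  local current="${1#v}" kind="$2"
  local major minor patch rest

  # Split "1.1.0" into its three parts with parameter expansion.
  major="${current%%.*}"
  rest="${current#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"

  case "$kind" in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown change kind: $kind" >&2; return 1 ;;
  esac

  printf 'v%s.%s.%s\n' "$major" "$minor" "$patch"
}

bump_version v1.1.0 major   # renamed a column   -> v2.0.0
bump_version v1.1.0 minor   # added rows         -> v1.2.0
bump_version v1.1.0 patch   # fixed a few dates  -> v1.1.1
```

Resetting the lower parts to zero on a major or minor bump keeps the number readable: v2.0.0 says "new schema, no corrections yet".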

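Cite the exact version DOI in papers so readers can retrieve the state you analysed. On Zenodo, cite the concept DOI, which always resolves to the newest release, when you want readers to follow the latest version.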

When is versioning the wrong choice?

  • The dataset is frozen at deposit and contractually will not change.
  • It is a tiny, regenerable extraction (just rerun the query).
  • The "data" is really output that should be reproduced from code, not stored.

In these cases a single labelled release plus reproducible code beats a versioning apparatus nobody will maintain.

Does versioning replace backups?

No, and conflating the two can lose data. Version control is a record of intentional change; it does nothing if the whole repository is deleted or corrupted. Keep 3-2-1 backups (three copies, on two different storage media, with one off-site) alongside versioning. They answer different questions: "what changed?" versus "is there still a copy?".
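One way to make a Git-versioned dataset easy to back up is to pack the whole history into a single file with `git bundle`, which can then be copied to your other backup locations. The sketch below uses a throwaway repository, simulates losing it, and restores from the bundle; the directory names, demo identity and sample row are placeholders.

```bash
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Demo User"            # placeholder identity for the demo commit
git config user.email "demo@example.org"
printf 'id,date\n1,1829-03-04\n' > baptisms.csv
git add baptisms.csv
git commit -q -m "v1.0: initial records"
git tag v1.0

# One file holding every commit, branch and tag; this is the thing to copy elsewhere.
git bundle create "$work/dataset.bundle" --all
git bundle verify "$work/dataset.bundle"

# Simulate losing the working repository entirely.
cd "$work"
rm -rf repo

# The bundle alone is enough to get the history and tags back.
git clone -q "$work/dataset.bundle" restored
restored_tags=$(git -C restored tag --list)
echo "$restored_tags"
```

A bundle only covers what Git tracks, so data held in DVC or Git-LFS remote storage still needs its own backup.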

Key Takeaways

  • Version when data evolves, is collaborative, is cited, or must be reproduced — not reflexively.
  • Plain Git suits small text data; DVC or Git-LFS suit large binaries; DOI versions mark public releases.
  • Never commit large binaries directly into Git; the bloat stays in history unless you rewrite it.
  • Use semantic-style version numbers backed by a readable CHANGELOG.
  • Frozen one-off datasets need only a label and a DOI, not a versioning system.
  • Versioning and backups are complementary; neither replaces the other.

Frequently Asked Questions

Should I use Git to version a dataset?

Git is excellent for small, text-based, line-oriented data like CSV or TEI under a few hundred megabytes. For large binaries or datasets that change wholesale, Git alone is a poor fit, and Git-LFS, DVC or repository version-DOIs serve better.

When is versioning a dataset not worth it?

Skip formal versioning for one-off, frozen datasets that will never change after deposit, or for tiny throwaway extractions. The overhead of tooling and discipline outweighs the benefit when there is genuinely only ever one state.

What is the difference between Git and DOI versioning?

Git tracks fine-grained, line-level changes during active work and mainly serves the people editing the data. DOI versioning, as on Zenodo, mints a citable identifier per published release for the scholarly record. Many projects use Git while working and DOI versions at each public milestone.

How do I version a large binary dataset like images?

Use DVC or Git-LFS, which store large files separately and keep lightweight pointers under version control, or rely on a repository's release-based version DOIs. Committing large binaries directly into Git bloats the history, and undoing that requires rewriting it.

What should a dataset version number look like?

Semantic-style versioning works well: increment the major number for breaking schema changes, the minor for added rows or fields, and the patch for corrections. Record what changed in a CHANGELOG so the number means something.

Does versioning replace backups?

No. Version control records the history of intentional changes, but a corrupted or deleted repository can still lose everything. Versioning and 3-2-1 backups are complementary, not substitutes for each other.