How to Cite Software and Data

To cite software and data correctly, point to a versioned, archived object with a persistent identifier: name the software or dataset, its authors, the exact version, and a DOI. A citation without a version and a persistent link cannot identify what you actually used, so it fails the two jobs of citation — giving credit and supporting reproducibility. Prefer a Zenodo or repository DOI over a bare GitHub or website URL, which can change or vanish.

This step-by-step guide covers software and data citation for humanities work, the elements that matter, and the pitfalls that quietly undermine reproducibility.

Why cite software and data, not just publications?

Your results depend on the tools and inputs as much as the argument. If you normalised place-names with a particular gazetteer release, or ran network metrics in a specific version of a library, those choices shaped the findings. Citing them gives credit to the people who built and curated them, and lets a reader fetch the exact tool and data to reproduce or scrutinise your work. Uncited software is a silent dependency that breaks reproducibility the moment the tool changes.

What elements must a software citation contain?

Six elements, with version and identifier being non-negotiable:

Authors or development team
Title / name of the software
Exact version (e.g. 2.2.1)
Year of that release
Publisher or repository
Persistent identifier (DOI preferred)

A concrete example in prose: Reed, E. (2025). gazetteer-match (Version 1.3.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.xxxxxxx. The version pins behaviour; the DOI pins location for the long term.
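The six elements can also be assembled programmatically. The following is a minimal Python sketch, not a reference implementation: the `SoftwareCitation` class and its field names are hypothetical, the output only approximates APA style, and the DOI is the same placeholder used in the prose example.

```python
from dataclasses import dataclass


@dataclass
class SoftwareCitation:
    """Hypothetical container for the six citation elements."""
    authors: str    # e.g. "Reed, E."
    title: str      # name of the software
    version: str    # exact release used
    year: int       # year of that release
    publisher: str  # publisher or repository, e.g. "Zenodo"
    doi: str        # bare DOI, without the https://doi.org/ prefix

    def format(self) -> str:
        """Return an APA-style reference string for the software."""
        # Refuse to build a citation that lacks the two non-negotiable elements.
        if not self.version.strip() or not self.doi.strip():
            raise ValueError("a software citation needs both a version and a DOI")
        return (
            f"{self.authors} ({self.year}). {self.title} "
            f"(Version {self.version}) [Computer software]. "
            f"{self.publisher}. https://doi.org/{self.doi}"
        )


example = SoftwareCitation(
    authors="Reed, E.",
    title="gazetteer-match",
    version="1.3.0",
    year=2025,
    publisher="Zenodo",
    doi="10.5281/zenodo.xxxxxxx",  # placeholder DOI, not a real record
)
print(example.format())
```

Raising an error on a missing version or DOI is a design choice that mirrors the rule above: an incomplete citation is rejected rather than silently emitted.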

How do I make my own software citable?

Add a CITATION.cff file to the repository root so others can cite you correctly without guessing:

yaml

cff-version: 1.2.0
title: gazetteer-match
message: "If you use this software, please cite it as below."
authors:
  - family-names: Reed
    given-names: Elara
version: 1.3.0
doi: 10.5281/zenodo.1234567
date-released: 2025-05-10
license: MIT

GitHub reads this file and shows a "Cite this repository" button with ready-made BibTeX and APA exports. Connect the repository to Zenodo so each release is archived and assigned a DOI, then put that DOI in the file.
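If release metadata already lives in a script or build step, the file can be generated rather than hand-edited. The sketch below is one possible approach using only the standard library; `render_cff` and its dictionary keys are hypothetical names, it handles a single author, and it quotes free-text values with `json.dumps` on the assumption that a JSON string is an acceptable YAML double-quoted scalar.

```python
import json


def render_cff(meta: dict) -> str:
    """Render a minimal CITATION.cff document from a flat metadata dict."""
    required = ("title", "family_names", "given_names",
                "version", "doi", "date_released")
    missing = [key for key in required if not meta.get(key)]
    if missing:
        raise ValueError(f"missing metadata: {', '.join(missing)}")

    def quoted(value: str) -> str:
        # Double-quote free text so colons or leading digits cannot confuse YAML.
        return json.dumps(value, ensure_ascii=False)

    lines = [
        "cff-version: 1.2.0",
        f"title: {quoted(meta['title'])}",
        'message: "If you use this software, please cite it as below."',
        "authors:",
        f"  - family-names: {quoted(meta['family_names'])}",
        f"    given-names: {quoted(meta['given_names'])}",
        f"version: {quoted(meta['version'])}",
        f"doi: {meta['doi']}",
        f"date-released: {meta['date_released']}",
    ]
    return "\n".join(lines) + "\n"


cff_text = render_cff({
    "title": "gazetteer-match",
    "family_names": "Reed",
    "given_names": "Elara",
    "version": "1.3.0",
    "doi": "10.5281/zenodo.1234567",  # placeholder DOI from the example above
    "date_released": "2025-05-10",
})
print(cff_text)
```

Writing `cff_text` to a file named CITATION.cff in the repository root then gives the same result as maintaining the file by hand.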

How do I cite a dataset, including the right version?

Datasets evolve — records get added, errors corrected — so generic links are inadequate. Deposit data in a repository that mints a DOI per version (Zenodo, Dryad, the UK Data Service, a Dataverse), and cite the DOI of the exact version you used:

Pitfall	Better practice
Linking to a live website table	Cite an archived, versioned dataset DOI
Citing "the latest version"	Cite the specific version you analysed
No retrieval date for live sources	Record retrieval date if no version exists
Omitting the dataset's own authors	Credit the data creators and curators

If the data only exists as a live web resource with no DOI, cite it with an explicit retrieval date and, where possible, archive a snapshot you can reference.
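The choice between a versioned DOI and a retrieval date can be made explicit in code. This is a rough Python sketch under stated assumptions: `cite_dataset` is a hypothetical helper, the reference layout only approximates common author-date styles, and the dataset name, creator, DOI, and URL in the usage lines are invented for illustration.

```python
from datetime import date
from typing import Optional


def cite_dataset(creators: str, year: int, title: str, publisher: str,
                 version: Optional[str] = None, doi: Optional[str] = None,
                 url: Optional[str] = None,
                 retrieved: Optional[date] = None) -> str:
    """Format a dataset reference, preferring a versioned DOI over a live URL."""
    if version and doi:
        # Archived, versioned release: the DOI pins exactly what was analysed.
        return (f"{creators} ({year}). {title} (Version {version}) [Data set]. "
                f"{publisher}. https://doi.org/{doi}")
    if url and retrieved:
        # Live resource with no version: the retrieval date is the only anchor.
        return (f"{creators} ({year}). {title} [Data set]. {publisher}. "
                f"Retrieved {retrieved.isoformat()} from {url}")
    raise ValueError("need either version + DOI, or URL + retrieval date")


# All values below are hypothetical examples, not real records.
versioned = cite_dataset("Reed, E.", 2025, "Example Gazetteer", "Zenodo",
                         version="2.0", doi="10.5281/zenodo.xxxxxxx")
live = cite_dataset("Reed, E.", 2025, "Example Gazetteer", "Example Project",
                    url="https://example.org/places",
                    retrieved=date(2025, 5, 10))
print(versioned)
print(live)
```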

What is the version pitfall, and how do I avoid it?

The single most common failure is omitting the version. "Analysis run in Python with pandas" tells a reader nothing reproducible, because pandas 1.x and 2.x differ in behaviour. Capture exact versions automatically:

bash

pip freeze > requirements.txt           # records the exact version of every installed package
python --version > python-version.txt   # saves the interpreter version alongside it

Cite the headline tools in your methods section and let requirements.txt (committed to the repository) carry the full dependency list. Together, these make the Python environment both cited and rebuildable from pinned versions.
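For the handful of headline tools named in the methods section, the versions can also be read from inside Python with the standard library's `importlib.metadata`. The sketch below assumes Python 3.8 or later; `record_versions` is a hypothetical helper, and the package names passed to it are illustrative choices, not a recommendation.

```python
import platform
from importlib import metadata


def record_versions(packages):
    """Map each named package to its installed version, plus the interpreter."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report the gap instead of failing, so the record stays complete.
            versions[name] = "not installed"
    return versions


# "pandas" and "networkx" are examples; list the tools your analysis relies on.
for name, version in record_versions(["pandas", "networkx"]).items():
    print(f"{name} {version}")
```

The printed lines can be pasted into the methods section, while requirements.txt remains the full record.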

Where do software and data citations belong in my paper?

Treat them as first-class references, not footnotes buried in prose. List software and datasets in your reference list with their DOIs, mention the key tools and the data DOI in the methods section or data-availability statement, and ensure every DOI resolves. Reviewers increasingly check this, and digital humanities (DH) journals now commonly require a data-availability statement that names archived, citable sources.
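Checking that every DOI resolves can be partly automated before submission. The following Python sketch uses only the standard library: `doi_to_url` and `doi_resolves` are hypothetical helpers, the regular expression is a heuristic for modern DOIs rather than a full validator, and the network check can return False for a valid DOI if the landing page refuses automated requests.

```python
import re
import urllib.request

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Normalise a DOI (bare, 'doi:'-prefixed, or a URL) to a resolver URL."""
    bare = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi.strip(),
                  flags=re.IGNORECASE)
    if not DOI_PATTERN.match(bare):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return f"https://doi.org/{bare}"


def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """Return True if the resolver answers without an HTTP error (needs network)."""
    request = urllib.request.Request(doi_to_url(doi), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # Covers HTTP errors, DNS failures, and timeouts raised by urllib.
        return False


# Placeholder DOI from the CITATION.cff example; only the syntax is checked here.
print(doi_to_url("doi:10.5281/zenodo.1234567"))
```

Calling `doi_resolves` on each DOI in the reference list gives a quick first pass; any DOI reported as failing should still be opened by hand before it is treated as broken.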

Key Takeaways

Cite software and data because they shape results; uncited dependencies break reproducibility.
Always include the exact version and a persistent identifier, ideally a DOI.
Add a CITATION.cff and connect Zenodo so others can cite your software precisely.
Cite the versioned DOI of the exact dataset release you used, not a generic project link.
Capture environments with pip freeze so your full dependency set is pinned and reproducible.
Place software and data citations in the reference list and a data-availability statement, and confirm every DOI resolves.

Frequently Asked Questions

Why should I cite software at all?

Software is a research contribution that shapes your results, and citing it gives credit, supports reproducibility, and lets readers find the exact tool and version you used.

What is the most important element of a software citation?

The exact version. Software changes behaviour between releases, so a citation without a version number cannot pin down the tool that produced your result.

How do I cite a specific version of a dataset?

Cite the versioned DOI for that release, not a generic project link. Repositories like Zenodo and Dryad mint a distinct DOI per version precisely so you can pin one.

What is a CITATION.cff file?

A small machine-readable YAML file in a repository's root that states how the software should be cited. GitHub reads it to show a Cite button and export formats.

Should I cite tools like pandas or QGIS in a humanities paper?

Yes, when they materially shaped your analysis. Citing the core tools that produced your figures is good scholarly practice and increasingly expected by reviewers.
