Skip to content
Research Data Curation

When licensing research data goes wrong, the root cause is almost always one of three things: you applied a licence to rights you do not hold, you used a software licence on data, or you released a mix of open and restricted material under a single blanket licence. Fix these by separating what rights exist, who holds them, and which files they cover before you pick any licence string. A licence is a permission you grant — you cannot grant what you do not own.

Why is my repository rejecting the licence?

Most repository rejections trace to a category error. Zenodo, the UK Data Service and Dryad all distinguish data licences from software licences. If you tag a CSV with MIT or GPL-3.0, automated checks flag it because those licences assume copyrightable code, while raw data may be facts (uncopyrightable) or protected by a separate database right. The fix is mechanical:

AssetRecommended licencesAvoid
Tabular / structured dataCC0, CC BY 4.0, ODbL, PDDLMIT, GPL, Apache
Accompanying code/scriptsMIT, Apache 2.0, BSD-3CC BY (ambiguous on code)
Documentation / data paperCC BY 4.0CC0 (loses attribution)
Mixed depositPer-file licencesOne blanket licence

Diagnose: do you actually hold the rights?

The most common silent error is licensing content you transcribed or scraped from protected sources. Run this triage:

text
Q1. Are the underlying sources in copyright? (e.g. post-1900 text)
    -> If yes, you cannot openly licence the source content.
Q2. Did YOU add original structure, annotation or selection?
    -> You may licence YOUR contribution, not the underlying work.
Q3. Is the data purely factual (dates, coordinates, measurements)?
    -> Facts are not copyrightable, but the COMPILATION may carry a database right.

If Q1 is yes, you licence your transcription's editorial layer and clearly state the source's status using a rights statement (rightsstatements.org has standard URIs for this).

What is the database right and how does it bite?

In the UK and EU, the sui generis database right protects substantial investment in obtaining, verifying or presenting a database — independently of copyright in the contents. So a gazetteer of public-domain place names can still be encumbered. To release it cleanly, use a waiver that addresses it explicitly:

  • PDDL (Public Domain Dedication and Licence) — waives database right, like CC0 for data.
  • ODbL — share-alike for databases, requires attribution and openness of derivatives.
  • CC0 1.0 — version 4.0 of CC licences also covers sui generis rights, so CC0 is now safe for databases too.

Fix: splitting a mixed-rights deposit

When a deposit blends your open data with restricted records, do not compromise on one licence. Structure it:

text
dataset/
  open/            # CC BY 4.0 — your transcriptions and derived tables
    LICENSE.txt
  restricted/      # access-controlled; data-access agreement required
    ACCESS_TERMS.md
  code/            # MIT — cleaning and analysis scripts
    LICENSE
  README.md        # states which terms apply to which folder

State the mapping in the README so a reuser is never guessing.

How do I declare the licence so machines can read it?

Human-readable text is not enough; discovery systems need a machine-readable URI. In dataset metadata (DataCite, DCAT) declare the licence with its canonical SPDX or Creative Commons URL:

xml
<rightsList>
  <rights rightsURI="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</rights>
</rightsList>

For code, add an SPDX identifier in a LICENSE file and reference it in codemeta.json. Consistent URIs let aggregators filter by reusability.

Common errors and their real fixes

  • "No licence at all." The default is all rights reserved — reusers legally cannot touch it. Always add an explicit licence, even CC0.
  • "CC BY-NC for everything." Non-commercial clauses block legitimate reuse (teaching companies, some universities) and are not "open". Reserve NC only when genuinely needed.
  • "Two contradictory licences in different files." Pick one authoritative LICENSE per asset type and remove strays.

Key Takeaways

  • You can only licence rights you hold; check copyright in the underlying sources first.
  • Use Creative Commons or Open Data Commons for data; reserve MIT/Apache for code.
  • The EU/UK database right can restrict even factual data — use PDDL, ODbL or CC0 (4.0) to waive it.
  • Split mixed-rights deposits into per-folder licences rather than one blanket grant.
  • Declare licences with machine-readable URIs in your metadata.
  • "No licence" means all rights reserved — never leave it blank.
  • Avoid non-commercial clauses unless a real constraint demands them.

Frequently Asked Questions

Which licence should I use for a humanities dataset?

For data, CC0 or CC BY 4.0 are the most reusable and repository-recommended choices. Use CC0 to maximise reuse and avoid attribution-stacking; use CC BY if your funder or institution requires attribution.

No. You can only licence rights you hold. If a dataset transcribes copyrighted sources or third-party records, you may licence your own structuring and annotations but not the underlying protected content.

Why does my repository reject a GPL or MIT licence on data?

Software licences are written for code, not data, and create ambiguity about whether facts and databases are covered. Repositories prefer Creative Commons or the Open Data Commons family for datasets and reserve code licences for accompanying scripts.

What is the database right and why does it matter in the EU and UK?

The sui generis database right protects substantial investment in compiling a database, separate from copyright in its contents. It means a dataset of public-domain facts can still carry restrictions, so use ODbL or a CC waiver to release it cleanly.

How do I licence a dataset that mixes my data with restricted records?

Split it. Release the openly licensable portion under CC BY or CC0, document the restricted portion separately under a data-access agreement, and never apply a single open licence across both.

Should code and data in the same deposit share one licence?

No. Licence code under a software licence such as MIT or Apache 2.0 and data under Creative Commons or ODC. State both clearly in a LICENSE file and the README so reusers know which terms apply to which files.