Add discovery metadata to datasets: A Practical Guide

To add discovery metadata to a dataset, describe it with the DataCite schema — at minimum the six mandatory fields (Identifier, Creator, Title, Publisher, PublicationYear, ResourceType) — then enrich it with a clear Description, controlled Subjects, and RelatedIdentifiers linking to papers and sources. Discovery metadata is what makes a dataset findable; without it, even an excellent deposit is invisible to DOI infrastructure and to tools like Google Dataset Search.

What counts as discovery metadata?

Metadata splits into three jobs: descriptive/discovery (who, what, when — for finding), structural (how files relate), and administrative/technical (formats, rights, preservation). Discovery metadata is the first kind. For historians it answers: who made this corpus, what period and place does it cover, what subjects does it touch, and how does it relate to publications? Get this layer right and aggregators can surface your data to people who never knew it existed.

The DataCite mandatory core

DataCite requires six properties before a DOI can be registered and made findable. Treat them as non-negotiable:

Property	Example
Identifier	`10.5281/zenodo.7654321`
Creator	`Reed, Elara`
Title	Victorian Charity Bazaar Advertisements, 1850–1890
Publisher	Zenodo
PublicationYear	2025
ResourceType	Dataset

Everything else is "recommended", but recommended fields are what actually drive discovery.
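A quick way to catch gaps before you open the deposit form is to check a draft record against the six property names. The sketch below is illustrative only: the dictionary layout and the `missing_mandatory` helper are assumptions made for this example, not part of DataCite or any repository API.

```python
# The six mandatory DataCite properties named in the table above.
MANDATORY = (
    "Identifier", "Creator", "Title",
    "Publisher", "PublicationYear", "ResourceType",
)

def missing_mandatory(record: dict) -> list[str]:
    """Return the mandatory properties that are absent or blank in `record`."""
    return [
        name for name in MANDATORY
        if not str(record.get(name) or "").strip()
    ]

# Hypothetical draft using the example values from the table.
draft = {
    "Identifier": "10.5281/zenodo.7654321",
    "Creator": "Reed, Elara",
    "Title": "Victorian Charity Bazaar Advertisements, 1850–1890",
    "Publisher": "Zenodo",
    "PublicationYear": 2025,
    "ResourceType": "",  # left blank on purpose
}

print(missing_mandatory(draft))
```

Here only `ResourceType` is reported, because it is the one property left blank.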

How do I write a Description that gets found?

Lead with what, when and where in plain language, then state scope and method. Avoid jargon a searcher would not type:

```text
Bad:  "A curated corpus leveraging novel extraction methodologies."
Good: "1,200 charity-bazaar advertisements transcribed from English
       newspapers, 1850–1890, each with date, place, organising
       charity and verbatim text. Compiled from the British Newspaper
       Archive for research on Victorian philanthropy."
```

The good version contains the actual query terms — period, place, document type, theme.

A worked DataCite XML fragment

This is what a repository serves to DOI infrastructure. You rarely hand-write it, but knowing the shape helps you fill the form correctly:

```xml
<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5281/zenodo.7654321</identifier>
  <creators>
    <creator>
      <creatorName>Reed, Elara</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID">0000-0002-1825-0097</nameIdentifier>
    </creator>
  </creators>
  <titles><title>Victorian Charity Bazaar Advertisements, 1850-1890</title></titles>
  <!-- The remaining mandatory properties, using the values from the table above -->
  <publisher>Zenodo</publisher>
  <publicationYear>2025</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
  <subjects>
    <subject subjectScheme="LCSH">Charities--England--History--19th century</subject>
    <subject subjectScheme="Getty AAT" valueURI="http://vocab.getty.edu/aat/300026031">advertisements</subject>
  </subjects>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo">10.1234/joh.2025.42</relatedIdentifier>
  </relatedIdentifiers>
</resource>
```

Note the ORCID, the controlled-vocabulary subjects, and the link to the data paper.
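If you want to confirm what a repository actually generated, a DataCite XML record can be read with Python's standard library. This is a minimal sketch: `summarise` is a hypothetical helper, and the embedded XML is a trimmed copy of the fragment above rather than a real repository export.

```python
import xml.etree.ElementTree as ET

# Namespace used by DataCite kernel-4 records; "d" is an arbitrary local prefix.
NS = {"d": "http://datacite.org/schema/kernel-4"}

# Trimmed copy of the fragment above, for illustration only.
XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5281/zenodo.7654321</identifier>
  <creators>
    <creator>
      <creatorName>Reed, Elara</creatorName>
    </creator>
  </creators>
  <subjects>
    <subject subjectScheme="LCSH">Charities--England--History--19th century</subject>
  </subjects>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo">10.1234/joh.2025.42</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""

def summarise(xml_text: str) -> dict:
    """Pull the discovery-relevant values out of a DataCite XML record."""
    root = ET.fromstring(xml_text)
    return {
        "doi": root.findtext("d:identifier", namespaces=NS),
        "creators": [
            c.findtext("d:creatorName", namespaces=NS)
            for c in root.iterfind("d:creators/d:creator", NS)
        ],
        "subjects": [
            (s.get("subjectScheme"), s.text)
            for s in root.iterfind("d:subjects/d:subject", NS)
        ],
        "related": [
            (r.get("relationType"), r.text)
            for r in root.iterfind("d:relatedIdentifiers/d:relatedIdentifier", NS)
        ],
    }

summary = summarise(XML)
print(summary)
```

Subjects that come back with a scheme of `None` were entered as free text, which is worth fixing before publishing.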

Why do controlled subjects matter so much?

Free-text keywords scatter; controlled vocabularies cluster. Mapping your subjects to LCSH, the Getty Art & Architecture Thesaurus, or a domain thesaurus lets aggregators group your dataset with siblings and lets cross-language search work. Always carry the scheme name and, where possible, the term URI so the value is unambiguous.
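The scatter-versus-cluster effect is easy to see in a toy example. The three records below are invented for illustration. Only the Getty AAT URI is taken from the XML fragment earlier in this guide, and `group_by` is a hypothetical helper, not an aggregator's real logic.

```python
from collections import defaultdict

AAT_ADVERTISEMENTS = "http://vocab.getty.edu/aat/300026031"  # URI from the fragment above

# Hypothetical datasets: each depositor typed a different keyword,
# but all three mapped it to the same controlled term.
records = [
    {"id": "A", "keyword": "adverts",        "scheme": "Getty AAT", "uri": AAT_ADVERTISEMENTS},
    {"id": "B", "keyword": "Advertisements", "scheme": "Getty AAT", "uri": AAT_ADVERTISEMENTS},
    {"id": "C", "keyword": "annonces",       "scheme": "Getty AAT", "uri": AAT_ADVERTISEMENTS},
]

def group_by(items, key):
    """Group record ids by whatever `key` extracts from each record."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item["id"])
    return dict(groups)

by_keyword = group_by(records, lambda r: r["keyword"])
by_term = group_by(records, lambda r: (r["scheme"], r["uri"]))

print(len(by_keyword), "groups by free-text keyword")
print(len(by_term), "group by scheme + URI")
```

Grouping on the free-text keyword leaves each dataset alone in its own group, while grouping on scheme plus URI puts all three together, including the French-language one.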

Should I also add schema.org markup?

Yes, if your dataset has a landing page. DOI infrastructure reads DataCite, but Google Dataset Search crawls landing pages for schema.org/Dataset markup, most commonly embedded as JSON-LD:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Victorian Charity Bazaar Advertisements, 1850-1890",
  "description": "1,200 transcribed advertisements with date, place and text.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": { "@type": "Person", "name": "Elara Reed" },
  "identifier": "https://doi.org/10.5281/zenodo.7654321"
}
</script>
```

Most repositories generate this for you — but only from the values you enter, so accuracy upstream is everything.
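If you maintain your own landing page, the same markup can be generated from the values you already hold rather than typed by hand. This is a rough sketch under stated assumptions: `dataset_jsonld` and `script_tag` are hypothetical helper names, and the field set simply mirrors the example above rather than the full schema.org/Dataset vocabulary.

```python
import json

def dataset_jsonld(name, description, license_url, creator_name, doi):
    """Serialise a minimal schema.org/Dataset description as JSON-LD."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        "creator": {"@type": "Person", "name": creator_name},
        "identifier": f"https://doi.org/{doi}",
    }
    return json.dumps(doc, indent=2, ensure_ascii=False)

def script_tag(jsonld: str) -> str:
    """Wrap JSON-LD in a script element for a landing page."""
    # Escape "</" so a value containing "</script>" cannot close the element early;
    # "\/" is a valid JSON escape for "/".
    safe = jsonld.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{safe}\n</script>'

markup = script_tag(dataset_jsonld(
    name="Victorian Charity Bazaar Advertisements, 1850-1890",
    description="1,200 transcribed advertisements with date, place and text.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Elara Reed",
    doi="10.5281/zenodo.7654321",
))
print(markup)
```

Generating the block from one source of values keeps the landing-page markup and the DataCite record from drifting apart.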

A practical end-to-end workflow

1. Draft the six mandatory DataCite fields.
2. Write a query-term-rich Description.
3. Map subjects to a controlled vocabulary (carry scheme + URI).
4. Add ORCIDs for creators and ROR for affiliations.
5. Link related papers/sources via RelatedIdentifier.
6. State the licence (CC BY / CC0).
7. Deposit; verify the generated DataCite + schema.org records.
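Steps 2 to 6 can be turned into a small pre-deposit check. The sketch below is one possible shape, not a repository feature: the record keys (`description`, `subjects`, `orcids`, `related`, `licence`) and both function names are assumptions made for this example. The ORCID check uses the ISO 7064 MOD 11-2 check digit that ORCID iDs carry as their final character.

```python
import re

def orcid_checksum_ok(orcid: str) -> bool:
    """Validate the format and final check character of an ORCID iD."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:          # first 15 digits feed the checksum
        total = (total + int(ch)) * 2
    expected = (12 - total % 11) % 11
    return digits[-1] == ("X" if expected == 10 else str(expected))

def unmet_steps(record: dict) -> list[str]:
    """Return the workflow steps that the draft record does not yet satisfy."""
    subjects = record.get("subjects") or []
    orcids = record.get("orcids") or []
    checks = {
        "2. description": bool(record.get("description", "").strip()),
        "3. subjects carry scheme + URI": bool(subjects)
            and all(s.get("scheme") and s.get("uri") for s in subjects),
        "4. creator ORCIDs valid": bool(orcids)
            and all(orcid_checksum_ok(o) for o in orcids),
        "5. related identifiers": bool(record.get("related")),
        "6. licence": bool(record.get("licence")),
    }
    return [step for step, ok in checks.items() if not ok]

# Hypothetical draft: the LCSH subject has no term URI yet.
draft = {
    "description": "1,200 charity-bazaar advertisements transcribed from English newspapers.",
    "subjects": [{"scheme": "LCSH", "term": "Charities--England--History--19th century", "uri": ""}],
    "orcids": ["0000-0002-1825-0097"],
    "related": ["10.1234/joh.2025.42"],
    "licence": "CC BY 4.0",
}
print(unmet_steps(draft))
```

For this draft only step 3 is reported, since the subject lacks a URI.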

Key Takeaways

Discovery metadata makes a dataset findable; structural and technical metadata do not.
Fill the six mandatory DataCite fields, then invest in the recommended ones.
Write the Description with the exact terms a searcher would type.
Use controlled subjects (LCSH, Getty AAT) with scheme names and URIs.
Add ORCIDs and RelatedIdentifiers to connect creators and publications.
Embed schema.org Dataset JSON-LD so Google Dataset Search indexes you.
Prepare all values before depositing — the upload form should be transcription.

Frequently Asked Questions

What is discovery metadata for a dataset?

Discovery metadata is the structured description that helps people and search systems find a dataset: title, creators, description, keywords, dates, subjects and identifiers. It is distinct from technical or structural metadata, which describe how files are organised.

What is the DataCite metadata schema?

The DataCite Metadata Schema is the set of properties that DataCite, a DOI registration agency, uses to describe research outputs such as datasets when registering their DOIs. It defines a small set of mandatory properties (Identifier, Creator, Title, Publisher, PublicationYear and ResourceType) plus further recommended and optional ones, such as Subject and Description.

Which DataCite fields are mandatory?

Six are required: Identifier (the DOI), Creator, Title, Publisher, PublicationYear and ResourceType. Everything else, including Description, Subject and RelatedIdentifier, is recommended but strongly improves discoverability.

How do keywords and subjects improve discovery?

They connect your dataset to controlled vocabularies and to the terms people actually search. Mapping subjects to a thesaurus such as the Library of Congress Subject Headings or Getty AAT lets aggregators cluster related datasets reliably.

Should I add metadata before or after depositing?

Prepare it before depositing so the upload form is a transcription, not a drafting exercise. Most repositories auto-generate DataCite metadata from their form, so getting the source values right first ensures the DOI record is complete.

How do I make my metadata machine-readable?

Provide it as DataCite XML or JSON, and embed schema.org Dataset markup in any landing page. Aggregators like Google Dataset Search read schema.org, while DOI infrastructure reads DataCite.
