When to Add Metadata to a Corpus

Add metadata to a corpus when a field will be used to subset, sort, cite, or filter your texts, or when a future reuser will need it to understand provenance and rights. Skip it for a throwaway exploratory pile or for attributes that are identical across every document, which you can record once at the corpus level. The deciding question is simple: will anyone query this field, or is it decoration?

When is metadata genuinely worth the cost?

Metadata is not free. Every field you add must be captured, validated, kept consistent and migrated whenever the corpus changes. That cost is justified when the field unlocks analysis you actually plan to run. If your research question is "how did vocabulary differ by region", then a region field is load-bearing and worth careful encoding. If you will never compare by region, recording it for completeness is busywork that you will then have to maintain.

What is the minimum viable metadata set?

For almost any research corpus, four fields earn their keep:

a stable identifier that never changes, so you can refer to a document unambiguously;
source and date, for citation and chronological analysis;
language, so multilingual material can be processed correctly;
rights status, so you never analyse or redistribute something you are not permitted to.

Everything beyond this is optional and should be justified by a concrete use.
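
A minimal sketch of how the core fields might be checked before a corpus grows, using only the standard library. The column names are assumptions (source and date are split into two columns here), and `find_metadata_problems` is a hypothetical helper, not part of any library:

```python
import csv
import io

# Hypothetical column names for the core fields; rename to match your table.
REQUIRED_FIELDS = ["doc_id", "source", "date", "language", "rights"]


def find_metadata_problems(rows):
    """Return readable problems: blank required fields and duplicate IDs."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                problems.append(f"line {line_no}: missing {field}")
        doc_id = (row.get("doc_id") or "").strip()
        if doc_id and doc_id in seen_ids:
            problems.append(f"line {line_no}: duplicate doc_id {doc_id}")
        seen_ids.add(doc_id)
    return problems


# Illustrative sample rows, not taken from a real corpus.
sample = io.StringIO(
    "doc_id,source,date,language,rights\n"
    "0001,Parish archive,1653,en,public-domain\n"
    "0001,Publisher scan,1721,en,\n"
)
problems = find_metadata_problems(csv.DictReader(sample))
print(problems)
```

Running a check like this on every commit is one way to keep the identifier stable and the rights column from silently going blank.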

Where should the metadata live: in the files or beside them?

Keep queryable metadata in a sidecar table rather than buried in prose. A simple CSV joins cleanly to your text on the identifier:

```csv
doc_id,title,author,date,language,rights
0001,A Sermon,Anon,1653,en,public-domain
0002,Travels,J. Smith,1721,en,in-copyright
```

You can validate it, version it in git, and load it in one line:

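```python
import pandas as pd

# Read doc_id as a string so leading zeros ("0001") survive the load.
meta = pd.read_csv("metadata.csv", dtype={"doc_id": str})
# Keep public-domain documents dated before 1700.
subset = meta[(meta["date"] < 1700) & (meta["rights"] == "public-domain")]
```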

Embedding the same fields inside each text file makes them harder to validate and far harder to query.
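
A sidecar table is only useful if it stays in step with the texts. The sketch below assumes one file per document named `<doc_id>.txt` in a single folder, which is a layout chosen for illustration; `check_join` is a hypothetical helper:

```python
import tempfile
from pathlib import Path


def check_join(doc_ids, text_dir):
    """Compare metadata IDs with the .txt files present in text_dir."""
    on_disk = {path.stem for path in Path(text_dir).glob("*.txt")}
    listed = set(doc_ids)
    return {
        "missing_text": sorted(listed - on_disk),      # row in the table, no file
        "missing_metadata": sorted(on_disk - listed),  # file with no row
    }


# Demo with throwaway files; the names and contents are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "0001.txt").write_text("A sermon text", encoding="utf-8")
    (Path(tmp) / "0003.txt").write_text("A stray file", encoding="utf-8")
    report = check_join(["0001", "0002"], tmp)

print(report)
```

Both lists should be empty before you run any analysis that joins text to metadata on the identifier.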

How do I decide on granularity?

Granularity should match your analytical unit. The table below shows the trade-off.

Granularity	Effort	Pays off when
Corpus-level only	very low	all documents share the attribute
Per document	moderate	you compare documents to each other
Per section/page	high	you study internal structure or zones
Per token/word	very high	you build training data or annotations

Most projects sit comfortably at per-document granularity. Reach for finer levels only when a specific method, such as page-level layout study, demands it.
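
One way to combine the first two rows of the table is to record shared attributes once and let per-document rows override them. This is a sketch under assumed field names; `CORPUS_DEFAULTS` and `resolve` are hypothetical:

```python
# Hypothetical corpus-level attributes, recorded once for every document.
CORPUS_DEFAULTS = {"language": "en", "collection": "Example pamphlets"}


def resolve(row, defaults=CORPUS_DEFAULTS):
    """Merge corpus-level defaults with one document's row.

    Non-empty values in the row win; blank or missing values fall back
    to the corpus-level default.
    """
    merged = dict(defaults)
    merged.update({k: v for k, v in row.items() if v not in ("", None)})
    return merged


record = resolve({"doc_id": "0002", "date": "1721", "language": ""})
print(record)
```

The per-document table then only needs columns for attributes that actually vary between documents.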

What are the real trade-offs of adding more fields?

Each extra field increases three costs: capture time, the chance of inconsistency, and migration burden when the schema changes. Against that, richer metadata buys sharper subsetting, better reproducibility and easier reuse by others. A useful break-even test is to ask whether a reasonable colleague reusing this corpus would expect the field. If yes, add it; if it only satisfies your sense of tidiness, leave it out.

Should I add metadata up front or retrofit later?

Capture cheap provenance fields at ingest, while the source is still in front of you and the cost is a few seconds per item. Retrofitting source and date across ten thousand files later can take weeks, because you have to re-open each original. Conversely, analytical fields you are unsure about can wait until the research question is firm, so you do not encode distinctions you never use.
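
Capturing provenance at ingest can be enforced rather than remembered. The sketch below refuses to register a document whose provenance fields are blank. The `Provenance` class, the `ingest` function and the sequential ID scheme are all illustrative assumptions:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Provenance:
    doc_id: str
    source: str
    date: str
    language: str
    rights: str


def ingest(registry, source, date, language, rights):
    """Register a document, refusing it if any provenance field is blank."""
    fields = {"source": source, "date": date, "language": language, "rights": rights}
    blank = [name for name, value in fields.items() if not str(value).strip()]
    if blank:
        raise ValueError(f"missing provenance at ingest: {', '.join(blank)}")
    # Sequential ID based on registry size; assumes entries are never deleted.
    doc_id = f"{len(registry) + 1:04d}"
    registry[doc_id] = Provenance(doc_id, source, str(date), language, rights)
    return doc_id


registry = {}
first = ingest(registry, "Parish archive", 1653, "en", "public-domain")
print(first, asdict(registry[first]))
```

Analytical fields you are still unsure about are deliberately absent from `Provenance`; they can be added as separate columns once the research question is firm.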

Does the schema need to follow a standard?

For a private working corpus, consistent local field names are perfectly fine. The moment you plan to publish, deposit in a repository, or let others harvest the data, adopt an interoperable vocabulary such as Dublin Core or a TEI header so machines and people elsewhere can interpret your fields without a phone call. Map your local columns to the standard at export time rather than forcing the standard on your day-to-day work.
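
Mapping at export time can be as small as a lookup table. The sketch below pairs the columns of the example CSV above with Dublin Core element names. The pairing is an assumption to adapt to your own schema, and `to_dublin_core` is a hypothetical helper:

```python
# Local column -> Dublin Core element. The pairing is an assumption;
# adjust it to what your own columns actually mean.
LOCAL_TO_DC = {
    "doc_id": "dc:identifier",
    "title": "dc:title",
    "author": "dc:creator",
    "date": "dc:date",
    "language": "dc:language",
    "rights": "dc:rights",
}


def to_dublin_core(row):
    """Rename mapped columns and drop local-only ones for export."""
    return {LOCAL_TO_DC[key]: value for key, value in row.items() if key in LOCAL_TO_DC}


local_row = {
    "doc_id": "0001", "title": "A Sermon", "author": "Anon",
    "date": "1653", "language": "en", "rights": "public-domain",
    "ocr_batch": "7",  # made-up local-only column, left out of the export
}
exported = to_dublin_core(local_row)
print(exported)
```

Keeping the mapping in one place means your working column names can change without touching the export format, and the reverse.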

Key Takeaways

Add a field only if it will be queried, cited, or needed to understand provenance and rights.
Four fields - identifier, source/date, language, rights - cover almost every research corpus.
Store queryable metadata in a sidecar CSV or database, not inside the prose.
Match granularity to your analytical unit; per-document suits most projects.
Capture cheap provenance at ingest; defer uncertain analytical fields until the question is firm.
Use local fields privately and map to Dublin Core or TEI only when you publish or share.

Frequently Asked Questions

When is it not worth adding metadata to a corpus?

When the corpus is a one-off exploratory dump you will discard after a week, or when every document shares identical attributes that you can record once at the corpus level instead of per file. Metadata is overhead, so only capture fields a real analysis or a reuser will query.

What is the minimum metadata a research corpus should carry?

A stable document identifier, source and date, language, and rights status. With those four you can subset reproducibly, cite accurately, and avoid analysing or redistributing material you have no right to use.

Should metadata live inside the text files or in a separate table?

Use a separate CSV or database for sortable, queryable fields, and embed only minimal structural markup in the text. A sidecar table is easier to validate, version and join than metadata scattered through prose.

How granular should metadata be?

Match granularity to the unit you will analyse. If you compare authors, you need author per document; if you only compare decades, a date field is enough. Over-fine metadata you never query is wasted effort.

Can I add metadata later instead of up front?

You can, but retrofitting is far more expensive once a corpus has thousands of files, and provenance details are easiest to capture at ingest while you still have the source in front of you. Capture the cheap fields early.

Does metadata need to follow a standard like Dublin Core?

For a private working corpus, consistent local fields are fine. Adopt Dublin Core or TEI headers when you plan to publish the corpus, deposit it in a repository, or let others harvest it, because that is when interoperability matters.
