Text Mining & Corpora

To build a historical text corpus, decide on a research question, gather machine-readable text for each source as a UTF-8 .txt file, record provenance in a metadata CSV keyed by filename, then clean conservatively and version the result. The corpus is not the scans — it is the plain text plus the metadata table that links every file to its source. Budget a few days for a small focused corpus and weeks for anything spanning multiple archives.

A corpus differs from a document folder in one way: structure. Every decision about what to include, how to name it and what to record about it should be written down, because those decisions become the methodology section of whatever you publish.

What should I decide before collecting any text?

Three choices shape everything downstream:

  1. The question. "Did the word liberty change context in pamphlets between 1770 and 1800?" tells you the date range, genre and minimum size.
  2. The boundary. Define inclusion criteria you can apply consistently — language, date, document type, archive.
  3. The unit. Is one "document" a whole pamphlet, a single letter, or a page? This is your row in the metadata table and cannot easily change later.

Write these down before you download a single file. A corpus assembled without an inclusion rule is impossible to defend and impossible to reproduce.

How do I get machine-readable text from my sources?

Your sources arrive in one of four states, each with a different route to text:

| Source state | Route to text | Typical effort |
|---|---|---|
| Born-digital (EPUB, HTML) | Extract and strip markup | Low |
| Printed scans | OCR (Tesseract, ABBYY) | Medium |
| Manuscript scans | HTR (Transkribus, eScriptorium) | High |
| Existing editions (TEI) | Transform XML to plain text | Low–medium |

For printed material, tesseract page.tif out -l eng gives a first pass; for manuscripts you will train or apply an HTR model. Whatever the route, keep the OCR or HTR confidence data — you will need it to judge how much noise the corpus carries.
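One way to keep the confidence data is to ask Tesseract for TSV output as well (for example `tesseract page.tif out -l eng tsv`, which writes out.tsv) and summarise its conf column per page. The sketch below assumes the usual TSV layout, in which rows that are not words carry a conf of -1; the function name is illustrative, not part of any library.

```python
import csv
import io
from typing import Optional


def mean_word_confidence(tsv_text: str) -> Optional[float]:
    """Average the per-word confidence found in Tesseract TSV output.

    Assumes the TSV has `conf` and `text` columns; rows with a negative
    conf (page, block and line rows) or an empty text cell are skipped.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    scores = []
    for row in reader:
        try:
            conf = float(row["conf"])
        except (TypeError, ValueError):
            continue  # malformed or truncated row
        word = (row.get("text") or "").strip()
        if conf >= 0 and word:
            scores.append(conf)
    # None signals "no words recognised" rather than a score of zero
    return sum(scores) / len(scores) if scores else None
```

Storing this figure per file, for instance as an extra metadata column, makes it possible to exclude or flag the noisiest pages later without re-running OCR.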

How should I name files and structure the metadata?

Pick a filename scheme that is sortable and never changes. A stable ID beats a descriptive name:

text
corpus/
  texts/
    1773_pamphlet_0001.txt
    1773_pamphlet_0002.txt
    1781_letter_0003.txt
  metadata.csv
  README.md

The metadata table is the spine of the corpus:

csv
filename,title,author,year,genre,source,licence
1773_pamphlet_0001.txt,On Liberty,Anon,1773,pamphlet,BL C.123,public-domain
1781_letter_0003.txt,Letter to J.A.,E. Reed,1781,letter,TNA SP 1/4,public-domain

The filename column is the join key linking text to metadata for the rest of the project — keep it unique and immutable.
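A quick check can confirm that the join still holds after every build. The sketch below assumes the corpus/ layout shown above (a texts/ folder next to metadata.csv with a filename column); `check_join` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def check_join(corpus_dir: str) -> dict:
    """Compare texts/*.txt against the filename column of metadata.csv."""
    root = Path(corpus_dir)
    on_disk = {p.name for p in (root / "texts").glob("*.txt")}

    with open(root / "metadata.csv", newline="", encoding="utf-8") as fh:
        listed = [row["filename"] for row in csv.DictReader(fh)]

    seen = set()
    duplicates = set()
    for name in listed:
        if name in seen:
            duplicates.add(name)  # the same filename appears in two rows
        seen.add(name)

    return {
        "missing_metadata": sorted(on_disk - seen),  # text file with no row
        "missing_text": sorted(seen - on_disk),      # row with no text file
        "duplicate_rows": sorted(duplicates),
    }
```

Running `check_join("corpus")` should return three empty lists for a healthy corpus; anything else points to a renamed file, a missing row or a repeated key.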

How much cleaning should I do at build time?

Clean only what is unambiguous, and keep the raw text untouched in a separate folder. Safe build-time fixes:

python
import re

def light_clean(text: str) -> str:
    text = text.replace("\ufeff", "")          # strip stray BOM (U+FEFF)
    text = re.sub(r"-\n(\w)", r"\1", text)      # rejoin hyphenated line breaks
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces
    return text.strip()

Do not lowercase, remove stopwords or strip punctuation at build time. Those are analysis decisions and they vary by experiment. Bake them into your processing pipeline, not your stored corpus.
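To illustrate the split between stored corpus and analysis pipeline, here is a minimal sketch of an analysis-time tokeniser whose normalisation choices are parameters rather than permanent changes to the files. The function name, the simple regex tokenisation and the parameter names are all assumptions for illustration.

```python
import re


def tokens_for_experiment(text, lowercase=False, keep_punct=True,
                          stopwords=frozenset()):
    """Tokenise stored text with per-experiment normalisation options.

    The stored file is never modified; each experiment passes its own
    settings and can record them alongside its results.
    """
    if lowercase:
        text = text.lower()
    # \w+ matches word tokens; [^\w\s] matches single punctuation marks
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    return [t for t in re.findall(pattern, text)
            if t.lower() not in stopwords]
```

For example, a frequency count might call it with `lowercase=True, keep_punct=False`, while a study of capitalised abstract nouns would leave case alone, both reading the same files.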

What size and balance should the corpus aim for?

Match size to method, then check balance. A corpus that is 80% one author or one decade will produce findings about that dominance, not your question. Record the distribution early:

python
import pandas as pd
meta = pd.read_csv("metadata.csv")
print(meta.groupby("year").size())
print(meta.groupby("author").size().sort_values(ascending=False).head())

If one stratum swamps the rest, either downsample it, gather more of the others, or — most honestly — report the imbalance as a limitation.
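If downsampling is the chosen route, a reproducible sample matters more than a clever one. The following is a sketch using only the standard library; `cap_per_group`, the cap value and the seed are hypothetical choices, and the rows can come from `csv.DictReader` or from `meta.to_dict("records")`.

```python
import random
from collections import defaultdict


def cap_per_group(rows, key, cap, seed=0):
    """Keep at most `cap` rows per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)  # fixed seed so the sample can be rebuilt
    kept = []
    for value in sorted(groups):
        members = groups[value]
        if len(members) <= cap:
            kept.extend(members)          # small strata are kept whole
        else:
            kept.extend(rng.sample(members, cap))
    return kept
```

Record the cap and the seed in the README so the same subset can be regenerated from the full metadata table.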

What pitfalls most often wreck a corpus?

  • Survivorship bias: your corpus is what survived and was digitised, not what was written.
  • Silent OCR noise: misreadings like tbe, 1n and arid (for the, in and and) slip into token counts, and some, like arid, are real words that no spell-checker will flag.
  • Mutable filenames: renaming files after building breaks the metadata join.
  • No README: an undocumented corpus cannot be reproduced or trusted.
  • Mixed encodings: a single Latin-1 file in a UTF-8 corpus crashes downstream tools.
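The mixed-encoding pitfall is the easiest of these to catch automatically. A minimal sketch, assuming every text lives as a .txt file in one folder (the function name is illustrative):

```python
from pathlib import Path


def non_utf8_files(texts_dir: str) -> list:
    """Return the names of .txt files that do not decode as UTF-8."""
    bad = []
    for path in sorted(Path(texts_dir).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")  # strict decoding by default
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad
```

A file that passes is valid UTF-8, but this does not prove the text was decoded correctly upstream, so spot-check accented characters by eye as well.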

Key Takeaways

  • A corpus is plain text plus a metadata table, not a folder of scans.
  • Define the question, boundary and document unit before collecting anything.
  • Store one UTF-8 .txt file per document with a stable, immutable filename.
  • Keep raw text untouched; do only unambiguous cleaning at build time.
  • Size the corpus to the method and report any imbalance honestly.
  • Write a README and version the release so others can rebuild it.

Frequently Asked Questions

What is a historical text corpus?

A historical text corpus is a structured, machine-readable collection of historical documents — one plain-text file per item — paired with a metadata table that records provenance, date and source for each file. The structure is what separates a corpus from a folder of scans.

How many texts do I need for a usable corpus?

It depends on the method: concordance and frequency work can be done defensibly with a few dozen texts, while topic modelling or word vectors need hundreds of documents and several million words. Define your research question first, then size the corpus to the method.

Should I store the corpus as one big file or many small ones?

Use one plain-text file per document, named with a stable ID, plus a single metadata CSV keyed by filename. Many small files let you slice the corpus by date or genre without re-parsing one monolithic file.
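As an illustration of that slicing, the sketch below loads only the texts whose metadata rows match a genre and a year range. It assumes the corpus/ layout and the filename, year and genre columns used earlier in this guide; `select_texts` is a hypothetical helper.

```python
import csv
from pathlib import Path


def select_texts(corpus_dir, genre=None, start=None, end=None):
    """Return {filename: text} for rows matching the given filters."""
    root = Path(corpus_dir)
    with open(root / "metadata.csv", newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    selected = {}
    for row in rows:
        year = int(row["year"])
        if genre is not None and row["genre"] != genre:
            continue
        if start is not None and year < start:
            continue
        if end is not None and year > end:
            continue
        # Only matching files are opened; the rest are never read
        path = root / "texts" / row["filename"]
        selected[row["filename"]] = path.read_text(encoding="utf-8")
    return selected
```

For example, `select_texts("corpus", genre="pamphlet", start=1770, end=1780)` would read only the pamphlets from that decade.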

What encoding and format should corpus files use?

UTF-8 plain text (.txt) is the safe default for the text itself, because it survives tool changes and works well under version control. Keep any richer markup, such as TEI XML, in a separate set of files rather than using it as your working analysis files.

Can I redistribute the texts in my corpus?

Check the rights status of every source before redistributing text; pre-1900 material is usually public domain, but transcriptions and OCR output can carry their own database or edition rights. Record a licence field in your metadata for each item.

How do I document a corpus so others can reuse it?

Write a README that states the sources, date range, selection criteria, cleaning steps and known gaps, and version the whole corpus with a fixed release tag. Reproducibility depends on someone being able to rebuild it from your description.