How to Start with Distant Reading

To start with distant reading, assemble a plain-text corpus with a metadata table, clean the text and decide on tokenisation, then run one simple measurement — usually word frequencies over time — before reaching for anything statistical. The honest first result is a frequency chart you can trace back to real documents, not a topic model you cannot explain. Plan on a weekend to get from raw files to a defensible first chart.

Distant reading means studying literature or history at the scale of hundreds or thousands of texts by counting and modelling features, rather than reading each one closely. The phrase comes from Franco Moretti's 2000 essay "Conjectures on World Literature". The point is not to abandon close reading but to ask questions no human could answer by reading alone.

What exactly is distant reading, and what is it good for?

Distant reading trades depth for breadth. Where close reading interprets a single passage in detail, distant reading detects patterns across a corpus: which words rise and fall, which texts cluster by style, which themes co-occur. It answers comparative and diachronic questions — did sentiment in war diaries shift after 1916? — that no single document can settle.

It is poor at nuance, irony and one-off meaning. Treat every numeric result as a prompt to go back and read the texts driving it.

What do I need before I write any code?

Three things, in this order:

1. A corpus as one plain-text file per document, UTF-8 encoded.
2. A metadata table (CSV) keyed by filename: author, date, genre, source.
3. A research question narrow enough to falsify.

A minimal metadata file:

csv

filename,author,year,genre
diary_001.txt,Anon,1915,diary
diary_002.txt,Anon,1916,diary
letter_044.txt,E. Reed,1917,letter

Keep filenames stable — they are the join key between text and metadata for the rest of the project.
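Because filenames are the join key, it is worth checking early that the metadata table and the corpus folder agree. The sketch below assumes a CSV with a `filename` column and a folder of `.txt` files; the function name `check_join_key` and the folder layout are illustrative choices, not part of any library.

```python
import csv
from pathlib import Path


def check_join_key(metadata_path, corpus_dir):
    """Compare filenames in the metadata CSV with .txt files on disk.

    Returns (missing_files, missing_rows) as sorted lists:
    missing_files are listed in the CSV but absent from corpus_dir,
    missing_rows are on disk but have no metadata row.
    """
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        listed = {row["filename"] for row in csv.DictReader(handle)}
    on_disk = {path.name for path in Path(corpus_dir).glob("*.txt")}
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Calling `check_join_key("metadata.csv", "corpus")` should return two empty lists before any counting begins; anything else means a renamed, missing or undocumented file.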

How do I clean the text without destroying it?

Cleaning is where most projects quietly go wrong. The defaults below are safe for English-language sources:

python

import re

def normalise(text):
    text = text.lower()
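    text = re.sub(r"-\n", "", text)        # rejoin words hyphenated across line breaks (also joins true compounds split at a line end)
    text = re.sub(r"[^\w\s']", " ", text)   # replace punctuation with spaces, keeping apostrophes for contractions
    text = re.sub(r"\s+", " ", text)        # collapse whitespace last, after punctuation has become spaces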
    return text.strip()

Resist over-cleaning. Stripping all punctuation before sentence segmentation, or removing stopwords before you have looked at them, throws away signal you may later want. Always keep a copy of the raw text.
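One way to look at candidate stopwords before removing anything is to list the most frequent tokens and read the list yourself. This is a minimal sketch that works on strings already in memory; `top_tokens` is a hypothetical helper name and the two sample sentences are invented for illustration.

```python
from collections import Counter


def top_tokens(texts, n=20):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(n)


# Tiny in-memory sample standing in for real documents.
sample = ["the fear of the guns", "the mud and the fear"]
for token, count in top_tokens(sample, n=3):
    print(token, count)
```

Only after reading such a list does it make sense to decide which words are noise for your question; in some corpora a pronoun or a function word is exactly the signal.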

Which tool should I open first?

For a first look, nothing beats Voyant Tools: upload a set of texts and it returns frequencies, trends, collocations and a Cirrus word cloud in your browser. When you need reproducibility and version control, move to Python.

Tool	Setup	Best first use	Limit
Voyant Tools	None (web)	Instant exploration	Not reproducible
AntConc	Download	Concordances, keywords	Manual, GUI-only
Python + NLTK	`pip install nltk`	Scripted, repeatable	Steeper curve
R + tidytext	`install.packages("tidytext")`	Tidy frequency work	Requires learning R

How do I produce a first defensible result?

Do the simplest thing that answers your question. Word frequency over time is the canonical starting point:

python

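from collections import Counter
from pathlib import Path

import pandas as pd

# Assumes normalise() from the cleaning step is already defined and that the
# filenames in metadata.csv resolve relative to the working directory.
meta = pd.read_csv("metadata.csv")
rows = []
for r in meta.itertuples():
    words = normalise(Path(r.filename).read_text(encoding="utf-8")).split()
    if not words:  # skip empty files to avoid dividing by zero
        continue
    counts = Counter(words)
    rows.append({"year": r.year, "freq": counts["fear"] / len(words)})

# Mean relative frequency of "fear" per year (plotting requires matplotlib).
pd.DataFrame(rows).groupby("year")["freq"].mean().plot()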

Normalise by document length (relative frequency) rather than plotting raw counts; otherwise longer texts dominate. Then trace any spike back to the specific documents driving it and read them.
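Tracing a spike back to its sources can be sketched in two steps: rank documents by their relative frequency for the word, then pull keyword-in-context snippets from the top ones. Both helper names (`rank_documents`, `kwic`) and the `width` parameter are assumptions made for this illustration; AntConc and Voyant offer comparable concordance views without code.

```python
import re


def rank_documents(rel_freqs, top=5):
    """Sort a {filename: relative frequency} mapping, highest first."""
    return sorted(rel_freqs.items(), key=lambda item: item[1], reverse=True)[:top]


def kwic(text, keyword, width=40):
    """Return snippets with `width` characters either side of each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end].replace("\n", " "))
    return snippets
```

Run `kwic` on the raw text rather than the normalised version, so the snippets keep the punctuation and casing you need in order to read them properly.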

What pitfalls catch beginners first?

OCR noise masquerading as a trend: misreadings such as tbe (the), arid (and) and 1n (in) will pollute your counts.
Raw counts instead of relative frequencies, so long texts swamp short ones.
Survivorship bias: your corpus is what survived and was digitised, not what was written.
Modelling too early — a topic model on 30 texts is theatre, not analysis.
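A rough screen for the first pitfall is the share of tokens that do not appear in a reference word list, compared document by document. The sketch below uses a deliberately tiny, hypothetical vocabulary; a real check would load a full word list suited to the period and language of the corpus.

```python
def unknown_share(tokens, vocabulary):
    """Fraction of tokens absent from a reference vocabulary (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


# Hypothetical mini vocabulary; a real check would load a full word list.
vocabulary = {"the", "and", "in", "fear", "mud"}
clean = "the fear and the mud".split()
noisy = "tbe fear arid tbe mud".split()
print(unknown_share(clean, vocabulary), unknown_share(noisy, vocabulary))
```

Documents with an unusually high share deserve a look at the page images. Note the limit of this approach: "arid" is a genuine English word, so a full dictionary would not flag it, and real-word errors of that kind only show up when you read the contexts.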

Key Takeaways

Distant reading scales analysis to whole corpora; it complements close reading rather than replacing it.
Prepare plain UTF-8 texts plus a metadata CSV keyed by filename before coding.
Clean conservatively and always keep the raw text.
Start in Voyant for exploration, move to Python or R for reproducibility.
Make your first result a relative-frequency chart you can trace to documents.
Watch for OCR noise, survivorship bias and premature statistical modelling.

Frequently Asked Questions

What is distant reading in one sentence?

Distant reading is the practice of analysing many texts at once by counting, measuring and modelling features across a whole corpus, instead of close-reading each document individually. The term was coined by Franco Moretti in 2000.

How big does a corpus need to be for distant reading?

There is no hard minimum, but topic modelling generally needs hundreds of documents, and word vectors typically need millions of words, to behave sensibly. With under 50 texts, simple frequency and concordance work is more honest than statistical modelling.
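Before choosing a method, it helps to know which side of those thresholds a corpus falls on. A minimal sketch, assuming the documents are already loaded as strings and that whitespace tokens are a good enough approximation of words; `corpus_size` is a hypothetical helper name.

```python
def corpus_size(texts):
    """Return (number of documents, total whitespace-separated tokens)."""
    texts = list(texts)
    return len(texts), sum(len(text.split()) for text in texts)
```

For example, `corpus_size(["a b c", "d e"])` gives two documents and five tokens.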

What software should a beginner use first?

Start with Voyant Tools in a browser for instant exploration, then move to Python with NLTK or spaCy when you need reproducibility. Voyant requires no installation and surfaces frequencies, trends and collocations in minutes.

Do I need to read the texts before distant reading them?

Yes. Distant reading complements close reading rather than replacing it; you need enough familiarity with a sample to interpret what the numbers mean and to spot when a result is an OCR artefact rather than a real pattern.

How do I know my pattern is real and not noise?

Check it against a held-out subset of the corpus, plot it over time to see if it is stable, and trace the top contributing documents back to their text. A pattern that vanishes when you change tokenisation settings was probably noise.
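Two of these checks are easy to script: recomputing a relative frequency under different tokenisation settings, and comparing two random halves of the corpus. The sketch below is one possible shape for them; the regex patterns in `SETTINGS`, the function names and the fixed `seed` are all illustrative assumptions.

```python
import random
import re


def rel_freq(text, word, pattern):
    """Relative frequency of `word` among tokens matched by the regex `pattern`."""
    tokens = re.findall(pattern, text.lower())
    return tokens.count(word) / len(tokens) if tokens else 0.0


def split_half(items, seed=0):
    """Shuffle a copy of items with a fixed seed and return two halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Two illustrative tokenisation settings: with and without apostrophes.
SETTINGS = {"letters_only": r"[a-z]+", "keep_apostrophes": r"[a-z']+"}
```

If the relative frequency of a word shifts sharply between the two settings, or the trend appears in one random half and not the other, treat the pattern as fragile until the underlying documents say otherwise.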

Is distant reading only for literature?

No. Historians use it on parliamentary debates, archivists on finding aids, and curators on catalogue records. Any large, machine-readable text collection with consistent metadata is a candidate.
