Do stylometry in R with stylo: A Practical Guide

To do stylometry in R, install the stylo package, name your plain-text documents author_title.txt, drop them into a corpus/ folder, and run stylo(): a single call that tokenises the texts, counts the most frequent words, computes Burrows's Delta, and draws a cluster dendrogram. The method works because authors use common function words ("the", "of", "and") in stable, largely unconscious proportions that act as a stylistic fingerprint. This guide takes a historian's corpus from raw files to a defensible attribution.

How do I set up the corpus folder?

stylo is convention-driven. It reads every .txt file in a corpus/ subfolder of your working directory and infers the author label from the text before the first underscore.

project/
└── corpus/
    ├── Defoe_Crusoe.txt
    ├── Defoe_Roxana.txt
    ├── Swift_Tale.txt
    └── Anon_PamphletX.txt

install.packages("stylo")
library(stylo)
setwd("project")          # the folder that CONTAINS corpus/

Files must be plain UTF-8 text. Strip front matter, page numbers and editorial apparatus first — they are not the author's words and will skew the function-word counts.
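Before running anything, it can help to preview the author labels that the naming convention implies. The following is a minimal base-R sketch; check_corpus_names() is a hypothetical helper written for this guide, not part of stylo, and it only mimics the "text before the first underscore" rule described above.

```r
# Hypothetical helper (not a stylo function): show the author label each
# filename implies and flag files that have no underscore at all.
check_corpus_names <- function(files) {
  base <- basename(files)
  stem <- sub("\\.txt$", "", base)
  data.frame(
    file = base,
    author = sub("_.*$", "", stem),          # text before the first underscore
    has_underscore = grepl("_", base, fixed = TRUE),
    stringsAsFactors = FALSE
  )
}

# The example corpus from the folder listing above.
files <- c("corpus/Defoe_Crusoe.txt", "corpus/Defoe_Roxana.txt",
           "corpus/Swift_Tale.txt", "corpus/Anon_PamphletX.txt")
res <- check_corpus_names(files)
print(res)

# On a real project:
# check_corpus_names(list.files("corpus", pattern = "\\.txt$"))
```

Any row with has_underscore set to FALSE would be labelled by its whole filename, which is usually a sign the file needs renaming.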

How do I run a first analysis?

Call stylo() with no arguments for the interactive GUI, or set parameters directly for a reproducible script.

results <- stylo(
  gui = FALSE,
  corpus.dir = "corpus",
  analyzed.features = "w",        # words
  ngram.size = 1,
  mfw.min = 100, mfw.max = 100,   # 100 most-frequent words
  culling.min = 0, culling.max = 0,
  distance.measure = "dist.delta" # Burrows's Delta
)

The output dendrogram groups texts by stylistic similarity. If the anonymous pamphlet clusters tightly with the Defoe files across several settings, that is evidence — not proof — of authorship.

How do I know the result is not an artefact?

Sweep the parameters. A single MFW value can produce a flattering grouping by chance; a real signal survives variation.

Parameter	What to sweep	Why
MFW	100, 200, 300, 500	Stable grouping across MFW is convincing
Culling	0%, 20%, 50%	Tests reliance on rare shared words
Distance	Delta, Cosine Delta, Eder's Delta	Method-robustness check
N-gram	Single words vs character 2- to 3-grams	Word-level vs sub-word signal

stylo's bootstrap consensus tree (analysis.type = "BCT") automates this by aggregating many MFW settings into one robust tree — use it for the figure you publish.
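A sweep like the one in the table can be scripted roughly as follows. This is a sketch under assumptions: it expects stylo to be installed and a corpus/ folder in the working directory, run_sweep() and run_bct() are hypothetical wrapper names, and the specific values (including consensus.strength = 0.5) are illustrative choices rather than recommendations from the package.

```r
# Every combination of the MFW and culling values from the table above.
grid <- expand.grid(mfw = c(100, 200, 300, 500), culling = c(0, 20, 50))

# Hypothetical wrapper: one cluster analysis per row of the grid,
# written to PNG files instead of the screen.
run_sweep <- function(grid) {
  for (i in seq_len(nrow(grid))) {
    stylo::stylo(
      gui = FALSE, corpus.dir = "corpus",
      mfw.min = grid$mfw[i], mfw.max = grid$mfw[i],
      culling.min = grid$culling[i], culling.max = grid$culling[i],
      distance.measure = "dist.delta",
      display.on.screen = FALSE, write.png.file = TRUE
    )
  }
}

# Hypothetical wrapper: a bootstrap consensus tree over MFW 100..500.
run_bct <- function() {
  stylo::stylo(
    gui = FALSE, corpus.dir = "corpus",
    analysis.type = "BCT",
    mfw.min = 100, mfw.max = 500, mfw.incr = 100,
    consensus.strength = 0.5,
    distance.measure = "dist.delta"
  )
}

# run_sweep(grid)   # uncomment once corpus/ is in place
# run_bct()
```

Comparing the individual dendrograms by eye shows which groupings hold across settings; the consensus tree then summarises them in one figure.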

What about verification rather than clustering?

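When you have a candidate author and want a yes/no answer, use the General Imposters method via imposters(). It repeatedly compares the disputed text against the candidate and a pool of distractor authors, and returns a score between 0 and 1 (the share of iterations in which the candidate came out nearest) rather than a suggestive cluster. Scores in the middle of the range are inconclusive; imposters.optimize() can estimate that grey zone for your corpus. For single-author verification this is far more defensible than reading a dendrogram by eye.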
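A minimal sketch of a verification run, assuming stylo is installed. It uses galbraith, a word-frequency table bundled with the package, and treats one arbitrarily chosen row as the disputed text; with your own material you would first build a frequency table from your corpus. Unlike stylo(), imposters() works on frequency tables rather than on a folder of files.

```r
library(stylo)

# galbraith: a table of word frequencies shipped with stylo (rows = texts).
data(galbraith)

# Treat one row as the disputed text and the rest as the reference set.
# Row 8 is an arbitrary choice for illustration.
disputed  <- galbraith[8, ]
reference <- galbraith[-8, ]

# One score per author class in the reference set, each between 0 and 1.
scores <- imposters(reference.set = reference, test = disputed)
print(scores)
```

Because the procedure samples features and imposters at random, scores vary slightly between runs; report the number of iterations alongside the result.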

Can I find where collaborators switch within one text?

Yes — rolling.classify() slides a window across a single document and classifies each segment against known authors, exposing seams in co-authored or interpolated works.

rolling.classify(
  gui = FALSE,
  training.corpus.dir = "reference_set",  # folder of known-author texts
  test.corpus.dir = "test_set",           # folder holding the one disputed text
  slice.size = 5000, slice.overlap = 4500,
  classification.method = "delta"
)

The resulting strip plot shows where the attributed hand changes, which is invaluable for interpolations in manuscripts and pamphlets.
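The window settings determine how fine-grained that plot can be. The arithmetic can be sketched in base R; slice_starts() is a hypothetical helper that illustrates a generic sliding window, and stylo's own handling of the final partial slice may differ.

```r
# Hypothetical helper: first-word positions of each window for a text of
# n_words, using the same parameter names as rolling.classify().
slice_starts <- function(n_words, slice.size = 5000, slice.overlap = 4500) {
  step <- slice.size - slice.overlap   # words the window advances each time
  stopifnot(step > 0, n_words >= slice.size)
  seq(1, n_words - slice.size + 1, by = step)
}

starts <- slice_starts(20000)
length(starts)   # number of windows that would be classified
head(starts)     # where the first few windows begin
```

With a 5,000-word slice and 4,500-word overlap the window advances 500 words at a time, so a 20,000-word text yields 31 windows. A smaller overlap runs faster but locates a change of hand less precisely.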

What are the limits a historian must report?

Translation measures the translator; heavy OCR noise corrupts function-word counts; and texts under ~2,000 words give unstable results. State your text lengths, OCR error rate and parameter sweep alongside any claim. Stylometry produces evidence weighted by these caveats, never a verdict on its own.
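Text lengths are easy to tabulate before any analysis. The sketch below uses only base R; word_counts() is a hypothetical helper, it counts whitespace-separated tokens (a rough proxy for stylo's own tokenisation), and the 2,000-word threshold is the rule of thumb stated above.

```r
# Hypothetical helper: count whitespace-separated tokens per file and
# flag texts below a minimum length.
word_counts <- function(files, min_words = 2000) {
  n <- vapply(files, function(f) {
    txt <- paste(readLines(f, encoding = "UTF-8", warn = FALSE), collapse = " ")
    tokens <- strsplit(trimws(txt), "\\s+")[[1]]
    sum(nzchar(tokens))
  }, integer(1))
  data.frame(
    file = basename(files),
    words = unname(n),
    too_short = unname(n) < min_words,
    stringsAsFactors = FALSE
  )
}

# On a real project:
# word_counts(list.files("corpus", pattern = "\\.txt$", full.names = TRUE))
```

The resulting table can go straight into a methods appendix next to the parameter sweep.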

Key Takeaways

stylo reads author_title.txt files from a corpus/ folder automatically.
Burrows's Delta on the most frequent words is the default, well-tested method.
Strip editorial apparatus and page numbers — only the author's words should count.
Sweep MFW, culling and distance; a real signal survives parameter variation.
Use the bootstrap consensus tree for the figure you actually publish.
For yes/no attribution, prefer the General Imposters method over reading a dendrogram.
Report text lengths, OCR error and translation status — they bound every claim.

Frequently Asked Questions

What file structure does the stylo package expect?

Put plain-text files in a subfolder named corpus/ inside your working directory, named author_title.txt. stylo() reads that folder automatically and parses the author from the part before the first underscore.

How many most-frequent words (MFW) should I use?

Burrows's Delta typically uses the 100 to 1000 most frequent words. Start with a culling of 0 and MFW around 100-300; sweep the range to confirm your grouping is stable rather than an artefact of one setting.

Does stylometry work on translated or OCR'd texts?

Cautiously. Translation imposes the translator's style, so you may be measuring the translator, not the author. Heavy OCR noise distorts function-word frequencies, so clean the text and report error rates before trusting any attribution.

What is Burrows's Delta?

Burrows's Delta is a distance measure based on the z-scores of the most frequent words. Texts by the same author tend to have a small Delta distance; it is the default and most-tested method in stylo.
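To make the definition concrete, here is a hand computation in base R on a toy table. The frequencies are invented purely for illustration, and the sketch follows the usual textbook formulation (mean absolute difference of z-scores) rather than reproducing stylo's internal code.

```r
# Toy relative frequencies (percent): rows are texts, columns are words.
# All numbers are made up for illustration.
freqs <- rbind(
  Defoe_A = c(the = 6.1, of = 3.2, and = 3.9, to = 2.8),
  Defoe_B = c(the = 6.0, of = 3.3, and = 4.0, to = 2.7),
  Swift_A = c(the = 5.2, of = 3.9, and = 2.9, to = 3.1)
)

# Step 1: z-score each word (column) across the corpus.
z <- scale(freqs)

# Step 2: Delta between two texts is the mean absolute difference of
# their z-scores, i.e. Manhattan distance divided by the number of words.
delta <- as.matrix(dist(z, method = "manhattan")) / ncol(z)
print(round(delta, 2))
```

In this toy table the two Defoe rows end up closer to each other than either is to the Swift row, which is the pattern a dendrogram then draws as a shared branch.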

How long do texts need to be for reliable stylometry?

Function-word frequencies stabilise around a few thousand words. Below roughly 2,000 words results get noisy; for short documents combine related pieces or treat conclusions as tentative.

Can stylo do rolling analysis to find collaboration within one text?

Yes. rolling.classify() slides a window across a single document and classifies each segment, which can reveal where one hand stops and another begins in a co-authored or interpolated work.
