Best Practices for Using Stylometry in Authorship Attribution

To use stylometry for authorship defensibly, build balanced same-length samples per candidate author, analyse the most frequent function words with Burrows's Delta, validate the method on texts of known authorship first, and report results as ranked likelihoods rather than verdicts. The discipline lies in the controls (sample size, genre matching and pre-registered settings) far more than in the algorithm. Treat every attribution as a hypothesis you tried hard to disprove.

Stylometry attributes authorship by measuring quantifiable stylistic fingerprints, chiefly the unconscious rates at which authors use common function words. Because nobody consciously regulates how often they write 'the' or 'of', these frequencies are remarkably stable within an author and distinctive between authors.
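As a rough illustration of what "function-word rates" means in practice, the Python sketch below counts a handful of function words and divides by the total token count. The word list, the tokeniser and the function names are assumptions made for this example only; a real study would use a much longer list and the same tokenisation for every text.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption); real studies use far more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def relative_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Return each vocabulary word's count divided by the total token count."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in vocabulary}
    return {word: counts[word] / total for word in vocabulary}

sample = "The end of the road is the start of a journey."
freqs = relative_frequencies(sample)
print(freqs["the"], freqs["of"])
```

On a sentence this short the rates are meaningless as evidence; the point is only that each text becomes a vector of comparable proportions.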

What makes an authorship study defensible?

A defensible study is reproducible and falsifiable. That means: documented preprocessing, a fixed feature set chosen before you see results, a validation step on known-author texts, and explicit uncertainty. If you tuned the settings until the disputed text matched your favoured candidate, you have produced a result that cannot survive review.

How should I prepare my samples?

Balance is everything. The checklist:

Equal length — chunk every text to the same word count (e.g. 5,000) so longer works do not dominate.
Several samples per author — single samples cannot show within-author variation.
Matched genre and date — compare letters with letters, not letters with sermons.
Clean, consistent text — same tokenisation and case-folding everywhere.

library(stylo)
# stylo expects one file per text in a 'corpus' folder
# Sample naming convention encodes author and work:
#   Austen_Emma.txt, Austen_Persuasion.txt, Anon_Disputed.txt

The Author_Title.txt naming convention lets stylo group samples by author automatically and keeps your design transparent.
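The equal-length rule from the checklist can be sketched in a few lines of Python. This is a minimal illustration, not part of stylo: `chunk_words` is a hypothetical helper, and dropping the trailing remainder is one possible design choice among several.

```python
def chunk_words(tokens, size=5000):
    """Split a token list into consecutive chunks of exactly `size` tokens.

    The trailing remainder is dropped so that every sample has equal length.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    full_chunks = len(tokens) // size
    return [tokens[i * size:(i + 1) * size] for i in range(full_chunks)]

# Toy input: 12,500 identical tokens stand in for a real novel.
tokens = ("word " * 12500).split()
chunks = chunk_words(tokens, size=5000)
print(len(chunks), [len(c) for c in chunks])
```

Discarding the remainder loses some text but guarantees that no sample is shorter than the others, which is what keeps the frequencies comparable.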

Which features and distance measure should I use?

Start with the canonical configuration and only deviate with justification.

Feature set	Distance	Good for
100-300 most frequent words	Burrows's Delta	General attribution baseline
Character 3-4 grams	Cosine Delta	Noisy OCR, spelling variation
Function words only	Delta	Cross-genre, topic-resistant
Most frequent words + culling	Eder's Delta	Larger, uneven corpora

Burrows's Delta on the 100-300 most frequent words is the field's baseline. Cosine Delta (Smith and Aldridge) often outperforms classic Delta and is more robust to corpus imbalance — a reasonable default in 2025.
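To make the difference between the two measures concrete, here is a small Python sketch, independent of stylo, that z-scores word frequencies across a corpus and then compares two texts with classic Delta and with Cosine Delta. The function names and the toy frequencies are invented for illustration, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev

def z_scores(profiles):
    """Z-score each word's frequency across all texts in the corpus.

    `profiles` maps text name -> list of relative frequencies, one per word,
    in the same word order for every text. Needs at least two texts.
    """
    names = list(profiles)
    n_words = len(profiles[names[0]])
    scored = {name: [] for name in names}
    for j in range(n_words):
        column = [profiles[name][j] for name in names]
        mu, sigma = mean(column), stdev(column)
        for name in names:
            # A word with zero spread carries no signal, so score it as 0.
            z = (profiles[name][j] - mu) / sigma if sigma > 0 else 0.0
            scored[name].append(z)
    return scored

def burrows_delta(za, zb):
    """Mean absolute difference between two z-score vectors."""
    return sum(abs(a - b) for a, b in zip(za, zb)) / len(za)

def cosine_delta(za, zb):
    """Cosine distance (1 - cosine similarity) between two z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm = math.sqrt(sum(a * a for a in za)) * math.sqrt(sum(b * b for b in zb))
    return 1.0 - dot / norm if norm > 0 else 1.0

# Toy frequencies for three words in three texts (made-up numbers).
toy_profiles = {
    "A1": [0.060, 0.030, 0.025],
    "A2": [0.058, 0.031, 0.026],
    "B1": [0.045, 0.040, 0.020],
}
z = z_scores(toy_profiles)
print(burrows_delta(z["A1"], z["A2"]), burrows_delta(z["A1"], z["B1"]))
print(cosine_delta(z["A1"], z["A2"]), cosine_delta(z["A1"], z["B1"]))
```

Both measures start from the same z-scores; Cosine Delta compares the direction of the two vectors rather than the size of their differences, which is one common explanation for its robustness.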

How do I validate before trusting a result?

Never attribute the disputed text first. Run the pipeline on a closed set where every author is known and check it recovers the right answers:

stylo(gui = FALSE,
      mfw.min = 100, mfw.max = 300, mfw.incr = 50,
      culling.min = 0, culling.max = 0,
      distance.measure = "wurzburg",   # Cosine Delta
      analysis.type = "BCT")           # bootstrap consensus tree

If known authors cluster correctly across the MFW range, the method is calibrated for your corpus. Only then introduce the disputed text. A bootstrap consensus tree that re-samples feature ranges is far more trustworthy than a single dendrogram.
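The same "known answers first" idea can be checked numerically with a leave-one-out test. The Python sketch below uses a simple nearest-neighbour rule rather than stylo's bootstrap consensus tree, so treat it as a complementary sanity check; the function names and the toy vectors are assumptions for illustration.

```python
def mean_abs_distance(u, v):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

def leave_one_out_accuracy(samples, distance=mean_abs_distance):
    """Attribute each known sample to the author of its nearest other sample.

    `samples` is a list of (author, feature_vector) pairs of known authorship.
    Returns the fraction of samples attributed to their true author.
    """
    correct = 0
    for i, (author, vector) in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        nearest_author, _ = min(others, key=lambda s: distance(vector, s[1]))
        correct += nearest_author == author
    return correct / len(samples)

# Toy z-score vectors for four samples of known authorship (made-up numbers).
known = [
    ("Austen", [0.9, -0.5]),
    ("Austen", [1.0, -0.4]),
    ("Bronte", [-0.8, 0.6]),
    ("Bronte", [-1.0, 0.5]),
]
accuracy = leave_one_out_accuracy(known)
print(accuracy)
```

If this accuracy is poor on texts whose authors you already know, the configuration is not ready to be pointed at the disputed text.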

How do I report attribution honestly?

State the leading candidate, the runners-up, and the margin between them. Use rolling stylometry to detect collaboration or revision within a single text — it slides a window through the document and attributes each segment, exposing co-authored passages a whole-text Delta would hide. Always include the negative result: which candidates the method excludes is often the strongest finding.
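The sliding-window idea behind rolling stylometry can be sketched as follows. This Python example is a deliberate simplification: it compares raw relative frequencies with a mean absolute difference instead of a full Delta, and every name in it (`rolling_windows`, `profile`, `rolling_attribution`) is hypothetical.

```python
from collections import Counter

def rolling_windows(tokens, window=5000, step=1000):
    """Yield (start_index, window_tokens) for overlapping fixed-size windows."""
    for start in range(0, len(tokens) - window + 1, step):
        yield start, tokens[start:start + window]

def profile(tokens, vocabulary):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocabulary]

def rolling_attribution(tokens, candidates, vocabulary, window=5000, step=1000):
    """Label each window with the candidate whose profile is nearest.

    `candidates` maps author name -> reference token list.
    """
    refs = {name: profile(toks, vocabulary) for name, toks in candidates.items()}
    results = []
    for start, win in rolling_windows(tokens, window, step):
        p = profile(win, vocabulary)
        dist = {name: sum(abs(a - b) for a, b in zip(p, r)) / len(p)
                for name, r in refs.items()}
        results.append((start, min(dist, key=dist.get)))
    return results

# Toy data: the "disputed" text switches style halfway through.
vocab = ["the", "of"]
cands = {"X": ["the"] * 8 + ["of"] * 2, "Y": ["the"] * 2 + ["of"] * 8}
disputed = ["the"] * 8 + ["of"] * 2 + ["the"] * 2 + ["of"] * 8
segments = rolling_attribution(disputed, cands, vocab, window=10, step=10)
print(segments)
```

A change of label partway through the sequence is the signal to inspect; on real data, adjacent windows overlap, so isolated flips are usually noise rather than evidence of a second hand.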

What pitfalls undermine stylometric work?

Tiny samples (under roughly 2,000 words) give unstable frequencies and attributions that will not survive review.
Topic leakage — content words let subject matter masquerade as style; cull them.
Cherry-picked settings chosen after seeing the answer destroy credibility.
Mixed editions — modernised vs original spelling silently corrupts character n-grams.
Ignoring the validation step, so you never learn the method fails on your data.
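Culling, mentioned above as the remedy for topic leakage, keeps only words that occur in a minimum share of the texts, so that words tied to one book's subject drop out. A minimal Python sketch, assuming each text is already a list of tokens; `cull_vocabulary` is a hypothetical name, not a stylo function.

```python
def cull_vocabulary(texts, threshold=1.0):
    """Keep words that occur in at least `threshold` (0..1) of the texts.

    `texts` is a list of token lists. A threshold of 1.0 keeps only words
    found in every text; 0.0 keeps the whole vocabulary.
    """
    doc_sets = [set(t) for t in texts]
    vocab = set().union(*doc_sets)
    min_docs = threshold * len(doc_sets)
    return sorted(w for w in vocab if sum(w in d for d in doc_sets) >= min_docs)

# Toy corpus: each text has one topic word the others lack.
texts = [
    ["the", "whale", "of"],
    ["the", "of", "ball"],
    ["the", "of", "moor"],
]
shared = cull_vocabulary(texts, threshold=1.0)
print(shared)
```

With the threshold at 1.0 only the words shared by all three toy texts survive, which is why aggressive culling tends to leave mostly function words.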

Key Takeaways

Function-word frequencies are the core authorial signal because they are unconscious.
Build equal-length, multi-sample, genre-matched sets per candidate author.
Burrows's or Cosine Delta on the 100-300 most frequent words is the baseline.
Validate on known-author texts and use bootstrap consensus trees before attributing.
Use rolling stylometry to detect collaboration and revision within a text.
Report ranked likelihoods and exclusions, never proof; pre-register your settings.

Frequently Asked Questions

What features do stylometric methods actually measure?

Most rely on the relative frequencies of the most common function words ('the', 'of', 'and', 'but') because these are used unconsciously and resist deliberate imitation. Character n-grams and punctuation patterns are strong secondary features.

How much text do I need per author?

Aim for at least 5,000 words per sample and several samples per candidate author. Below roughly 2,000 words the function-word frequencies become unstable and attribution confidence collapses.

What is Burrows's Delta?

Burrows's Delta is the standard stylometric distance measure: it z-scores the frequencies of the most frequent words across the corpus and takes the mean of the absolute differences between two texts' z-scores. The candidate with the smallest Delta to a disputed text is the most likely author.
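Written out, with notation introduced here for illustration: let f_i(T) be the relative frequency of the i-th most frequent word in text T, and let mu_i and sigma_i be that word's mean and standard deviation across the corpus. The 1/n factor is the usual normalisation; dropping it gives a plain sum and leaves the ranking of candidates unchanged.

```latex
z_i(T) = \frac{f_i(T) - \mu_i}{\sigma_i},
\qquad
\Delta(A, B) = \frac{1}{n} \sum_{i=1}^{n} \bigl| z_i(A) - z_i(B) \bigr|
```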

Can stylometry prove who wrote something?

No. Stylometry produces ranked likelihoods, not proof. It is strongest at excluding candidates and confirming a leading hypothesis, and should always be reported with its uncertainty and validated on known texts.
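One way to keep a report honest is to print the whole ranking and the gap between the top two candidates instead of a single name. A minimal Python sketch; `rank_candidates` is a hypothetical helper and the distances are made-up numbers.

```python
def rank_candidates(distances):
    """Sort candidates by distance (smallest first) and report the lead margin.

    `distances` maps candidate name -> distance to the disputed text.
    Returns (ranking, margin), where margin is the gap between the two best
    candidates, or None if there is only one candidate.
    """
    ranking = sorted(distances.items(), key=lambda item: item[1])
    margin = ranking[1][1] - ranking[0][1] if len(ranking) > 1 else None
    return ranking, margin

# Toy distances from a disputed text to three candidates.
toy_distances = {"Austen": 0.82, "Bronte": 1.31, "Eliot": 1.05}
ranking, margin = rank_candidates(toy_distances)
for name, dist in ranking:
    print(f"{name}: {dist:.2f}")
print("margin between top two:", round(margin, 2))
```

A small margin relative to the spread seen among known same-author samples is a reason to report the result as inconclusive.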

Does genre or topic interfere with authorship signal?

Strongly. A letter and a treatise by the same author can look more different than two letters by different authors. Control for genre, register and date; culling content words also helps isolate authorial style from subject matter.

Which tool should I use?

The 'stylo' package for R is the de facto standard, offering Delta, cluster analysis, bootstrap consensus trees and rolling stylometry out of the box. Python users can replicate the core methods with scikit-learn.
