Best Practices for Choosing Between Transformer and Rule-Based NLP

Choose rule-based NLP when the target is a closed, regular pattern you can fully describe, you need auditable and reproducible output, and you have little or no training data. Choose a transformer when the task is fuzzy, context-dependent, or open-vocabulary, such as recognising people and places across variable spelling, and you can supply a few hundred annotated examples. In practice the strongest pipelines for historical text are hybrid: a transformer for the messy judgement calls, deterministic rules for validation and normalisation.

What problem are you actually solving?

Before comparing methods, classify the task. Extracting a regnal year from "in the third year of the reign of Henry VI" is a bounded problem with finite phrasing and a verifiable answer. Identifying which "Mr. Smith" a letter refers to across a 4,000-document corpus is unbounded and context-heavy. Rules excel at the first; transformers excel at the second.

A quick litmus test: if you can imagine writing the extraction logic as a flowchart in an afternoon, rules will probably outperform a model and cost less to maintain.
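
As a rough illustration of that litmus test, the regnal-year case fits in a few lines of Python. This is a minimal sketch, not a complete parser: the `ORDINALS` table, the `REGNAL` pattern, and the function name `extract_regnal_year` are all assumptions made for this example, and the pattern covers only the single phrasing quoted above.

```python
import re

# Hypothetical, deliberately small ordinal table; a real one would go further.
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
}

# Assumed phrasing: "in the <ordinal> year of the reign of <Name> [<numeral>]"
REGNAL = re.compile(
    r"in the (?P<ordinal>\w+) year of the reign of "
    r"(?P<monarch>[A-Z][a-z]+(?: [IVX]+)?)"
)


def extract_regnal_year(text):
    """Return (regnal_year, monarch), or None when the rule does not apply."""
    match = REGNAL.search(text)
    if match is None:
        return None
    ordinal = ORDINALS.get(match.group("ordinal").lower())
    if ordinal is None:
        # Unknown ordinal: refuse to guess rather than emit a wrong year.
        return None
    return ordinal, match.group("monarch")


print(extract_regnal_year("in the third year of the reign of Henry VI"))
```

Anything the pattern or the table does not cover comes back as `None`, so gaps in the rule show up as missing values rather than as plausible-looking wrong ones.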

How do the two approaches compare on the dimensions that matter?

Dimension	Rule-based	Transformer
Training data needed	None	~200-500 labelled sentences
Reproducibility	Deterministic	Needs pinned weights/seed
Handles spelling variation	Poorly without lexica	Strongly
Auditability	High (read the rule)	Low (opaque weights)
Upfront cost	Analyst time	Annotation + GPU
Marginal cost per page	Near zero	Low but nonzero
Failure mode	Misses unseen patterns	Plausible-looking errors

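The most dangerous failure is a transformer producing confident, well-formed, wrong output, for example a hallucinated death date. Rules tend to fail loudly, returning nothing when no pattern matches; transformers fail quietly, because a wrong answer looks just like a right one.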

Why not always use a transformer?

Three reasons rooted in archival reality. First, provenance: a reviewer can ask "why did the system tag this token as a place?" A rule answers "it matched /shire$/" or "it is listed in the gazetteer"; a model answers "the weights said so." Second, drift: upstream model updates change outputs between runs unless you freeze the version. Third, scale of data: with only a handful of documents, there is not enough signal to fine-tune, and few-shot prompting plus rules is more controllable.
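
The provenance point can be made concrete with a rule-based tagger that returns its reason along with its decision. This is only a sketch under stated assumptions: the three-entry `GAZETTEER`, the suffix rule, and the function `tag_place` are hypothetical stand-ins for whatever lexica a project actually maintains.

```python
import re

# Hypothetical gazetteer; a real project would load a curated list.
GAZETTEER = {"Kent", "Yorkshire", "Lancashire"}
SHIRE_SUFFIX = re.compile(r"shire$")


def tag_place(token):
    """Return (is_place, reason) so every decision can be audited."""
    if token in GAZETTEER:
        return True, "gazetteer entry"
    if SHIRE_SUFFIX.search(token):
        return True, "matched /shire$/"
    return False, "no rule matched"


for token in ["Kent", "Wiltshire", "London"]:
    print(token, tag_place(token))
```

The miss on "London" is as informative as the hits: the reason string tells a reviewer exactly which rule would need extending.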

When does a transformer clearly win?

When variation overwhelms enumeration. Consider normalising early modern English where "love" appears as loue, louue, lufe. A finite rule set fights a losing battle against the long tail, whereas a model trained on aligned pairs generalises. Likewise for syntactic tasks like dependency parsing of inflected Latin, where word order is free and rules become unmanageable.
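
To see why enumeration struggles, consider the simplest possible rule-based normaliser: a lookup table. The sketch below assumes a hypothetical `VARIANTS` table holding only the three spellings mentioned above; the unlisted spelling "luve" is included purely to illustrate the long tail.

```python
# Hypothetical variant table built from the spellings seen so far.
VARIANTS = {"loue": "love", "louue": "love", "lufe": "love"}


def normalise(token):
    """Map a known variant to its modern form; pass anything else through."""
    return VARIANTS.get(token.lower(), token)


print([normalise(t) for t in ["loue", "Lufe", "luve"]])
```

Every new spelling needs a new table entry, and the table never tells you which entries are still missing. That is the gap a model trained on aligned pairs is meant to close.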

A practical decision checklist

Work through this in order and stop at the first decisive answer:

text

1. Is the output finite and enumerable?        -> rules
2. Do you have <100 labelled examples?         -> rules or few-shot
3. Is auditability legally/editorially required? -> rules for that step
4. Is spelling/word-order variation high?       -> transformer
5. Do you have 200+ labelled examples + GPU?    -> fine-tune transformer
6. Mix of the above?                            -> hybrid (most projects)
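
The checklist can also be written down as a function, which makes the "stop at the first decisive answer" ordering explicit. This is a sketch of the list above and nothing more: the function name `choose_method` and its parameter names are invented for illustration, and the thresholds are the rough ones from the checklist.

```python
def choose_method(finite_output, n_labelled, audit_required,
                  high_variation, has_gpu):
    """Walk the checklist in order and return the first decisive answer."""
    if finite_output:
        return "rules"
    if n_labelled < 100:
        return "rules or few-shot"
    if audit_required:
        return "rules for that step"
    if high_variation:
        return "transformer"
    if n_labelled >= 200 and has_gpu:
        return "fine-tune transformer"
    return "hybrid"


print(choose_method(finite_output=False, n_labelled=300,
                    audit_required=False, high_variation=False,
                    has_gpu=True))
```

Applying it once per task rather than once per project tends to produce a mix of answers across steps, which is how most pipelines end up hybrid.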

A minimal hybrid skeleton in spaCy makes the layering explicit:

python

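import re

import spacy
from spacy.language import Language
from spacy.tokens import Span

# custom attribute: every entity counts as valid until a rule rejects it
Span.set_extension("is_valid", default=True, force=True)

def extract_year(text):
    # illustrative helper, not part of spaCy: first 3- or 4-digit number
    match = re.search(r"\b\d{3,4}\b", text)
    return int(match.group()) if match else None

@Language.component("validate_dates")
def validate_dates(doc):
    for ent in doc.ents:
        if ent.label_ == "DATE":
            # deterministic guardrail: reject impossible years
            year = extract_year(ent.text)
            if year is not None and not (1000 <= year <= 1900):
                ent._.is_valid = False
    return doc

nlp = spacy.load("en_core_web_trf")  # transformer for NER
nlp.add_pipe("validate_dates", last=True)  # rules run after the model

# transformer proposes, rules dispose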

How do you keep the choice defensible across a whole collection?

Document the decision per task, not per project. Record which steps are rule-based, which are model-based, the model version and seed, and the validation rules. Store this alongside the data so a future curator can reproduce or challenge any single extraction. Re-run a held-out gold set after any dependency upgrade to catch silent drift.
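
One way to keep such a record is a small machine-readable manifest plus a gold-set check that runs after every upgrade. The sketch below uses only the standard library; the `StepRecord` fields, the `drifted` helper, and the example values (including the model name and seed) are hypothetical, chosen to mirror the items listed above rather than any established schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StepRecord:
    """One pipeline step and the facts needed to reproduce it."""
    task: str
    method: str                          # "rules" or "model"
    model_version: Optional[str] = None  # only meaningful for model steps
    seed: Optional[int] = None
    validation_rules: List[str] = field(default_factory=list)


def drifted(gold, predict):
    """Return the gold inputs whose current prediction no longer matches."""
    return [text for text, expected in gold.items()
            if predict(text) != expected]


# Hypothetical example values, for illustration only.
record = StepRecord(
    task="date extraction",
    method="model",
    model_version="example-model-1.0",
    seed=13,
    validation_rules=["1000 <= year <= 1900"],
)
manifest = json.dumps([asdict(record)], indent=2)
print(manifest)
```

Storing the manifest next to the data, and failing the build whenever `drifted` returns a non-empty list, turns silent drift into a visible event.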

Key Takeaways

Match the method to the task shape: bounded and regular favours rules, fuzzy and open-vocabulary favours transformers.
Hybrid pipelines, transformer-then-rules, give you generalisation and guardrails.
Rules fail loudly and are auditable; transformers fail quietly with plausible errors.
Below ~100 annotated examples, rules or few-shot prompting usually beat fine-tuning.
Pin model version, seed, and tokeniser to keep transformer output reproducible.
Compare total cost over your full page count, not per-document convenience.
Always validate model output with deterministic sanity checks before publishing.

Frequently Asked Questions

When is a rule-based approach better than a transformer for historical text?

Rules win when the pattern is closed and regular, such as regnal dates, currency notations, or a fixed list of place spellings. They are auditable, need no training data, and never hallucinate, so they are the safer default for high-stakes structured extraction.

Do transformers need a GPU for historical NLP?

For inference on a few thousand pages a modern CPU is usually fine, just slow. Fine-tuning a BERT-style model is far more comfortable with a GPU, and the free GPUs offered by Colab or Kaggle are often enough for humanities-scale corpora, though run time depends on model size and corpus length.

Can I combine rules and transformers in one pipeline?

Yes, and hybrid pipelines are often the best choice. A common pattern is a transformer for fuzzy tasks like named-entity recognition, followed by deterministic rules that validate, normalise, and reject impossible outputs such as a date in the year 3000.

How much annotated data does a transformer need for historical text?

Fine-tuning a pretrained model for a sequence-labelling task is often viable with 200 to 500 carefully annotated sentences, far less than training from scratch. Below roughly 100 examples, rules or few-shot prompting usually beat a fine-tuned model.

Are transformers reproducible enough for scholarly publication?

They can be, if you pin the model version, random seed, library versions, and tokeniser, and archive the exact weights. Without that, model updates silently change outputs, which is why many editions still prefer rules for the load-bearing extraction steps.

What about cost over a whole collection?

Rules have near-zero marginal cost once written but high upfront analyst time. Transformers shift the cost rather than remove it: upfront annotation and GPU time instead of analyst time, a low but nonzero per-page cost, and ongoing maintenance. Estimate both over your full page count before committing.
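
The comparison is simple arithmetic: upfront cost plus per-page cost times page count. The sketch below assumes both costs are expressed in the same unit; the function names and all figures are hypothetical and exist only to show the calculation.

```python
def total_cost(upfront, per_page, pages):
    """Upfront cost plus marginal cost over the whole collection."""
    return upfront + per_page * pages


def cheaper_option(pages, rules, model):
    """rules and model are (upfront, per_page) pairs in the same unit."""
    rules_total = total_cost(*rules, pages)
    model_total = total_cost(*model, pages)
    return "rules" if rules_total <= model_total else "transformer"


# Hypothetical figures, for illustration only.
RULES = (2000.0, 0.0)    # heavy analyst time, near-zero marginal cost
MODEL = (1200.0, 0.25)   # annotation + GPU setup, small per-page cost

for pages in (1000, 10000):
    print(pages, cheaper_option(pages, RULES, MODEL))
```

With these made-up numbers the answer flips between the small and the large collection, which is why the estimate should use the full page count.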
