Troubleshooting: Fine-tune BERT on historical text

When you fine-tune BERT on historical text and results disappoint, the cause is almost always one of four things: a learning rate that is too high, OCR noise wrecking the tokenizer, the wrong base checkpoint, or label problems. Work through them in that order. Most "BERT does not work on my sources" reports resolve at the first or second step, without any change to the model architecture.

Why does training collapse to one prediction?

A model that predicts the majority class for every input is the most common failure. Three usual causes:

Learning rate too high. Use 2e-5 to 5e-5; anything larger often collapses training on small digital humanities (DH) datasets.
Class imbalance. Pass class weights into the loss.
Everything frozen. Confirm the encoder's parameters are actually trainable.

python

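from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-historical",  # placeholder path; older transformers versions require it
    learning_rate=2e-5,            # low end of the 2e-5 to 5e-5 range
    per_device_train_batch_size=16,
    num_train_epochs=4,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=False,   # turn back on once stable
)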
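For the class-imbalance cause, the weights have to come from somewhere. The following is a minimal sketch, assuming the common "balanced" heuristic of n_samples / (num_labels * class_count); the function name `inverse_frequency_weights` is hypothetical and the label list is toy data.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_labels):
    """Return one weight per class, proportional to 1 / class frequency.

    Uses the "balanced" heuristic n_samples / (num_labels * count).
    Classes that never occur in `labels` get weight 0.0 to avoid dividing by zero.
    """
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_labels * counts[c]) if counts[c] else 0.0
        for c in range(num_labels)
    ]


# Toy example: three examples of class 0, one of class 1.
weights = inverse_frequency_weights([0, 0, 0, 1], num_labels=2)
print(weights)  # the rare class receives the larger weight
```

In a PyTorch setup, a list like this can be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`. With the Hugging Face `Trainer`, that usually means overriding `compute_loss` in a subclass.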

Why is my loss NaN?

NaN means something broke numerically. Check, in order: a label index outside your label set, a learning rate that is too high, or mixed-precision overflow. To test for overflow, disable fp16: if the NaN disappears, overflow was the cause, and you can re-enable fp16 with gradient clipping.

python

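# guard against out-of-range labels before training
# (token tasks often pad with -100 as an ignore index; filter those out first)
assert min(all_labels) >= 0, "negative label index"
assert max(all_labels) < num_labels, "label index exceeds num_labels"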
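The assertion above says that something is wrong but not where. A small sketch that reports the offending positions can speed up the fix. The helper name `find_bad_labels` is hypothetical, and the `ignore_index=-100` default is an assumption that matches the default ignore index of PyTorch's cross-entropy loss.

```python
def find_bad_labels(labels, num_labels, ignore_index=-100):
    """Return (position, label) pairs for labels outside [0, num_labels).

    Labels equal to `ignore_index` are skipped, since token-level tasks
    often use it to mask padding and sub-token positions.
    """
    return [
        (i, y)
        for i, y in enumerate(labels)
        if y != ignore_index and not (0 <= y < num_labels)
    ]


# Toy label list: index 3 holds label 3, which is invalid for a 3-class task.
bad = find_bad_labels([0, 2, -100, 3, 1], num_labels=3)
print(bad)  # [(3, 3)]
```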

Which base checkpoint should I start from?

This single choice often decides the project. A modern bert-base tokenises archaic words and spellings, such as thou and quoth, into junk subwords. Use a domain checkpoint when one exists.

Language / period	Recommended base
Early Modern English	MacBERTh
Historical German	a historical/dbmdz German BERT
Latin	LatinBERT
Multilingual / unknown	XLM-RoBERTa, then continue pretraining

If no domain model exists, continue masked-language-model pretraining on your unlabelled historical text first, then fine-tune.
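One rough way to compare candidate checkpoints before training is to measure how heavily each tokenizer fragments your text. The sketch below computes the average number of subword pieces per word (sometimes called fertility). The function name and the toy tokenizer are hypothetical stand-ins, and the metric is a heuristic, not a guarantee of downstream accuracy.

```python
def subword_fertility(words, tokenize):
    """Average number of subword pieces per word; 1.0 means no splitting."""
    words = [w for w in words if w]
    if not words:
        return 0.0
    pieces = sum(len(tokenize(w)) for w in words)
    return pieces / len(words)


# Toy stand-in tokenizer (hypothetical): known words stay whole,
# unknown words are split into single characters.
VOCAB = {"the", "king", "said"}


def toy_tokenize(word):
    return [word] if word in VOCAB else list(word)


sample = "the king quoth".split()
fertility = subword_fertility(sample, toy_tokenize)
print(round(fertility, 2))  # "quoth" splits into 5 pieces, raising the average
```

With the transformers library, the `tokenize` method of a tokenizer loaded via `AutoTokenizer.from_pretrained` can be passed in place of `toy_tokenize`. A noticeably lower value for one checkpoint on a sample of your corpus suggests its vocabulary fits the period better.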

Why does BERT lose to a TF-IDF baseline?

It happens, and it is diagnostic rather than embarrassing. Two root causes:

OCR noise fragments words into meaningless subwords, so the embeddings carry little signal. Fix the upstream OCR or normalise spelling.
Domain shift is too large for a modern checkpoint. Switch base model or do continued pretraining.

Run the cheap baseline first, every time. If a logistic-regression-on-TF-IDF model already hits 0.85 F1, BERT must clearly beat that to justify its cost.
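A baseline of this kind takes only a few lines with scikit-learn. This is a minimal sketch, not a tuned setup: the character n-gram settings are an assumption (character n-grams tend to tolerate OCR errors and spelling variation better than word tokens), and the texts and labels are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline():
    """TF-IDF over character n-grams feeding a logistic regression."""
    return make_pipeline(
        # char_wb builds n-grams inside word boundaries only
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), sublinear_tf=True),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )


# Toy data, invented for illustration only.
texts = [
    "I bequeath my goods to my sonne",
    "I bequeath my landes to my wyfe",
    "the deponent saith he saw the accused",
    "the deponent saith she heard the quarrel",
]
labels = ["will", "will", "deposition", "deposition"]

baseline = build_baseline()
baseline.fit(texts, labels)
preds = baseline.predict(texts)
print(list(preds))
```

On a real corpus, score the pipeline on held-out data, for example with `sklearn.model_selection.cross_val_score` and `scoring="f1_macro"`, and use that number as the bar BERT has to clear.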

How do I handle sequences longer than 512 tokens?

Historical documents — wills, registers, depositions — routinely exceed BERT's 512-token limit. Options:

Chunk into sliding windows that overlap their neighbours by 50 to 100 tokens, then aggregate the per-chunk predictions.
Use a long-context model (Longformer) for document-level tasks.
For token tasks like NER, chunk at sentence boundaries to avoid splitting entities.
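The overlapping-chunk option can be sketched in plain Python. The function names are hypothetical, `max_len=510` assumes two positions are reserved for BERT's [CLS] and [SEP] tokens, and mean pooling is only one possible way to aggregate chunk-level scores.

```python
def sliding_windows(token_ids, max_len=510, overlap=64):
    """Split token ids into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens, so text cut at one
    window boundary appears intact in the neighbouring window.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    stride = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return windows


def mean_pool(chunk_scores):
    """Average per-chunk class scores into one document-level score vector."""
    n = len(chunk_scores)
    return [sum(column) / n for column in zip(*chunk_scores)]


# Toy document of 1200 token ids.
chunks = sliding_windows(list(range(1200)), max_len=510, overlap=64)
print([len(c) for c in chunks])  # [510, 510, 308]

doc_scores = mean_pool([[0.2, 0.8], [0.6, 0.4]])
```

Hugging Face tokenizers can also produce overlapping chunks directly when called with `truncation=True`, `stride` and `return_overflowing_tokens=True`, which avoids handling special tokens by hand.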

What does a sane debugging loop look like?

Overfit a tiny batch first. If the model cannot reach near-zero loss on 16 examples, the bug is in your data or label pipeline, not the hyperparameters.

python

# sanity check: can the model memorise 16 examples?
tiny = train_ds.select(range(16))
# train for 50 epochs on `tiny`; loss should approach 0

This one trick localises the vast majority of fine-tuning bugs before you waste GPU hours.

Key Takeaways

Use learning rate 2e-5 to 5e-5; higher rates collapse small DH datasets.
NaN loss means bad label indices, too-high rate, or fp16 overflow — check in that order.
Pick a period-appropriate base checkpoint; modern BERT mangles archaic vocabulary.
If no domain model exists, continue MLM pretraining before fine-tuning.
Always run a cheap TF-IDF baseline; OCR noise can make BERT lose to it.
Chunk documents over 512 tokens at sentence boundaries for token tasks.
Overfit 16 examples first to localise data versus hyperparameter bugs.

Frequently Asked Questions

Why does my fine-tuned BERT predict the same class for everything?

This is usually a learning-rate or class-imbalance problem. Drop the learning rate to around 2e-5, add class weights to the loss, and confirm your labels are not all the majority class in the training split.

Should I use a domain-specific BERT for historical text?

Yes when one exists for your language and period, such as MacBERTh for Early Modern English or a historical BERT for German. Domain pretraining handles archaic vocabulary that a modern checkpoint tokenises into nonsense subwords.

Why is my loss NaN during fine-tuning?

NaN loss almost always means the learning rate is too high, there are bad labels (an index outside the label set), or mixed-precision overflow. Lower the rate, validate label indices, and try disabling fp16 to isolate the cause.

How much data do I need to fine-tune BERT on historical text?

A few thousand labelled examples can work for classification, while token-level tasks like NER often need more. If data is scarce, continue pretraining on unlabelled historical text first, then fine-tune on your small labelled set.

Why does BERT do worse than a simple baseline on my historical corpus?

Heavy OCR noise and out-of-vocabulary subword fragmentation can make BERT underperform a TF-IDF baseline. Clean and normalise the text, or use a character-aware model, before concluding BERT is the wrong tool.
