When word vectors for historical meaning go wrong, the cause is almost always one of three things: a corpus too small or noisy to support stable vectors, time-slice models that were never aligned into a common space, or stochastic training treated as if it were deterministic. Fix those three and most "weird results" disappear. This guide walks through the failures in the order you'll hit them.
Why are my nearest neighbours just OCR garbage?
Symptom: you query liberty and get back libcrty, 1iberty, hbcrty. Embeddings amplify rare tokens, and noisy OCR produces thousands of unique misspellings that cluster together.
Root cause and fix:
python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences,
    vector_size=200,
    window=5,
    min_count=20,  # drop tokens seen fewer than 20 times
    workers=4,
    seed=42,       # fixes initialisation; runs are exactly repeatable only with workers=1
)

Raising min_count from the default 5 to 20+ removes most OCR debris. If garbage persists, your corpus needs cleaning upstream: embeddings cannot rescue text the OCR mangled.
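To preview what a given threshold would remove before committing to a training run, a rough sketch like the one below counts token frequencies directly. It mirrors the idea behind min_count, not gensim's internal implementation; the helper `surviving_vocab` and the toy corpus are hypothetical.

```python
from collections import Counter

def surviving_vocab(sentences, min_count):
    """Return the set of tokens that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Hypothetical toy corpus: one clean spelling seen 25 times,
# plus three OCR variants that each appear once.
toy_sentences = [["liberty", "and", "law"]] * 25 + [
    ["libcrty", "and", "law"],
    ["1iberty", "and", "law"],
    ["hbcrty", "and", "law"],
]

kept = surviving_vocab(toy_sentences, min_count=20)
dropped = surviving_vocab(toy_sentences, min_count=1) - kept

print("kept:", sorted(kept))
print("dropped:", sorted(dropped))
```

On a real corpus, scanning the dropped set by eye is a quick way to judge whether the threshold is removing OCR debris or genuine rare vocabulary.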
Why do two time-slice models give incomparable vectors?
This is the single most common error. You train one model on 1800-1850 and another on 1850-1900, then compute the cosine distance of nation between them — and get nonsense. Each word2vec run lands in its own arbitrary rotation of the vector space, so coordinates from different runs are not comparable.
The fix is orthogonal Procrustes alignment: rotate one model's matrix onto the other using their shared vocabulary, which preserves all within-model distances.
python
import numpy as np

def align(base, other, shared_words):
    # Stack the shared-vocabulary vectors in the same word order for both models.
    A = np.stack([base.wv[w] for w in shared_words])
    B = np.stack([other.wv[w] for w in shared_words])
    # Orthogonal Procrustes: the R minimising ||B @ R - A|| is U @ Vt.
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt  # optimal orthogonal map (a rotation, possibly with a reflection)
    other.wv.vectors = other.wv.vectors @ R
    return other

Only after alignment is a sentence like "broadcast moved away from seed and toward radio between 1900 and 1940" measurable.
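A quick way to convince yourself that the alignment step behaves as described is to run it on synthetic data where the answer is known. The sketch below assumes nothing about gensim: `A` and `B` are hypothetical random matrices standing in for the shared-vocabulary vectors of two slices, with `B` constructed as `A` viewed through an unknown orthogonal transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two slices: 50 shared words, 8 dimensions.
A = rng.normal(size=(50, 8))                   # "base" vectors
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
B = A @ Q                                      # same geometry, different axes

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Before alignment, the same word's two vectors need not agree,
# even though nothing about the word has changed.
before = cosine(A[0], B[0])

# Orthogonal Procrustes, as in align(): R minimises ||B @ R - A||.
U, _, Vt = np.linalg.svd(B.T @ A)
R = U @ Vt
B_aligned = B @ R

after = cosine(A[0], B_aligned[0])

# Distances inside the "other" model are untouched by the transformation.
dist_before = np.linalg.norm(B[0] - B[1])
dist_after = np.linalg.norm(B_aligned[0] - B_aligned[1])

print(f"cosine before alignment: {before:.3f}")
print(f"cosine after alignment:  {after:.3f}")
print(f"within-model distance preserved: {np.isclose(dist_before, dist_after)}")
```

Because `B` differs from `A` only by an orthogonal map, the aligned cosine comes back as 1 up to floating-point error. With real time slices it will fall below 1, and that residual is the quantity a drift analysis interprets.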
How much text is enough?
| Tokens per slice | Recommended method | Reliability |
|---|---|---|
| < 1 million | Fine-tune a pretrained model | Train-from-scratch is unstable |
| 1-5 million | fastText with high min_count | Usable for frequent words |
| > 5 million | word2vec / fastText from scratch | Good |
fastText helps below the floor because subword units share statistics across rare spellings — useful for spelling-variable historical corpora.
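The subword effect can be illustrated without training anything. fastText builds a word's vector from its character n-grams (by default 3 to 6 characters, with `<` and `>` marking word boundaries), so two spellings that share n-grams share parameters. The sketch below only enumerates those n-grams; it ignores fastText's hashing of n-grams into buckets, and the helper names are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two spellings."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# An OCR variant still shares the n-grams on either side of the damaged letter.
shared = char_ngrams("liberty") & char_ngrams("libcrty")
print(sorted(shared))

print(ngram_overlap("liberty", "libcrty"))  # non-zero: shared subword units
print(ngram_overlap("liberty", "nation"))   # zero: nothing in common
```

The same mechanism that helps with spelling variation also pulls OCR errors towards their correct forms, which is one reason to keep a high min_count even with fastText.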
Why is cosine similarity unstable between runs?
Word2vec is stochastic: negative sampling and shuffling differ each run. A similarity of 0.61 one run and 0.48 the next is expected, not a bug. Train 5-10 models per slice with different seeds, then report the mean similarity and its spread. Any drift signal smaller than the run-to-run spread is noise.
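The comparison against run-to-run spread can be sketched in a few lines. Here "spread" is taken to be the sample standard deviation across seeds, and the multiplier `k` is an assumption you can tighten; the similarity values are hypothetical, not results from any corpus.

```python
from statistics import mean, stdev

def summarise(similarities):
    """Mean and sample standard deviation of one similarity across seeds."""
    return mean(similarities), stdev(similarities)

def exceeds_noise(sims_early, sims_late, k=1.0):
    """True if the shift in mean similarity is larger than k times the wider spread."""
    mean_early, sd_early = summarise(sims_early)
    mean_late, sd_late = summarise(sims_late)
    return abs(mean_late - mean_early) > k * max(sd_early, sd_late)

# Hypothetical cosine similarities for one word pair, five seeds per slice.
early = [0.61, 0.48, 0.55, 0.58, 0.52]
late_shifted = [0.31, 0.27, 0.35, 0.29, 0.33]
late_stable = [0.50, 0.57, 0.49, 0.60, 0.53]

print(summarise(early))
print(exceeds_noise(early, late_shifted))  # shift is well outside the spread
print(exceeds_noise(early, late_stable))   # shift is inside the spread
```

Setting `k=2.0` gives a more conservative band; whichever multiplier you choose, report it alongside the means.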
Static versus contextual embeddings
Use static models (word2vec, fastText) for long-run semantic change — one vector per word per period is exactly what drift analysis needs. Reach for contextual models (BERT-style) only when polysemy within a period is the question, e.g. distinguishing bank (river) from bank (money) in the same decade. Contextual pooling, layer choice and compute cost make them overkill for most decade-scale drift studies.
Validating a "meaning change" claim
Before publishing that a word's meaning shifted:
- Confirm both models were Procrustes-aligned.
- Check the word's frequency in both slices is high enough (low-frequency words have unstable vectors).
- Inspect the actual nearest neighbours by hand — the cosine number means nothing without the words behind it.
- Show the change exceeds your seed-ensemble noise band.
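For the neighbour inspection step, it helps to see which neighbours were lost, kept and gained between periods instead of reading two lists side by side. The sketch below assumes you already have the top-k neighbour words for each slice (with gensim, for instance, from `model.wv.most_similar(word, topn=k)`); the function name and the word lists are hypothetical.

```python
def neighbour_shift(early_neighbours, late_neighbours):
    """Split two nearest-neighbour lists into lost, kept and gained words."""
    early, late = set(early_neighbours), set(late_neighbours)
    return {
        "lost": sorted(early - late),
        "kept": sorted(early & late),
        "gained": sorted(late - early),
    }

# Hypothetical neighbour lists for "broadcast" in two time slices.
early_list = ["seed", "sow", "scatter", "spread"]
late_list = ["radio", "transmit", "spread", "station"]

shift = neighbour_shift(early_list, late_list)
for label, words in shift.items():
    print(f"{label}: {', '.join(words)}")
```

The printed words still need a human reading: a list of gained neighbours that is mostly OCR debris points back to the corpus, not to a change in meaning.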
Key Takeaways
- Most embedding failures trace to noisy corpora, unaligned time slices, or ignored stochasticity.
- Raise min_count to 20+ to purge OCR garbage from nearest-neighbour lists.
- Always Procrustes-align separate time-slice models before comparing vectors across periods.
- You need a few million tokens per slice for from-scratch training; below that, fine-tune instead.
- Train multiple seeds and report a noise band; ignore drift smaller than that band.
- Use static embeddings for long-run change, contextual only for within-period polysemy.
- Treat analogy arithmetic as exploratory; nearest-neighbour shifts are the defensible evidence.
Frequently Asked Questions
Why are my nearest neighbours just OCR garbage?
Your corpus is too noisy or too small. Embeddings amplify low-frequency junk. Filter tokens below a minimum count (min_count=20+) and clean OCR before training.
Why do two time-slice models give incomparable vectors?
Each model is trained in its own random coordinate space. You must align them with orthogonal Procrustes before measuring drift, or distances are meaningless.
How much text do I need to train usable embeddings?
A few million tokens per time slice is a practical floor for word2vec. Below that, switch to fine-tuning a pretrained contextual model rather than training from scratch.
Static or contextual embeddings for semantic change?
Static (word2vec/fastText) are simpler and give one vector per word per period, which is ideal for drift over decades. Contextual (BERT) capture polysemy but need careful pooling and far more compute.
Why is cosine similarity unstable between runs?
Word2vec is stochastic. Train several models per slice with different seeds and average the similarities, or use the ensemble to get confidence intervals.
Can I trust an analogy like king - man + woman on historical text?
Rarely. Analogy arithmetic is fragile and corpus-dependent; treat it as exploratory, never as evidence. Nearest-neighbour shifts over time are far more defensible.