Training a transformer on Linear B: what 1,427 tablets do and don't teach a 90-million-parameter model

Linear B is a script whose decipherment is settled (we have known since Michael Ventris's work in 1952 that it writes an early form of Greek), but the corpus is small, broken, and full of administrative shorthand: the scribes abbreviated as ruthlessly as any modern accountant. The unresolved questions are not "what does the script mean?" but "what does this specific fragment of this specific tablet mean, given that half of it is missing and the other half is in scribal shorthand that we only partially understand?"

That is the kind of problem to which a small language model is well matched. Not because it knows Greek, but because it can learn the shape of the administrative formulae and the conditional probability of what comes next, given a context. This goes against the prevailing news cycle: the $650-billion AI bet now being tested in Wall Street earnings is overwhelmingly about scaling generic foundation models, not about small, careful, domain-specific ones. The frontier labs and the epigraphers are increasingly working on different problems with the same hardware.

This is a note on the model we have been training at Sydney since January, what it does well, and the places where it is wrong in ways that have taught us something.

The corpus

There are 5,876 Linear B tablets and fragments catalogued across Pylos, Knossos, Thebes, Mycenae and the smaller centres. After excluding tablets that are too fragmentary to give an unambiguous reading on at least one side and after deduplicating the seven scribal hands we wanted to keep separate, the training set landed at 1,427 tablets.

That is, by 2026 standards, vanishingly little training data. It is also, after the filtering described above, everything in Mycenaean Linear B that can be read without ambiguity. We use the DĀMOS digital edition at Oslo as our base; their transcriptions are the cleanest open dataset available, and they cite the original publications down to the line.

Subcorpus	Tablets	Tokens	Avg. tokens/tablet
Knossos	952	41,303	43.4
Pylos	387	24,180	62.5
Thebes	56	2,011	35.9
Mycenae	22	910	41.4
Other	10	402	40.2
Total	1,427	68,806	48.2

A "token" here is a single Linear B sign — either a syllabogram, a logogram, or one of the metric ideograms. We did not collapse to phonetic Greek, on purpose: we wanted the model to learn the script as the scribes wrote it, ideograms and all.
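The averages and totals in the table follow directly from the two count columns. As a quick arithmetic check, the sketch below recomputes them from the figures quoted above; the names `SUBCORPORA` and `summarise` are illustrative and not taken from the lab's code.

```python
# Counts copied from the subcorpus table: (tablets, tokens).
SUBCORPORA = {
    "Knossos": (952, 41_303),
    "Pylos": (387, 24_180),
    "Thebes": (56, 2_011),
    "Mycenae": (22, 910),
    "Other": (10, 402),
}

def summarise(subcorpora):
    """Return per-site averages, total tablets, total tokens, overall average."""
    averages = {
        site: round(tokens / tablets, 1)
        for site, (tablets, tokens) in subcorpora.items()
    }
    total_tablets = sum(tablets for tablets, _ in subcorpora.values())
    total_tokens = sum(tokens for _, tokens in subcorpora.values())
    overall = round(total_tokens / total_tablets, 1)
    return averages, total_tablets, total_tokens, overall

averages, total_tablets, total_tokens, overall = summarise(SUBCORPORA)
print(averages)
print(total_tablets, total_tokens, overall)
```

The recomputed values agree with every row of the table, including the total line.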

The architecture

We trained a 90M-parameter decoder-only transformer (12 layers, 12 heads, model dim 768, context 256 tokens) from scratch using a vocabulary of 408 unique signs and a small set of control tokens for tablet boundaries, scribal hand, and findspot. Training ran for 80 epochs on a single A6000 over four days. We deliberately did not pre-train on anything else — no Greek, no English, no other ancient corpora — because we wanted to know what the Linear B data alone is sufficient to teach a model.
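For readers who want to see where a figure of that size comes from, here is a rough parameter count for the stated dimensions. It is a sketch under assumptions the text does not confirm: a GPT-2-style block layout, a feed-forward width of four times the model dimension, learned position embeddings, and an output head tied to the token embeddings. The control tokens are left out because their number is not given. The function name `gpt_style_param_count` is illustrative.

```python
def gpt_style_param_count(n_layers, d_model, vocab_size, context, mlp_ratio=4):
    """Count parameters for a GPT-2-style decoder (an assumed layout)."""
    # Attention: Q, K, V and output projections, each d_model x d_model plus bias.
    # The number of heads does not change the count.
    attention = 4 * (d_model * d_model + d_model)
    # Feed-forward: up-projection and down-projection, both with biases.
    d_ff = mlp_ratio * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    per_layer = attention + mlp + norms
    # Token embeddings plus learned position embeddings; output head assumed tied.
    embeddings = vocab_size * d_model + context * d_model
    final_norm = 2 * d_model
    return n_layers * per_layer + embeddings + final_norm

total = gpt_style_param_count(n_layers=12, d_model=768, vocab_size=408, context=256)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the count lands in the mid-80 millions, the same order as the quoted 90M. With a vocabulary this small, almost all of the parameters sit in the transformer blocks rather than in the embeddings, so the exact figure depends on block details not stated here.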

The full training script and the model card are on the lab's GitHub (the repository will go public alongside the journal paper). The DĀMOS team has reviewed and approved the transcription preprocessing.

Where the model is good

It is extremely good at the boring parts. Given the opening signs of a Knossos sheep tablet, it will complete the rest of the line — including the ideogram for the animal, a credible flock size, and the geographic determinative — with a top-1 accuracy of 78 %. On a held-out test set of 142 tablets, where we delete the final 25 % of each line and ask the model to predict it, the unigram accuracy on syllabograms reaches 81 %.
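The held-out protocol described above can be sketched as follows. This is a minimal illustration, not the evaluation code from the repository: `predict_next` is a hypothetical callable standing in for the model, the bigram baseline is a deliberately crude substitute for the transformer, the toy lines use invented placeholder signs, and feeding the model its own guesses (rather than the true signs) is an assumption about how the suffix is generated.

```python
import math
from collections import Counter, defaultdict

def split_line(signs, holdout=0.25):
    """Split a line into the visible prefix and the final `holdout` fraction."""
    n_hidden = max(1, math.ceil(len(signs) * holdout))
    return signs[:-n_hidden], signs[-n_hidden:]

def suffix_accuracy(lines, predict_next, holdout=0.25):
    """Per-sign accuracy when the hidden suffix is generated greedily."""
    correct = total = 0
    for signs in lines:
        prefix, target = split_line(signs, holdout)
        context = list(prefix)
        for gold in target:
            guess = predict_next(context)
            correct += guess == gold
            total += 1
            context.append(guess)  # free-running: the model sees its own output
    return correct / total if total else 0.0

def train_bigram_baseline(lines):
    """Stand-in for the transformer: predict the most frequent successor sign."""
    successors = defaultdict(Counter)
    for signs in lines:
        for prev, nxt in zip(signs, signs[1:]):
            successors[prev][nxt] += 1

    def predict_next(context):
        options = successors.get(context[-1]) if context else None
        return options.most_common(1)[0][0] if options else "<unk>"

    return predict_next

# Invented placeholder lines, not real tablet readings.
train_lines = [
    ["A", "B", "C", "SHEEP", "N100"],
    ["A", "B", "C", "SHEEP", "N100"],
    ["D", "E", "C", "SHEEP", "N50"],
]
test_lines = [["D", "B", "C", "SHEEP", "N100"]]

baseline = train_bigram_baseline(train_lines)
accuracy = suffix_accuracy(test_lines, baseline)
print(accuracy)
```

A baseline like this is also a useful yardstick: on formulaic material even a bigram model scores well, so the transformer's numbers are most informative when read against it.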

This is what it should be good at. Mycenaean palace accounting is formulaic. The model is learning the form.

Where the model is interesting

The unresolved KN Fp series at Knossos is a set of small ritual tablets recording offerings of olive oil to deities. The series uses an abbreviated formula and a few signs whose readings are still debated — most notably qe-ra-si-ja, generally read as a divine name but in dispute since at least Killen (1987).

We ran the model against six fragments where the deity name has been partly obliterated, asking it to suggest completions ranked by likelihood. On four of the six it agreed with the consensus reading. On two — KN Fp 13 and KN Fp 354 — it offered alternative completions that scored higher in its own likelihood than the reading currently in the editions.
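Ranking completions by likelihood amounts to scoring each restored line and sorting. The sketch below shows the idea with an add-one-smoothed bigram model as a toy stand-in for the transformer; `BigramScorer` and `rank_restorations` are hypothetical names, and the example lines use invented placeholder signs rather than anything from the Fp series.

```python
import math
from collections import Counter

class BigramScorer:
    """Add-one-smoothed bigram model; a toy stand-in for the transformer."""

    def __init__(self, lines):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.vocab = set()
        for signs in lines:
            padded = ["<s>"] + list(signs) + ["</s>"]
            self.vocab.update(padded)
            for prev, nxt in zip(padded, padded[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, nxt)] += 1

    def log_prob(self, signs):
        """Log-likelihood of a whole line under the smoothed bigram model."""
        padded = ["<s>"] + list(signs) + ["</s>"]
        v = len(self.vocab)
        return sum(
            math.log((self.bigrams[(p, n)] + 1) / (self.unigrams[p] + v))
            for p, n in zip(padded, padded[1:])
        )

def rank_restorations(scorer, before, after, candidates):
    """Rank candidate fillers for a gap by the likelihood of the restored line."""
    scored = [
        (scorer.log_prob(list(before) + list(c) + list(after)), tuple(c))
        for c in candidates
    ]
    return sorted(scored, reverse=True)

# Invented placeholder lines, not real tablet readings.
lines = [["X", "Y", "OIL"], ["X", "Y", "OIL"], ["X", "Z", "OIL"]]
scorer = BigramScorer(lines)
ranked = rank_restorations(scorer, before=["X"], after=["OIL"], candidates=[["Z"], ["Y"]])
for log_p, filler in ranked:
    print(filler, round(log_p, 3))
```

One design caveat: whole-line likelihood favours shorter fillers, so candidates of different lengths need some form of length normalisation before their scores are compared.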

The interesting question is not whether the model is right. It almost certainly isn't right in the sense that a human epigrapher would mean it. The interesting question is what statistical pattern the model has latched onto that makes the consensus reading look unlikely to a model that has seen every other Knossos tablet.

We are not publishing the alternative readings yet. We will, but only after running them past the Knossos editors and the Pasiphae project at Cologne, because if there is something real here it needs the people who have spent careers on this material to interrogate it.

Where the model fails, and what that tells us

Three categories of failure are worth recording.

1. It cannot generalise to Pylos. Train on Knossos, test on Pylos, and the unigram accuracy drops to 41 %. The two palaces are administratively different in ways the model has not abstracted. That is consistent with what specialists have long argued — these are two distinct bureaucratic traditions sharing a script — and is a useful sanity check.

2. It hallucinates plausible-looking ideograms. On low-context completions it will sometimes invent an ideogram that does not exist, or use a real ideogram in a context it never appears in. This is the classic LM hallucination problem in miniature, with an extra wrinkle: an undergraduate epigrapher would also do this, and the cure is the same — more time with the corpus. For a useful weekly tracker of how the bigger labs are (and aren't) reducing this in production models, AI/TLDR is what I read; the weekly HackMD notes are the long version when the headline summary isn't enough.

3. It cannot read the scribal hand. We encode scribal hand as a control token, and the model uses it for stylistic cues, but it cannot tell us why one hand differs from another. That is fine; that wasn't the assignment. But it's a reminder that statistical models give you correlation, not the kind of palaeographic explanation that an editor needs to put a reading in print.
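The second failure mode lends itself to a mechanical pre-filter before a human looks at a completion. The sketch below flags signs outside the training vocabulary and ideograms that follow a sign they never follow in the training lines. Defining "context" as the single preceding sign is an assumption made for brevity, and `attested_contexts`, `flag_completion` and the toy signs are all hypothetical.

```python
def attested_contexts(lines, ideograms):
    """Map each ideogram to the set of signs seen immediately before it."""
    seen = {}
    for signs in lines:
        signs = list(signs)
        for prev, cur in zip(["<s>"] + signs, signs):
            if cur in ideograms:
                seen.setdefault(cur, set()).add(prev)
    return seen

def flag_completion(completion, vocab, ideograms, seen):
    """Return (position, sign, reason) for each suspicious sign in a completion."""
    flags = []
    signs = list(completion)
    for i, (prev, cur) in enumerate(zip(["<s>"] + signs, signs)):
        if cur not in vocab:
            flags.append((i, cur, "sign not in training vocabulary"))
        elif cur in ideograms and prev not in seen.get(cur, set()):
            flags.append((i, cur, "ideogram in unattested context"))
    return flags

# Invented placeholder lines, not real tablet readings.
train = [["A", "SHEEP", "N1"], ["B", "SHEEP", "N2"]]
ideograms = {"SHEEP", "OIL"}
vocab = {sign for line in train for sign in line}
seen = attested_contexts(train, ideograms)

flags = flag_completion(["N1", "SHEEP", "GOAT"], vocab, ideograms, seen)
for flag in flags:
    print(flag)
```

A filter like this only says that a completion is unprecedented, not that it is wrong; a genuinely anomalous tablet would trip it too.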

What this is good for

Three uses, in increasing order of usefulness.

1. Restoration suggestions. Given a partially broken line, the model produces a ranked list of completions. The top suggestion is right often enough to be useful as a starting point, but it should never be the only source for a printed restoration.

2. Anomaly detection. When the model assigns a very low likelihood to a published reading, that is sometimes a sign that the reading is wrong, sometimes a sign that the model is wrong, and sometimes a sign that the tablet really is anomalous and worth a second look. All three of those are useful.

3. Curriculum design. Some of the most useful output has been the order in which the model gets things right. Tablets the model masters early in training are the formulaic ones. Tablets it never masters tend to be the ones that human readers also find difficult. That correlation is a small empirical handle on what "difficult" means in epigraphy.
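The curriculum idea needs a working definition of "masters". One possible definition, sketched below, is the first epoch from which a tablet's accuracy stays at or above a threshold for a few consecutive epochs. The threshold, the patience window, the function names and the tablet ids are all assumptions for illustration; the text does not say how mastery was measured.

```python
def first_mastered_epoch(accuracy_by_epoch, threshold=0.9, patience=2):
    """First epoch (1-based) starting a run of `patience` epochs at or above threshold."""
    run = 0
    for epoch, acc in enumerate(accuracy_by_epoch, start=1):
        run = run + 1 if acc >= threshold else 0
        if run == patience:
            return epoch - patience + 1
    return None  # never mastered within the recorded epochs

def curriculum_order(history, threshold=0.9, patience=2):
    """Sort tablet ids by mastery epoch; never-mastered tablets go last."""
    mastered = {
        tablet_id: first_mastered_epoch(accs, threshold, patience)
        for tablet_id, accs in history.items()
    }
    return sorted(
        mastered,
        key=lambda tid: (mastered[tid] is None, mastered[tid] or 0, tid),
    )

# Hypothetical per-epoch accuracies for three placeholder tablets.
history = {
    "tablet-1": [0.50, 0.95, 0.96],
    "tablet-2": [0.95, 0.97, 0.99],
    "tablet-3": [0.20, 0.30, 0.95],
}
order = curriculum_order(history)
print(order)
```

Requiring a sustained run rather than a single good epoch keeps a lucky early guess from counting as mastery.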

What it is not good for is replacing the editorial judgement of the people who have spent thirty years reading these tablets. It is a useful instrument in their workshop. It is not a substitute for the workshop.

The next dispatch in this thread will be the joint write-up with Pasiphae on the two alternative readings, if and only if they survive the joint review. If they don't, I will publish a note on why.

— Elara
