Skip to content
Paleography Foundations

A working paleography glossary is a short, versioned, project-specific reference that pins down exactly how you name and tag every script feature, abbreviation and letterform you encounter. Used well, it is not a textbook to read once but a living checklist you cite in every transcription note, so two transcribers — or you on two different days — describe the same mark the same way. The single most important practice is to give each term a stable ID and a cited definition, then refer to that ID everywhere rather than retyping the word.

Why a working glossary beats a big published one

A published reference like Denis Muzerelle's Vocabulaire codicologique is authoritative but exhaustive; most projects use perhaps 5% of it. A working glossary is the curated subset you genuinely touch, with your own scope notes attached. The discipline is subtractive: every term you keep must earn its place by appearing in real transcription decisions.

yaml
# glossary.yml — one entry, versioned in git
- id: abbr.macron
  term: macron (suspension stroke)
  scope: "Latin, 12th–15th c."
  definition: "Horizontal stroke marking omission of m or n."
  source: "Muzerelle 1985, §433.07"
  expansion_rule: "Expand silently; record in diplomatic layer."

How do I structure each glossary entry?

Keep the schema flat and boring so it survives for years. Six fields cover almost every need: a stable id, a human-readable term, a scope (script and date range), a definition, a source citation, and an action field such as expansion_rule that tells a transcriber what to do. The action field is what separates a working glossary from a dictionary — it encodes your editorial policy, not just meaning.

Which authority should I anchor definitions to?

Pick one primary authority and one fallback, then stop. Anchoring prevents drift and makes your work defensible to reviewers.

NeedPrimary authorityNotes
Codicological termsMuzerelle, Vocabulaire codicologiqueMultilingual, freely online
English-language usageMichelle Brown, A Guide to Western Historical ScriptsGood plates
AbbreviationsCappelli, Lexicon AbbreviaturarumThe standard lookup
Library/archive metadataIFLA / RDA terminologyFor catalogue alignment

Cite the section number, not just the book, so anyone can verify in seconds.

How do I keep terms consistent across a team?

Store one file, version it, and forbid free-typing. If a transcriber needs a term that is not there, they propose an addition via a pull request rather than inventing it inline. A tiny validation script catches drift before it spreads:

bash
# flag any tag in transcriptions not present in the glossary
comm -23 \
  <(grep -ohE 'abbr\.[a-z]+' transcriptions/*.txt | sort -u) \
  <(grep -oE '^- id: \K\S+' glossary.yml | sort -u)

Any output is an undocumented term that needs a decision.

When should I retire or merge a term?

Review at every batch boundary. Count usages: a term seen fewer than three times is usually either a duplicate of an existing entry or genuinely rare and better folded into a broader category. Log the merge with a date and the surviving ID so older notes stay legible. Never delete an ID outright — mark it deprecated: true and point to its replacement, the same way you would handle a renamed code symbol.

Making the glossary machine-readable

Because the file is structured YAML with stable keys, it doubles as a controlled vocabulary for your HTR pipeline. Feed the IDs into ground-truth metadata so a Transkribus or eScriptorium export carries consistent labels, and run the validation step above in CI so a malformed entry fails the build before it reaches a transcriber.

Key Takeaways

  • A working glossary is a curated, versioned subset of a published authority — not a textbook.
  • Give every term a stable id and cite that ID in notes instead of retyping words.
  • Each entry needs term, scope, definition, source and an action (expansion or tagging rule).
  • Anchor definitions to one authority (Muzerelle, Cappelli, Brown) and record the section number.
  • Validate tag usage automatically so undocumented terms surface early.
  • Review at batch boundaries; merge rare terms and deprecate rather than delete IDs.
  • A structured glossary doubles as a controlled vocabulary for HTR ground truth.

Frequently Asked Questions

What belongs in a working paleography glossary?

Each entry needs a term, a one-line definition, a script or period scope, a source citation and ideally an image crop. Skip terms you will never tag; a 60-term glossary you actually use beats a 600-term one you ignore.

Should I reuse a published glossary or build my own?

Anchor your definitions to a published authority such as Denis Muzerelle's Vocabulaire codicologique or the IFLA terminology, then add project-specific notes. Never silently redefine a standard term.

How do I keep glossary terms consistent across a team?

Store the glossary as a single version-controlled file (CSV or YAML), give every term a stable ID, and require transcribers to cite that ID in their notes rather than free-typing the word.

What is the difference between a brevigraph and an abbreviation mark?

A brevigraph is a single sign standing for a whole word or syllable, such as the Tironian et. An abbreviation mark, like a macron or a suspension stroke, signals that letters have been omitted from an otherwise spelled word.

How often should I review the glossary?

Review at the end of each manuscript or batch. Flag any term used fewer than three times for merge or deletion, and log every change with a date so older transcriptions remain interpretable.

Can a glossary be machine-readable for HTR projects?

Yes. A YAML or JSON glossary with stable keys lets you validate tag usage automatically and feed consistent labels into ground-truth metadata for Transkribus or eScriptorium.