A working paleography glossary is a short, versioned, project-specific reference that pins down exactly how you name and tag every script feature, abbreviation and letterform you encounter. Used well, it is not a textbook to read once but a living checklist you cite in every transcription note, so two transcribers — or you on two different days — describe the same mark the same way. The single most important practice is to give each term a stable ID and a cited definition, then refer to that ID everywhere rather than retyping the word.
Why a working glossary beats a big published one
A published reference like Denis Muzerelle's Vocabulaire codicologique is authoritative but exhaustive; most projects use perhaps 5% of it. A working glossary is the curated subset you genuinely touch, with your own scope notes attached. The discipline is subtractive: every term you keep must earn its place by appearing in real transcription decisions.
```yaml
# glossary.yml: one entry, versioned in git
- id: abbr.macron
  term: macron (suspension stroke)
  scope: "Latin, 12th–15th c."
  definition: "Horizontal stroke marking omission of m or n."
  source: "Muzerelle 1985, §433.07"
  expansion_rule: "Expand silently; record in diplomatic layer."
```

How do I structure each glossary entry?
Keep the schema flat and boring so it survives for years. Six fields cover almost every need: a stable id, a human-readable term, a scope (script and date range), a definition, a source citation, and an action field such as expansion_rule that tells a transcriber what to do. The action field is what separates a working glossary from a dictionary — it encodes your editorial policy, not just meaning.
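The six-field schema is simple enough to check mechanically. The following is a minimal sketch, assuming the entries have already been loaded from glossary.yml into Python dictionaries (for example with a YAML parser). The names `REQUIRED_FIELDS` and `check_entry` are hypothetical, and `expansion_rule` stands in for whatever action field a project chooses.

```python
# Hypothetical schema check for glossary entries already loaded as dicts.
REQUIRED_FIELDS = ("id", "term", "scope", "definition", "source", "expansion_rule")


def check_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one glossary entry (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        # Every field must be present and hold non-blank text.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


entry = {
    "id": "abbr.macron",
    "term": "macron (suspension stroke)",
    "scope": "Latin, 12th–15th c.",
    "definition": "Horizontal stroke marking omission of m or n.",
    "source": "Muzerelle 1985, §433.07",
    "expansion_rule": "Expand silently; record in diplomatic layer.",
}

print(check_entry(entry))                  # complete entry: no problems
print(check_entry({"id": "abbr.macron"}))  # lists the five missing fields
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with an entry in a single pass.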
Which authority should I anchor definitions to?
Pick one primary authority and one fallback, then stop. Anchoring prevents drift and makes your work defensible to reviewers.
| Need | Primary authority | Notes |
|---|---|---|
| Codicological terms | Muzerelle, Vocabulaire codicologique | Multilingual, freely online |
| English-language usage | Michelle Brown, A Guide to Western Historical Scripts | Good plates |
| Abbreviations | Cappelli, Lexicon Abbreviaturarum | The standard lookup |
| Library/archive metadata | IFLA / RDA terminology | For catalogue alignment |
Cite the section number, not just the book, so anyone can verify in seconds.
How do I keep terms consistent across a team?
Store one file, version it, and forbid free-typing. If a transcriber needs a term that is not there, they propose an addition via a pull request rather than inventing it inline. A tiny validation script catches drift before it spreads:
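```bash
# Flag any tag used in transcriptions that is not present in the glossary.
# comm -23 prints only the lines unique to the first sorted input.
# Requires bash for the <(...) process substitution.
comm -23 \
  <(grep -ohE 'abbr\.[a-z]+' transcriptions/*.txt | sort -u) \
  <(sed -n 's/^- id: *//p' glossary.yml | sort -u)
```

Any output is an undocumented term that needs a decision.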
When should I retire or merge a term?
Review at every batch boundary. Count usages: a term seen fewer than three times is usually either a duplicate of an existing entry or genuinely rare and better folded into a broader category. Log the merge with a date and the surviving ID so older notes stay legible. Never delete an ID outright — mark it deprecated: true and point to its replacement, the same way you would handle a renamed code symbol.
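The review step can be sketched in a few lines. This is only an illustration under stated assumptions: the glossary is held as a dictionary keyed by ID, every tag in the batch is already documented, and `replaced_by` is a hypothetical field name for the pointer to the surviving ID. The functions `resolve` and `review_candidates` are likewise invented for this example.

```python
from collections import Counter

# Hypothetical glossary in which abbr.stroke was merged into abbr.macron.
glossary = {
    "abbr.macron": {"term": "macron"},
    "abbr.stroke": {"term": "stroke", "deprecated": True, "replaced_by": "abbr.macron"},
    "abbr.tironian_et": {"term": "Tironian et"},
}


def resolve(term_id: str, glossary: dict) -> str:
    """Follow replaced_by pointers until a non-deprecated ID is reached."""
    seen = set()
    while glossary[term_id].get("deprecated"):
        if term_id in seen:
            raise ValueError(f"deprecation cycle at {term_id}")
        seen.add(term_id)
        term_id = glossary[term_id]["replaced_by"]
    return term_id


def review_candidates(tags, glossary, threshold=3):
    """Return live IDs used fewer than `threshold` times, sorted by ID."""
    # Old notes may still cite deprecated IDs, so count them under the survivor.
    counts = Counter(resolve(tag, glossary) for tag in tags)
    return sorted(
        term_id
        for term_id, entry in glossary.items()
        if not entry.get("deprecated") and counts[term_id] < threshold
    )


tags = ["abbr.macron", "abbr.stroke", "abbr.macron", "abbr.tironian_et"]
print(resolve("abbr.stroke", glossary))
print(review_candidates(tags, glossary))
```

Because deprecated IDs stay in the file and resolve to their replacement, older transcription notes that cite them remain legible and still contribute to the usage counts.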
Making the glossary machine-readable
Because the file is structured YAML with stable keys, it doubles as a controlled vocabulary for your HTR pipeline. Feed the IDs into ground-truth metadata so a Transkribus or eScriptorium export carries consistent labels, and run the validation step above in CI, treating any output as a failure, so an undocumented tag breaks the build before it spreads into more transcriptions.
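A CI job needs a non-zero exit status to fail the build. The sketch below shows one way to get that, assuming transcription texts and glossary IDs have already been read into memory. The bracketed tag notation in the sample texts and the function names are hypothetical; the regular expression mirrors the `abbr.` pattern used in the shell check.

```python
import re

TAG_PATTERN = re.compile(r"abbr\.[a-z]+")  # same tag shape as the shell check


def undocumented_tags(transcription_texts, glossary_ids):
    """Return the sorted tags used in the texts but absent from the glossary."""
    used = set()
    for text in transcription_texts:
        used.update(TAG_PATTERN.findall(text))
    return sorted(used - set(glossary_ids))


def ci_exit_code(transcription_texts, glossary_ids):
    """Return 0 when every tag is documented, 1 otherwise (pass to sys.exit)."""
    missing = undocumented_tags(transcription_texts, glossary_ids)
    for tag in missing:
        print(f"undocumented tag: {tag}")
    return 1 if missing else 0


# Hypothetical sample data: abbr.hook is used but not in the glossary.
texts = ["d[abbr.macron]ns", "[abbr.hook] est"]
print(ci_exit_code(texts, {"abbr.macron"}))
```

In a real job the return value would be handed to `sys.exit` so that the CI runner marks the step as failed.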
Key Takeaways
- A working glossary is a curated, versioned subset of a published authority — not a textbook.
- Give every term a stable id and cite that ID in notes instead of retyping words.
- Each entry needs term, scope, definition, source and an action (expansion or tagging rule).
- Anchor each kind of definition to one primary authority (for example Muzerelle, Cappelli or Brown) and record the section number.
- Validate tag usage automatically so undocumented terms surface early.
- Review at batch boundaries; merge rare terms and deprecate rather than delete IDs.
- A structured glossary doubles as a controlled vocabulary for HTR ground truth.
Frequently Asked Questions
What belongs in a working paleography glossary?
Each entry needs a stable ID, a term, a one-line definition, a script and period scope, a source citation, an action rule, and ideally an image crop. Skip terms you will never tag; a 60-term glossary you actually use beats a 600-term one you ignore.
Should I reuse a published glossary or build my own?
Anchor your definitions to a published authority such as Denis Muzerelle's Vocabulaire codicologique or the IFLA terminology, then add project-specific notes. Never silently redefine a standard term.
How do I keep glossary terms consistent across a team?
Store the glossary as a single version-controlled file (CSV or YAML), give every term a stable ID, and require transcribers to cite that ID in their notes rather than free-typing the word.
What is the difference between a brevigraph and an abbreviation mark?
A brevigraph is a single sign standing for a whole word or syllable, such as the Tironian et. An abbreviation mark, like a macron or a suspension stroke, signals that letters have been omitted from an otherwise spelled word.
How often should I review the glossary?
Review at the end of each manuscript or batch. Flag any term used fewer than three times for merge or deprecation, never outright deletion, and log every change with a date and the surviving ID so older transcriptions remain interpretable.
Can a glossary be machine-readable for HTR projects?
Yes. A YAML or JSON glossary with stable keys lets you validate tag usage automatically and feed consistent labels into ground-truth metadata for Transkribus or eScriptorium.