Three habits make entity annotation efficient: have a model or gazetteer pre-label the text so annotators correct suggestions instead of labelling from a blank page, freeze a short written guideline before the first batch, and fix keyboard shortcuts for your label set. Together they can cut annotation time several-fold while improving consistency. Below is a concrete workflow, with defaults that work for historical sources, where spelling, titles and abbreviations create most of the friction.
Which tool should you actually use?
For scholarly entity work, three open-source options dominate:
| Tool | Strength | Best when |
|---|---|---|
| INCEpTION | entity linking + recommenders | you link to an authority (Wikidata, GND) |
| Label Studio | flexible, many data types | mixed text/image projects |
| doccano | lightweight, fast to deploy | quick span-only tagging |
For computational history, INCEpTION is usually the right default because it supports linking spans to an authority and offers recommender-assisted pre-annotation. Pick one and commit; switching tools mid-project wastes more time than any feature gains back.
How do you set up pre-annotation?
The biggest single speed-up is never starting from blank text. Run a baseline model or a gazetteer first, import its output as suggestions, and let annotators accept or fix.
```python
# Produce pre-annotations to import into the annotation tool.
import json

import spacy

nlp = spacy.load("en_core_web_trf")

# corpus is assumed to be an iterable of (doc_id, text) pairs
with open("pre_annotations.jsonl", "w", encoding="utf-8") as f:
    for doc_id, text in corpus:
        doc = nlp(text)
        spans = [
            {"start": e.start_char, "end": e.end_char, "label": e.label_, "text": e.text}
            for e in doc.ents
        ]
        f.write(json.dumps({"id": doc_id, "text": text, "spans": spans}) + "\n")
```

Correcting suggestions is often several times faster than cold labelling, even when the baseline is mediocre: the annotator's job becomes "fix the boundary, change the label", not "find every entity".
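Where no suitable model exists for the period or language, the gazetteer route can be sketched with the standard library alone. This is a minimal illustration, not a tool-specific importer: the `GAZETTEER` entries, the `gazetteer_spans` helper and the sample sentence are all assumptions made for the example, and the output uses the same `start`/`end`/`label`/`text` keys as the pre-annotation records above.

```python
import re

# Hypothetical gazetteer: surface form -> label. A real one would come from
# an authority file or a project name list.
GAZETTEER = {
    "Isaac Newton": "PER",
    "Royal Society": "ORG",
    "Cambridge": "PLACE",
}


def gazetteer_spans(text, gazetteer=GAZETTEER):
    """Return non-overlapping character-offset spans for exact gazetteer matches."""
    # Try longer entries first so a long name wins over a shorter entry
    # that starts at the same position.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        {"start": m.start(), "end": m.end(), "label": gazetteer[m.group()], "text": m.group()}
        for m in pattern.finditer(text)
    ]


sample = "Isaac Newton was elected to the Royal Society while at Cambridge."
spans = gazetteer_spans(sample)
```

Exact matching misses variant spellings such as `Wm.` for `William`, so on historical sources a gazetteer like this is best treated as a source of suggestions to correct, not as final labels.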
What goes in the guideline before you start?
Keep it to one page. For each entity type, give a definition and two or three borderline examples, then add explicit rules for the things that actually cause disputes:
- Span boundaries: is `Sir Isaac Newton` the span, or `Isaac Newton`? Decide once.
- Titles and offices: is `Bishop of Durham` a PERSON, ROLE, or split?
- Abbreviations: how do you tag `Capt.`, `Wm.`, `&c.`?
Run a small pilot, then revise. Trying to anticipate every case up front wastes days; a pilot surfaces the real ambiguities in an afternoon.
How do you measure that annotation is consistent?
For any dataset you will publish or train on, double-annotate a 10-15% overlap and compute inter-annotator agreement. Cohen's kappa or span-level F1 between annotators both work.
```python
from sklearn.metrics import cohen_kappa_score

# annot_a, annot_b: token-level labels from two annotators over the overlap set,
# as equal-length sequences aligned token by token
print(round(cohen_kappa_score(annot_a, annot_b), 3))  # e.g. 0.82
```

Below about 0.75 agreement, the problem is usually an ambiguous guideline, not the annotators. Fix the guideline and re-run the overlap before annotating the rest.
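Span-level F1 between two annotators needs no library. The sketch below assumes exact matching (same document, same offsets, same label) and represents each span as a `(doc_id, start, end, label)` tuple; the `span_f1` name and the sample tuples are illustrative. Under exact matching the score is symmetric, so it does not matter which annotator is treated as the reference.

```python
def span_f1(spans_a, spans_b):
    """Exact-match span F1 between two annotators over the overlap set.

    Each argument is an iterable of (doc_id, start, end, label) tuples.
    """
    set_a, set_b = set(spans_a), set(spans_b)
    if not set_a and not set_b:
        return 1.0  # neither annotator marked anything: treat as full agreement
    matched = len(set_a & set_b)
    precision = matched / len(set_b) if set_b else 0.0
    recall = matched / len(set_a) if set_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The two annotators disagree only on where the second span starts.
annotator_a = [("d1", 0, 12, "PER"), ("d1", 32, 45, "ORG"), ("d1", 55, 64, "PLACE")]
annotator_b = [("d1", 0, 12, "PER"), ("d1", 28, 45, "ORG"), ("d1", 55, 64, "PLACE")]
score = span_f1(annotator_a, annotator_b)
```

Because a single boundary disagreement counts as a full miss here, this measure is stricter than token-level kappa and makes boundary problems visible quickly.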
What span rule prevents the most arguments?
Boundary inconsistency, not label choice, is the biggest source of disagreement in historical NER. Decide and document whether honorifics and titles fall inside the span, and apply it everywhere. Sir Isaac Newton vs Isaac Newton, consistently chosen, removes more disagreement than any model improvement.
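Once the rule is written down, it can also be applied mechanically to pre-annotated spans before annotators see them. The sketch below assumes the guideline says "honorifics stay outside PERSON spans"; if your guideline chooses the opposite, the span would need to be expanded instead. The `HONORIFICS` list, the `strip_honorifics` helper and the accepted label names are hypothetical and should mirror your own guideline.

```python
import re

# Hypothetical honorific list; extend it to match the written guideline.
HONORIFICS = ("Sir", "Dame", "Lord", "Lady", "Dr.", "Mr.", "Mrs.", "Capt.", "Rev.")
_PREFIX = re.compile(r"^(?:(?:" + "|".join(re.escape(h) for h in HONORIFICS) + r")\s+)+")


def strip_honorifics(text, span, person_labels=("PER", "PERSON")):
    """Return a copy of span with leading honorifics moved outside its boundary."""
    if span["label"] not in person_labels:
        return dict(span)
    surface = text[span["start"]:span["end"]]
    m = _PREFIX.match(surface)
    if not m or m.end() == len(surface):
        return dict(span)  # nothing to trim, or the span is only a title
    start = span["start"] + m.end()
    return {**span, "start": start, "text": text[start:span["end"]]}


text = "Sir Isaac Newton presided over the Royal Society."
span = {"start": 0, "end": 16, "label": "PER", "text": "Sir Isaac Newton"}
fixed = strip_honorifics(text, span)
```

The original span is left untouched and a new dictionary is returned, so the raw model output can be kept alongside the normalised version.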
How do you keep the whole thing reproducible?
Treat annotation like data, not a one-off:
- Version the guideline (it is the schema's documentation).
- Export to a stable format — JSONL or CoNLL — with character offsets, never just highlighted text.
- Record tool version, label set and the guideline version alongside the export.
```text
release/
  guidelines_v3.md
  labels.txt          # PER ORG PLACE DATE ROLE
  annotations.jsonl   # with start/end char offsets
  agreement_v3.txt    # kappa = 0.84 over 12% overlap
```

Anyone should be able to reload the project and see exactly the decisions you made.
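Two small checks make a release easier to trust: confirming that every exported offset reproduces its span text, and writing the tool version, label set and guideline version to a file next to the export. The sketch below is one possible shape under stated assumptions: the `validate_offsets` and `write_manifest` helpers, the `manifest.json` filename and the tool version string are placeholders, and the records are assumed to be JSONL with `text` and `spans` fields carrying `start`, `end` and `text`.

```python
import json
import os
import tempfile


def validate_offsets(path):
    """Return (line_number, span) pairs whose offsets do not reproduce the span text."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            for span in record["spans"]:
                if record["text"][span["start"]:span["end"]] != span["text"]:
                    problems.append((line_no, span))
    return problems


def write_manifest(path, tool, tool_version, labels, guideline_version):
    """Record the settings that produced an export next to the export itself."""
    manifest = {
        "tool": tool,
        "tool_version": tool_version,
        "labels": labels,
        "guideline_version": guideline_version,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return manifest


# Demonstration in a temporary directory with one good and one broken record.
sentence = "Isaac Newton lived in Cambridge."
good = {"id": "d1", "text": sentence,
        "spans": [{"start": 0, "end": 12, "label": "PER", "text": "Isaac Newton"}]}
bad = {"id": "d2", "text": sentence,
       "spans": [{"start": 1, "end": 12, "label": "PER", "text": "Isaac Newton"}]}

with tempfile.TemporaryDirectory() as release_dir:
    export_path = os.path.join(release_dir, "annotations.jsonl")
    with open(export_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(good) + "\n")
        f.write(json.dumps(bad) + "\n")
    problems = validate_offsets(export_path)
    manifest = write_manifest(
        os.path.join(release_dir, "manifest.json"),  # hypothetical filename
        tool="INCEpTION",
        tool_version="0.0.0-placeholder",  # replace with the version you actually ran
        labels=["PER", "ORG", "PLACE", "DATE", "ROLE"],
        guideline_version="v3",
    )
```

An empty `problems` list is a cheap precondition to check before tagging a release.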
What are the most common efficiency mistakes?
Cold-labelling without pre-annotation; a 20-page guideline written before any pilot; no keyboard shortcuts (mouse-only labelling is brutally slow); and skipping the agreement check, which means you discover ambiguity only after the whole corpus is done, when fixing it means re-annotating.
Key Takeaways
- Pre-annotate with a model or gazetteer so annotators correct rather than label from scratch.
- Choose one tool and commit; INCEpTION suits scholarly work that links to an authority.
- Freeze a one-page guideline before the first batch, then refine after a pilot.
- Double-annotate a 10-15% overlap and measure agreement; below ~0.75 the guideline is the problem.
- Span-boundary consistency (titles in or out) prevents more disputes than any label decision.
- Export with character offsets in a stable format and version the guideline for reproducibility.
Frequently Asked Questions
Which tool should I use to annotate historical entities?
For most projects, an open-source span annotator like Label Studio, INCEpTION or doccano covers the need. INCEpTION suits scholarly work because it supports entity linking to an authority and recommender-assisted pre-annotation out of the box.
How do I make annotation faster without losing quality?
Pre-annotate with a model or gazetteer so annotators correct rather than label from scratch, fix keyboard shortcuts for your label set, and freeze a written guideline early. Correction is several times faster than cold labelling.
How long a guideline do I need before starting?
A short one: one page in total, with a definition for each entity type and two or three borderline examples each, plus explicit rules for spans, titles and abbreviations. Refine it after a pilot round rather than trying to anticipate everything.
Should two people annotate the same documents?
For any dataset you will publish or train on, yes. Double-annotate at least a 10-15% overlap so you can measure inter-annotator agreement and catch guideline ambiguities before they contaminate the whole corpus.
What span boundary rule avoids most disputes?
Decide once whether titles and honorifics are inside the span ('Sir Isaac Newton' vs 'Isaac Newton') and write it down. Boundary inconsistency, not label choice, is the largest source of annotator disagreement in historical NER.
How do I keep annotation reproducible?
Version the guideline, export annotations in a stable format like JSONL or CoNLL with character offsets, and record tool version and label schema. Anyone should be able to reload your project and see exactly the decisions you made.