How to Annotate Entities Efficiently

Annotate entities efficiently by having a model or gazetteer pre-label the text so that annotators correct suggestions rather than label from a blank page, by freezing a short written guideline before the first batch, and by assigning fixed keyboard shortcuts to your label set. Together, these three moves typically cut annotation time several-fold while improving consistency. Below is a concrete workflow, with defaults that work for historical sources, where spelling, titles and abbreviations create most of the friction.

Which tool should you actually use?

For scholarly entity work, three open-source options dominate:

Tool	Strength	Best when
INCEpTION	entity linking + recommenders	you link to an authority (Wikidata, GND)
Label Studio	flexible, many data types	mixed text/image projects
doccano	lightweight, fast to deploy	quick span-only tagging

For computational history, INCEpTION is usually the right default because it supports linking spans to an authority and offers recommender-assisted pre-annotation. Pick one and commit; switching tools mid-project wastes more time than any feature gains back.

How do you set up pre-annotation?

The biggest single speed-up is never starting from blank text. Run a baseline model or a gazetteer first, import its output as suggestions, and let annotators accept or fix.

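```python
# Produce pre-annotations to import into the annotation tool.
import json

import spacy

nlp = spacy.load("en_core_web_trf")

# `corpus` is any iterable of (doc_id, text) pairs; this one is a placeholder.
corpus = [("doc-001", "Sir Isaac Newton wrote to the Royal Society in 1672.")]

with open("pre_annotations.jsonl", "w", encoding="utf-8") as f:
    for doc_id, text in corpus:
        doc = nlp(text)
        spans = [{"start": e.start_char, "end": e.end_char,
                  "label": e.label_, "text": e.text} for e in doc.ents]
        # ensure_ascii=False keeps accented and historical characters readable.
        f.write(json.dumps({"id": doc_id, "text": text, "spans": spans},
                           ensure_ascii=False) + "\n")
```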

Correcting suggestions is typically several times faster than cold labelling, even when the baseline is mediocre: the annotator's job becomes "fix the boundary, change the label", not "find every entity".
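When no trained model fits the period or language, a gazetteer can produce the same kind of suggestions. The following is a minimal sketch, assuming a small hand-built dictionary of surface forms (the `GAZETTEER` entries and the `gazetteer_spans` helper are illustrative, not taken from any authority file or library). It emits spans in the same start/end/label shape as the spaCy example.

```python
import re

# Hypothetical gazetteer: surface form -> label. Replace with your own list.
GAZETTEER = {
    "Isaac Newton": "PER",
    "Royal Society": "ORG",
    "Durham": "PLACE",
}

def gazetteer_spans(text, gazetteer=GAZETTEER):
    """Return non-overlapping spans for gazetteer entries found in text."""
    # Try longer entries first so a long name wins over a shorter overlap.
    entries = sorted(gazetteer, key=len, reverse=True)
    taken = []  # (start, end) pairs already claimed
    spans = []
    for entry in entries:
        # \b keeps "Durham" from matching inside a longer word.
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
            start, end = m.span()
            if any(start < e and s < end for s, e in taken):
                continue  # overlaps a span that was already accepted
            taken.append((start, end))
            spans.append({"start": start, "end": end,
                          "label": gazetteer[entry], "text": m.group()})
    return sorted(spans, key=lambda s: s["start"])

sample = "Isaac Newton addressed the Royal Society before leaving for Durham."
for span in gazetteer_spans(sample):
    print(span)
```

Exact string matching like this misses spelling variants, which are common in historical sources, so it is best treated as a source of suggestions to correct rather than as finished labels. Entries that end in punctuation (such as abbreviations) would need a different boundary test than `\b`.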

What goes in the guideline before you start?

Keep it to one page. For each entity type, give a definition and two or three borderline examples, then add explicit rules for the things that actually cause disputes:

Span boundaries: is "Sir Isaac Newton" the span, or "Isaac Newton"? Decide once.
Titles and offices: is "Bishop of Durham" a PER, a ROLE, or split into two spans?
Abbreviations: how do you tag "Capt.", "Wm." and "&c."?

Run a small pilot, then revise. Trying to anticipate every case up front wastes days; a pilot surfaces the real ambiguities in an afternoon.

How do you measure that annotation is consistent?

For any dataset you will publish or train on, have two annotators label the same 10-15% of documents and compute inter-annotator agreement on that overlap. Either Cohen's kappa on token-level labels or span-level F1 between the two annotators works.
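Drawing the overlap reproducibly makes it possible to re-run the agreement check after a guideline revision on the same documents. A minimal sketch, assuming documents are identified by string ids (the `pick_overlap` helper, the 12% fraction and the ids are illustrative choices):

```python
import random

def pick_overlap(doc_ids, fraction=0.12, seed=42):
    """Pick a reproducible subset of documents for double annotation."""
    ids = sorted(doc_ids)                     # sort so input order does not matter
    k = max(1, round(len(ids) * fraction))    # always double-annotate at least one
    rng = random.Random(seed)                 # local RNG with a fixed seed
    return sorted(rng.sample(ids, k))

doc_ids = [f"letter_{i:03d}" for i in range(200)]  # hypothetical ids
overlap = pick_overlap(doc_ids)
print(len(overlap), "of", len(doc_ids), "documents go to both annotators")
```

Recording the seed and fraction alongside the agreement score lets someone else reconstruct which documents were double-annotated.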

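```python
from sklearn.metrics import cohen_kappa_score

# Token-level labels from two annotators over the overlap set, aligned token by token.
annot_a = ["B-PER", "I-PER", "O", "O", "B-PLACE", "O"]  # toy values
annot_b = ["B-PER", "I-PER", "O", "O", "O", "O"]        # toy values
print(round(cohen_kappa_score(annot_a, annot_b), 3))
```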

If agreement falls below about 0.75, the problem is usually an ambiguous guideline rather than careless annotators: fix the guideline and re-run the overlap before annotating the rest.
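Token-level kappa includes the many non-entity tokens, on which annotators agree almost automatically, so a span-level score is a useful second view. The sketch below computes exact-match F1 between two annotators, assuming each annotation is a `(start, end, label)` tuple of character offsets (the `span_f1` helper and the toy spans are illustrative):

```python
def span_f1(spans_a, spans_b):
    """Exact-match F1 between two annotators' (start, end, label) spans."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # both annotators found nothing: full agreement
    matched = len(a & b)  # spans identical in start, end and label
    precision = matched / len(b) if b else 0.0
    recall = matched / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy spans over one document (illustrative values only).
# The annotators disagree on whether "Sir " belongs inside the first span.
annotator_a = [(0, 16, "PER"), (31, 44, "ORG"), (60, 66, "PLACE")]
annotator_b = [(4, 16, "PER"), (31, 44, "ORG"), (60, 66, "PLACE")]
print(round(span_f1(annotator_a, annotator_b), 3))  # 0.667
```

F1 is symmetric, so it does not matter which annotator is treated as the reference. Exact matching is deliberately strict: a single boundary disagreement counts as a miss for both annotators, which is what exposes an undecided span rule.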

What span rule prevents the most arguments?

Boundary inconsistency, not label choice, is the biggest source of disagreement in historical NER. Decide and document whether honorifics and titles fall inside the span, and apply that rule everywhere. Choosing between "Sir Isaac Newton" and "Isaac Newton" once, and applying the choice consistently, removes more disagreement than any model improvement.
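Once the rule is written down, it can also be enforced mechanically on pre-annotations or exports. A minimal sketch, assuming the guideline places honorifics outside PER spans and that spans carry character offsets (the `HONORIFICS` list and `trim_honorific` helper are hypothetical; a project that keeps titles inside the span would invert the logic):

```python
# Hypothetical honorific list; extend it from your own guideline.
HONORIFICS = ("Sir", "Dame", "Capt.", "Rev.", "Dr.")

def trim_honorific(text, span):
    """Return a copy of span with one leading honorific moved outside it."""
    surface = text[span["start"]:span["end"]]
    for title in HONORIFICS:
        prefix = title + " "
        if span["label"] == "PER" and surface.startswith(prefix):
            # Shift the start offset past the honorific and its trailing space.
            return {**span, "start": span["start"] + len(prefix)}
    return dict(span)

text = "Sir Isaac Newton wrote to Capt. Edmond Halley."
spans = [{"start": 0, "end": 16, "label": "PER"},
         {"start": 26, "end": 45, "label": "PER"}]
for s in spans:
    fixed = trim_honorific(text, s)
    print(text[fixed["start"]:fixed["end"]])
```

This version removes only a single leading honorific and leaves stacked titles such as "Rev. Dr." partly in place, so it is a consistency aid rather than a replacement for the annotator's judgement.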

How do you keep the whole thing reproducible?

Treat annotation like data, not a one-off:

Version the guideline (it is the schema's documentation).
Export to a stable format such as JSONL or CoNLL, with character offsets, never just highlighted text.
Record tool version, label set and the guideline version alongside the export.

```text
release/
  guidelines_v3.md
  labels.txt          # PER ORG PLACE DATE ROLE
  annotations.jsonl   # with start/end char offsets
  agreement_v3.txt    # kappa = 0.84 over 12% overlap
```

Anyone should be able to reload the project and see exactly the decisions you made.
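A quick integrity check before release is to confirm that every span's offsets still reproduce its stored text, and to write the release metadata into one machine-readable record. The sketch below assumes records shaped like the JSONL lines from the pre-annotation step; the `check_offsets` and `build_manifest` helpers, the manifest field names and the `"x.y.z"` version placeholder are illustrative assumptions.

```python
import json

def check_offsets(record):
    """Return the spans whose offsets do not reproduce their stored text."""
    text = record["text"]
    return [s for s in record["spans"]
            if text[s["start"]:s["end"]] != s["text"]]

def build_manifest(tool, tool_version, labels, guideline_version):
    """Collect the release metadata in one JSON-serialisable dict."""
    return {"tool": tool, "tool_version": tool_version,
            "labels": sorted(labels), "guideline_version": guideline_version}

record = {
    "id": "doc1",
    "text": "Isaac Newton lived in London.",
    "spans": [
        {"start": 0, "end": 12, "label": "PER", "text": "Isaac Newton"},
        # Deliberately wrong end offset: text[22:27] is "Londo".
        {"start": 22, "end": 27, "label": "PLACE", "text": "London"},
    ],
}
print(check_offsets(record))  # lists the PLACE span, whose end offset is off by one

# "x.y.z" is a placeholder; record the version you actually ran.
manifest = build_manifest("INCEpTION", "x.y.z",
                          ["PER", "ORG", "PLACE", "DATE", "ROLE"], "v3")
print(json.dumps(manifest, indent=2))
```

Offset drift usually appears when the text was normalised (whitespace, line endings, Unicode form) after annotation, so running the check on the exact files being released is what makes it meaningful.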

What are the most common efficiency mistakes?

The most common mistakes are cold-labelling without pre-annotation; writing a 20-page guideline before any pilot; working without keyboard shortcuts (mouse-only labelling is brutally slow); and skipping the agreement check, which means you discover ambiguity only after the whole corpus is done, when it is too late to fix.

Key Takeaways

Pre-annotate with a model or gazetteer so annotators correct rather than label from scratch.
Choose one tool and commit; INCEpTION suits scholarly work that links to an authority.
Freeze a one-page guideline before the first batch, then refine after a pilot.
Double-annotate a 10-15% overlap and measure agreement; below ~0.75 the guideline is the problem.
Span-boundary consistency (titles in or out) prevents more disputes than any label decision.
Export with character offsets in a stable format and version the guideline for reproducibility.

Frequently Asked Questions

Which tool should I use to annotate historical entities?

For most projects, an open-source span annotator like Label Studio, INCEpTION or doccano covers the need. INCEpTION suits scholarly work because it supports entity linking to an authority and recommender-assisted pre-annotation out of the box.

How do I make annotation faster without losing quality?

Pre-annotate with a model or gazetteer so annotators correct rather than label from scratch, fix keyboard shortcuts for your label set, and freeze a written guideline early. Correction is several times faster than cold labelling.

How long a guideline do I need before starting?

A short one: about one page in total, with a definition for each entity type and two or three borderline examples each, plus explicit rules for spans, titles and abbreviations. Refine it after a pilot round rather than trying to anticipate everything.

Should two people annotate the same documents?

For any dataset you will publish or train on, yes. Double-annotate at least a 10-15% overlap so you can measure inter-annotator agreement and catch guideline ambiguities before they contaminate the whole corpus.

What span boundary rule avoids most disputes?

Decide once whether titles and honorifics are inside the span ('Sir Isaac Newton' vs 'Isaac Newton') and write it down. Boundary inconsistency, not label choice, is the largest source of annotator disagreement in historical NER.

How do I keep annotation reproducible?

Version the guideline, export annotations in a stable format like JSONL or CoNLL with character offsets, and record tool version and label schema. Anyone should be able to reload your project and see exactly the decisions you made.
