Text Mining & Corpora

To annotate a corpus, you attach structured labels — people, places, dates, sentiment, whatever your question needs — to spans of text so those categories become machine-readable, without changing the underlying words. The work that determines success is not the labelling itself but the scheme: clear, exclusive categories with written definitions and edge-case rules, tested on a small sample and checked for agreement between annotators before you scale. Start with a few hundred examples, measure agreement, then grow.

Annotation adds an interpretive layer on top of text. The text stays fixed; you mark which parts mean what according to a scheme you define. Done well, it turns a corpus into training data and evidence; done carelessly, it bakes inconsistency into everything built on it.

What exactly am I adding when I annotate?

You are tagging spans of text with labels. A small worked example for named entities:

text
On 4 June 1789, Mr Reed sailed from Bristol to Boston.
   [DATE]       [PERSON]            [PLACE]    [PLACE]

In standoff form, labels are stored separately as character offsets (counted from 0, with the end offset exclusive), leaving the text untouched. The example becomes:

text
T1  DATE    3 14    4 June 1789
T2  PERSON  16 23   Mr Reed
T3  PLACE   36 43   Bristol
T4  PLACE   47 53   Boston

The text file never changes; the annotation file points into it. This separation is what lets you re-annotate or share text and labels independently.
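Offsets are easy to get wrong by hand, so it can help to compute and check them in code. The following is a minimal sketch, assuming 0-based offsets with an exclusive end (the convention Python slicing uses); the helper names `to_standoff` and `bad_offsets` are hypothetical, not part of any annotation tool.

```python
text = "On 4 June 1789, Mr Reed sailed from Bristol to Boston."

# (label, surface string) pairs in reading order; illustrative input.
mentions = [("DATE", "4 June 1789"), ("PERSON", "Mr Reed"),
            ("PLACE", "Bristol"), ("PLACE", "Boston")]

def to_standoff(text, mentions):
    """Build (id, label, start, end, surface) records by locating each mention."""
    records, cursor = [], 0
    for i, (label, surface) in enumerate(mentions, start=1):
        start = text.index(surface, cursor)   # raises ValueError if absent
        end = start + len(surface)            # exclusive end, like a slice
        records.append((f"T{i}", label, start, end, surface))
        cursor = end
    return records

def bad_offsets(text, records):
    """Return ids whose offsets do not select their recorded surface string."""
    return [rid for rid, _, start, end, surface in records
            if text[start:end] != surface]

records = to_standoff(text, mentions)
for rid, label, start, end, surface in records:
    print(f"{rid}\t{label} {start} {end}\t{surface}")
print(bad_offsets(text, records))   # an empty list means every offset checks out
```

Running a check like `bad_offsets` whenever the text or the annotation file is edited catches offsets that have drifted away from the words they were meant to mark.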

How do I design a scheme that holds up?

A scheme is a written document, not a vibe. The good ones share four traits:

  1. Mutually exclusive categories — a span gets exactly one label.
  2. Definitions with examples — what counts as a PLACE, with cases.
  3. Edge-case rules — is "the Crown" an ORG or a PERSON? Decide once.
  4. A default for the unclear — an explicit UNSURE label beats silent guessing.

Write the scheme before you annotate, then revise it the first time reality breaks it — and re-annotate the earlier examples to match.
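Part of a scheme can also be enforced mechanically. The sketch below is one possible approach, not a standard tool: the `SCHEME` dictionary, its one-line definitions, the choice to file "the Crown" under ORG, and the `check_annotations` helper are all illustrative assumptions. It flags labels that are not in the scheme and spans that overlap, which would break the one-label-per-span rule.

```python
# Hypothetical scheme: labels and definitions are illustrative only.
SCHEME = {
    "DATE": "Calendar dates, e.g. '4 June 1789'.",
    "PERSON": "Named individuals, including titles such as 'Mr Reed'.",
    "PLACE": "Named locations, e.g. 'Bristol'.",
    "ORG": "Institutions; in this sketch, 'the Crown' is filed here.",
    "UNSURE": "Explicit default when no definition clearly applies.",
}

def check_annotations(spans, scheme=SCHEME):
    """Return a list of problems found in (start, end, label) spans.

    Offsets are assumed to be end-exclusive. Two kinds of problem are
    reported: labels missing from the scheme, and spans that overlap.
    """
    problems = []
    for start, end, label in spans:
        if label not in scheme:
            problems.append(f"unknown label {label!r} at {start}-{end}")
    furthest = None  # the earlier span that reaches furthest to the right
    for span in sorted(spans):
        if furthest is not None and span[0] < furthest[1]:
            problems.append(f"overlap: {furthest} and {span}")
        if furthest is None or span[1] > furthest[1]:
            furthest = span
    return problems

print(check_annotations([(3, 14, "DATE"), (16, 23, "PERSON")]))
print(check_annotations([(16, 23, "PERSON"), (19, 23, "ORG"), (36, 43, "GPE")]))
```

A check like this only covers what can be tested automatically; whether a span deserves PLACE or ORG still depends on the written definitions.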

Which tool should a beginner use?

Match the tool to the stage:

Tool          Strength                  Good for
Spreadsheet   Zero setup                First 50 examples, scheme testing
doccano       Simple web UI             Solo or small-team labelling
Label Studio  Flexible formats          Mixed tasks, exports to CoNLL/JSON
INCEpTION     Rich linguistic features  Multi-annotator, curation, agreement

Begin in a spreadsheet to debug the scheme cheaply, then graduate to a real tool once you have more than one annotator or need clean exports.

How do I know my annotations are reliable?

Reliability is measured, not assumed. Have two people annotate the same sample independently, then compute inter-annotator agreement:

python
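from sklearn.metrics import cohen_kappa_score

# One label per item from each annotator, in the same item order.
a = ["PLACE", "PERSON", "O", "DATE", "PLACE"]
b = ["PLACE", "PERSON", "O", "DATE", "O"]
# Cohen's kappa: observed agreement corrected for chance agreement.
print(round(cohen_kappa_score(a, b), 2))   # 0.74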

As a rough guide, kappa above 0.8 is strong, 0.6–0.8 is workable, and below 0.6 signals that your categories are confusing people — fix the scheme, do not push on. Disagreements are a diagnostic, telling you exactly which definitions need sharpening.
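To see where a kappa score comes from and which labels cause the disagreement, the calculation can be written out with the standard library alone. This is a minimal sketch of the usual two-annotator formula, kappa = (observed - expected) / (1 - expected); the function names `cohen_kappa` and `disagreements` are hypothetical helpers, and the label lists are the toy example from above.

```python
from collections import Counter

a = ["PLACE", "PERSON", "O", "DATE", "PLACE"]
b = ["PLACE", "PERSON", "O", "DATE", "O"]

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label lists.

    Undefined (division by zero) when expected agreement is 1,
    i.e. when both annotators only ever use one and the same label.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

def disagreements(a, b):
    """Count each (annotator A label, annotator B label) pair that differs."""
    return Counter((x, y) for x, y in zip(a, b) if x != y)

print(round(cohen_kappa(a, b), 2))   # 0.74
print(disagreements(a, b))           # Counter({('PLACE', 'O'): 1})
```

Here observed agreement is 4/5 and chance agreement is 6/25, giving roughly 0.74. On a real sample, the most frequent pairs in the disagreement count point to the definitions that need attention first.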

How much should I annotate, and in what order?

Annotate in waves, not one marathon. A sensible sequence:

  1. Label ~100–200 examples and measure agreement.
  2. Revise the scheme on what broke; re-label the sample.
  3. Once agreement is acceptable, scale to the full target.

It is far cheaper to discover that PERSON and ORG blur together at 200 examples than at 5,000. Early measurement is the whole point of the small first wave.
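The wave sequence can be tracked with a simple log of each round. The sketch below is an illustration under stated assumptions: the 0.6 cut-off is borrowed from the rough kappa guide and is not a fixed rule, the function names are hypothetical, and the two rounds logged are made-up numbers rather than real results.

```python
import csv
import io

def next_step(kappa, threshold=0.6):
    """Suggest what to do after a round, given that round's kappa."""
    return "scale up" if kappa >= threshold else "revise scheme and re-label"

def log_round(handle, round_no, n_examples, kappa):
    """Append one round to a CSV log and return the suggested next step."""
    step = next_step(kappa)
    csv.writer(handle).writerow([round_no, n_examples, f"{kappa:.2f}", step])
    return step

log = io.StringIO()            # stands in for an open CSV file
log_round(log, 1, 150, 0.52)   # hypothetical scores, not real results
log_round(log, 2, 150, 0.74)
print(log.getvalue())
```

Keeping one row per round makes it easy to see whether scheme revisions are actually improving agreement before committing to the full target.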

How do I store annotations so they survive?

Use an open, documented format and keep the scheme with the data:

text
project/
  texts/            raw text, read-only
  annotations/      standoff or CoNLL files
  scheme.md         category definitions + edge cases
  agreement.csv     kappa scores per round

Standoff offsets, CoNLL columns and inline TEI are all durable, tool-independent choices. A proprietary tool's database as your only copy is a preservation risk — always export to an open format.
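Moving between these formats is usually a small script. The following is a rough sketch of turning character-offset spans into CoNLL-style token and tag columns using the common B-/I-/O tagging convention; the regex tokenizer and the `spans_to_bio` helper are simplifying assumptions, and a real project would use the tokenization its tools expect.

```python
import re

text = "On 4 June 1789, Mr Reed sailed from Bristol to Boston."

# Illustrative mentions; spans are located in the text rather than typed by hand.
mentions = [("DATE", "4 June 1789"), ("PERSON", "Mr Reed"),
            ("PLACE", "Bristol"), ("PLACE", "Boston")]
spans = [(text.index(s), text.index(s) + len(s), label) for label, s in mentions]

def spans_to_bio(text, spans):
    """Return (token, tag) pairs; spans are (start, end, label), end-exclusive.

    Tokens are runs of word characters or single punctuation marks.
    A token is tagged only if it lies entirely inside a span.
    """
    rows = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if start <= match.start() and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows

for token, tag in spans_to_bio(text, spans):
    print(f"{token}\t{tag}")
```

Because the offsets remain the source of truth, the column file can be regenerated at any time if the tokenization changes.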

Key Takeaways

  • Annotation attaches machine-readable labels to spans without changing the text.
  • The scheme — clear, exclusive, well-defined categories — decides success.
  • Standoff offsets keep text and labels independent and reusable.
  • Start in a spreadsheet, move to doccano, Label Studio or INCEpTION as you scale.
  • Measure inter-annotator agreement (kappa) and treat low scores as a scheme problem.
  • Annotate in small waves first, then store in an open, documented format.

Frequently Asked Questions

What does it mean to annotate a corpus?

Annotating a corpus means adding structured labels to the text — marking which spans are people, places, dates, or which sentences are positive — so the categories you care about become machine-readable. The text stays the same; you attach an interpretive layer on top.

Do I need special software to start annotating?

No. You can begin with a spreadsheet or a plain-text scheme, then move to a dedicated tool like Label Studio, INCEpTION or doccano when consistency and multiple annotators matter. The method matters more than the tool at the start.

What makes a good annotation scheme?

Clear, mutually exclusive categories; written definitions with examples and edge cases; and a rule for every ambiguous situation you can foresee. A scheme you can apply the same way on a Tuesday and a Friday is a good scheme.

What is inter-annotator agreement and why does it matter?

It measures how consistently two people independently apply the same label to the same span. It is usually reported as Cohen's kappa, which corrects the raw agreement rate for the agreement you would expect by chance. It matters because low agreement means your categories are unclear, so any model or count built on the labels is unreliable.

How much should I annotate before training a model?

Start small — a few hundred examples — to test the scheme and measure agreement, then scale up. It is far cheaper to fix a confused category at 200 examples than to discover it after annotating 5,000.

How do I store annotations so they last?

Use an open, documented format such as standoff annotations (offsets in a separate file), CoNLL, or inline TEI, and keep the scheme and version alongside the data. Avoid proprietary tool formats as your only copy.