Troubleshooting: Crowdsource tagging and classification

Q: Why do my volunteers disagree on tags so often?

High disagreement almost always traces to a vague or overlapping tag vocabulary, not to careless volunteers. Tighten definitions, add examples, and remove categories that mean the same thing in practice.

Q: What inter-annotator agreement score is good enough?

For loose subjective tags, Krippendorff's alpha above 0.67 is workable and above 0.80 is strong. For factual classifications you should expect 0.80 or higher; lower means the task or guidance is broken.

Q: How many people should tag each item?

Three independent taggers is the practical default for classification. It lets you take a majority vote and flags genuine ambiguity when all three disagree.

Q: Why is one tag chosen far more than the others?

A dominant tag usually signals it is the default or first option, or that volunteers use it as an escape hatch. Reorder options, remove an obvious default, and add a clear "unsure" path.

Q: Should I let volunteers add free-text tags?

Allow free text only as a supplementary field alongside a controlled list. Free-form tagging alone produces sparse, inconsistent vocabularies that need heavy reconciliation later.

Q: How do I fix tags that are already wrong at scale?

Cluster the existing tags in OpenRefine, map the variants to canonical terms, and re-expose only the genuinely ambiguous items for a second pass rather than re-tagging everything.

When crowdsourced tagging and classification goes wrong, the symptoms are predictable: volunteers disagree wildly, one tag swamps the others, or your final dataset is a mess of near-synonyms. The root cause is almost never careless people — it is an ambiguous vocabulary, a poorly ordered interface, or missing examples. Fix the task design first, measure inter-annotator agreement to confirm, and only then reconcile the data you already have.

Why do volunteers disagree on tags so often?

Disagreement is diagnostic. Pull your task and ask: could two careful, well-meaning people legitimately choose different tags for the same item? If yes, the vocabulary is the bug. Overlapping categories like "portrait" and "person," or subjective ones like "interesting," guarantee scatter. Before blaming contributors, run a small audit: take 30 items, have three people tag them, and look at where they split. The splits cluster around a handful of ambiguous categories nearly every time.
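The audit described above can be tallied with a few lines of Python. This is a minimal sketch, not a prescribed tool: the item ids, the tags, and the `split_pairs` helper are all made up for illustration. It counts how often each pair of tags was chosen for the same item, so the most confusable categories rise to the top.

```python
from collections import Counter
from itertools import combinations

# Hypothetical audit data: item id -> the tags chosen by three auditors.
audit_votes = {
    "item-01": ["portrait", "person", "portrait"],
    "item-02": ["landscape", "landscape", "landscape"],
    "item-03": ["person", "portrait", "person"],
    "item-04": ["ship", "ship", "landscape"],
}

def split_pairs(votes_by_item):
    """Count how often each unordered pair of tags was chosen for the same item."""
    pairs = Counter()
    for tags in votes_by_item.values():
        # Every distinct pair of tags seen on one item counts as one split.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

for (a, b), n in split_pairs(audit_votes).most_common():
    print(f"{a} vs {b}: split on {n} item(s)")
```

Pairs that appear repeatedly, such as "person" and "portrait" in this toy data, are the first candidates for a tighter definition or a merge.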

How do I measure whether agreement is acceptable?

Quantify it instead of guessing. Krippendorff's alpha handles multiple raters and missing data, which fits crowdsourcing well.

```python
import krippendorff

# rows = raters, columns = items; use np.nan (NumPy) for "not seen"
reliability = [
    [1, 2, 1, 3, 2],
    [1, 2, 2, 3, 2],
    [1, 1, 1, 3, 2],
]
alpha = krippendorff.alpha(reliability_data=reliability,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.2f}")
```

Read it against task type:

| Task type | Acceptable alpha | Action below threshold |
| --- | --- | --- |
| Factual classification | `>= 0.80` | Rewrite guidance, add examples |
| Subjective tags | `0.67 - 0.80` | Narrow vocabulary, accept some noise |
| Any task | `< 0.67` is not usable | Redesign the task before continuing |

A low alpha is not a reason to throw out volunteers — it is a signal your instrument needs repair.
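If you re-measure alpha regularly, it can help to encode the thresholds above once so every report reads them the same way. The function below is a sketch under assumptions: the name `read_alpha` and the `"factual"` / `"subjective"` labels are invented for this example, and the cut-offs are simply the ones stated in this article.

```python
def read_alpha(alpha, task_type):
    """Map an alpha score to the guidance given in this article.

    task_type is "factual" or "subjective" (labels chosen for this sketch).
    """
    if alpha < 0.67:
        return "not usable: redesign the task before continuing"
    if alpha >= 0.80:
        return "acceptable"
    # From here on, alpha is between 0.67 and 0.80.
    if task_type == "factual":
        return "below threshold: rewrite guidance, add examples"
    return "workable: narrow vocabulary, accept some noise"

print(read_alpha(0.72, "factual"))
print(read_alpha(0.72, "subjective"))
```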

Why is one tag chosen far more than all the others?

A wildly skewed distribution has three usual causes. The tag is the first or default option, so satisficing volunteers pick it. There is no honest "unsure" route, so people dump uncertainty into the safest-looking label. Or the tag is genuinely over-broad and absorbs everything. The fixes are cheap: randomise option order, never pre-select a value, and add an explicit "can't tell / not applicable" choice so ambiguity becomes data instead of contamination.
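Both the diagnosis and the fix can be sketched briefly. The code below is illustrative only: `options_for_volunteer`, `dominant_tags`, and the 0.5 share cut-off are hypothetical choices, and pinning the unsure option to the end of the list is a design assumption rather than something the tagging platform dictates.

```python
import random

UNSURE = "can't tell / not applicable"

def options_for_volunteer(tags, rng=random):
    """Return the tag options in random order, with the unsure choice pinned last.

    No option is marked as selected; the form should start empty.
    """
    shuffled = list(tags)
    rng.shuffle(shuffled)
    return shuffled + [UNSURE]

def dominant_tags(vote_counts, threshold=0.5):
    """List tags whose share of all votes exceeds threshold (a hypothetical cut-off)."""
    total = sum(vote_counts.values())
    if total == 0:
        return []
    return [tag for tag, n in vote_counts.items() if n / total > threshold]

print(options_for_volunteer(["portrait", "landscape", "ship"]))
print(dominant_tags({"portrait": 70, "landscape": 20, "ship": 10}))
```

Shuffling per volunteer spreads any first-position bias evenly across tags, so a tag that still dominates afterwards is more likely to be over-broad than merely well placed.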

What does a robust tagging workflow look like?

Build redundancy in from the start rather than retrofitting it:

- Assign each item to three independent taggers.
- Take the majority tag when at least two of three agree.
- Route items where all three disagree to an expert queue, not the bin.
- Log every raw vote so you can recompute consensus if rules change.
- Periodically re-measure alpha as the corpus and volunteer pool evolve.

This turns disagreement into a routing signal: easy items resolve automatically, and human review concentrates only on the genuinely hard cases.
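The routing rule above is small enough to sketch directly. The function name `resolve`, the route labels, and the sample votes are assumptions made for this example; the logic is just the two-of-three majority described in the list.

```python
from collections import Counter

def resolve(votes, min_agree=2):
    """Return (tag, "auto") when enough taggers agree, else (None, "expert_queue")."""
    if not votes:
        return None, "expert_queue"
    tag, count = Counter(votes).most_common(1)[0]
    if count >= min_agree:
        return tag, "auto"
    return None, "expert_queue"

# Hypothetical raw vote log: item id -> one tag per independent tagger.
raw_votes = {
    "item-01": ["ship", "ship", "boat"],
    "item-02": ["portrait", "person", "landscape"],
}

# Consensus is derived from the log, so it can be recomputed if the rule changes.
consensus = {item: resolve(votes) for item, votes in raw_votes.items()}
expert_queue = [item for item, (_, route) in consensus.items()
                if route == "expert_queue"]
print(consensus)
print(expert_queue)
```

With three taggers and `min_agree=2` a tie at the top is impossible. If you assign more taggers per item, revisit `min_agree` so that two competing tags cannot both reach it.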

How do I fix tags that are already wrong at scale?

Do not re-tag everything. Export the votes and reconcile:

1. Load the tags into OpenRefine and use key-collision clustering to merge surface variants like ship, Ship, and "ship." into one canonical term. Strings that differ in wording, such as sailing ship, will not cluster with ship automatically and need an explicit mapping decision.
2. Map free-text contributions onto your controlled vocabulary with a crosswalk table.
3. Recompute consensus from the raw votes, not the messy aggregated field.
4. Re-expose only the still-ambiguous items for a focused second pass.

This recovers most of the value of existing work and reserves human effort for items that actually need it.
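To see what key-collision clustering does before you run it on real data, here is a rough Python approximation of a fingerprint-style key. It is a sketch, not OpenRefine's own code: the `fingerprint` and `cluster` helpers, the sample tags, and the crosswalk entries are all illustrative. In this sketch only strings that normalise to the same key cluster automatically; folding `sailing ship` into `ship` is a crosswalk decision made by a person.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Approximate a fingerprint key: fold case, accents, punctuation and token order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values that share a key; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}

tags = ["ship", "Ship", " ship.", "sailing ship", "Ship, sailing", "schooner"]
print(cluster(tags))

# Hypothetical crosswalk: fingerprint key -> canonical term in the controlled list.
crosswalk = {"ship": "ship", "sailing ship": "ship", "schooner": "ship"}

def canonical(tag):
    """Return the canonical term for a raw tag, or None if it still needs review."""
    return crosswalk.get(fingerprint(tag))

print([canonical(t) for t in tags])
```

Tags for which `canonical` returns `None` are the ones worth re-exposing for a second pass.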

When should you allow free-text tags?

Free text is tempting and almost always backfires as a primary mechanism: you get hundreds of one-off spellings and no comparability. Allow it only as a supplementary field beside a controlled list, where it serves as a suggestion box for vocabulary you missed — review those suggestions periodically and promote the good ones into the official list.
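The periodic review can start from a simple tally. The sketch below assumes free-text entries are available as a plain list of strings; `promotion_candidates` and the `min_count` cut-off are hypothetical, and a person still decides what actually joins the official list.

```python
from collections import Counter

def promotion_candidates(free_text, controlled, min_count=3):
    """Return (term, count) pairs suggested at least min_count times
    that are not already in the controlled list."""
    known = {term.strip().lower() for term in controlled}
    counts = Counter(t.strip().lower() for t in free_text if t.strip())
    return [(term, n) for term, n in counts.most_common()
            if n >= min_count and term not in known]

suggestions = ["Lighthouse", "lighthouse ", "lighthouse", "ship", "buoy", ""]
print(promotion_candidates(suggestions, ["ship", "landscape"]))
```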

Key Takeaways

- Most tagging disagreement is a vocabulary problem, not a volunteer problem.
- Measure inter-annotator agreement with Krippendorff's alpha and act on it.
- Three independent taggers plus majority vote is a solid default.
- Randomise option order and always offer an explicit "unsure" path.
- Reconcile existing tags in OpenRefine instead of re-tagging from scratch.
- Keep raw votes so you can recompute consensus when rules change.
- Treat free text as a supplement to a controlled list, never a replacement.

