Text Mining & Corpora

Build a concordance when your question is about how particular words or phrases are used — their senses, collocates and shifts over time — and your corpus is too large to read in full but the term itself appears a manageable number of times. Avoid it when the question is about whole-document patterns (reach for topic modelling or classification instead) or when the term occurs so often that the KWIC view is unreadable. KWIC is a precision instrument for word-level questions, not a general survey tool.

A keyword-in-context concordance aligns every occurrence of a term with a span of words on each side, so you can read usage across thousands of pages without opening every document. The decision to build one is really a decision about whether your question is word-shaped.

When does a concordance genuinely fit the question?

KWIC earns its place when all three hold:

  • The question is about specific words or phrases, not document themes.
  • The corpus is too large to close-read in full.
  • The target term's instances number in the hundreds, not tens of thousands.

Classic fits: tracing how "commerce" shifts sense across a century, checking whether "liberty" collocates with "property" or "conscience", or auditing every mention of a place name. These are precisely the questions no whole-document method answers well.

When should I reach for a different method instead?

| Your question | Better tool than KWIC |
|---|---|
| What themes run through these documents? | Topic modelling |
| Which documents are about trade? | Text classification |
| What words define this period? | Keyword analysis |
| How long are these texts, by decade? | Plain metadata stats |
| Where does term X appear in context? | KWIC concordance |

If the answer to your question is a property of whole documents, a concordance forces you to read word-by-word what you could have aggregated — a poor use of attention.

How wide should the context window be?

Five to ten words each side covers most needs. A worked minimal KWIC in Python:

python
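import string

def kwic(tokens, term, span=7):
    """Return (left, keyword, right) tuples for every match of term."""
    target = term.lower()
    hits = []
    for i, tok in enumerate(tokens):
        # Strip surrounding punctuation so "Liberty," still matches "liberty".
        if tok.strip(string.punctuation).lower() == target:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            hits.append((left, tok, right))
    return hits

# Whitespace tokenisation keeps the example short.
with open("derived/corpus.txt", encoding="utf-8") as f:
    text = f.read().split()

for l, k, r in kwic(text, "liberty")[:20]:
    # Show only the last 50 characters of left context so keywords stay aligned.
    print(f"{l[-50:]:>50} | {k} | {r}")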

Widen the span when you study argument or syntax; narrow it to the immediate neighbours when you only care about collocates.
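When collocates are all you need, you can skip the printed lines and tally the neighbours directly. The sketch below is one way to do that, assuming whitespace tokens and a two-word window; the `collocates` helper and the sample sentence are illustrative, not part of any library.

```python
from collections import Counter
import string

def collocates(tokens, term, span=2):
    """Count words appearing within `span` tokens of each match of `term`."""
    # Normalise once: lower-case and strip surrounding punctuation.
    words = [t.strip(string.punctuation).lower() for t in tokens]
    target = term.lower()
    counts = Counter()
    for i, w in enumerate(words):
        if w == target:
            window = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
            counts.update(x for x in window if x)  # skip empty tokens
    return counts

# Invented sample sentence, for illustration only.
sample = "Liberty and property were defended; liberty of conscience was not.".split()
counts = collocates(sample, "liberty", span=2)
print(counts.most_common(3))
```

Raw counts like these favour common function words, so on a real corpus you would probably filter stopwords or compare against a baseline before reading much into them.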

Do concordances survive dirty OCR and spelling variation?

Only as far as your search terms reach. KWIC finds exactly the strings you ask for, so labour, labor and labovr (an OCR slip) are three separate searches. On uncleaned text a concordance silently under-counts:

python
variants = ["liberty", "libertie", "libertye"]
all_hits = [h for v in variants for h in kwic(text, v)]

Normalise spelling first, or search the variant set explicitly, before you say anything about how often a term appears.
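Normalising first can be as simple as mapping known variants to one canonical form before you search. The following is a minimal sketch under that assumption: the `VARIANTS` map and the sample sentence are hypothetical, and a real map would be built by inspecting your own corpus.

```python
import string

# Hypothetical variant map; compile yours from the corpus, not from this list.
VARIANTS = {
    "libertie": "liberty",
    "libertye": "liberty",
    "labor": "labour",
    "labovr": "labour",  # OCR slip
}

def normalise(tokens, variants=VARIANTS):
    """Lower-case, strip surrounding punctuation, and map variants to one form."""
    out = []
    for tok in tokens:
        key = tok.strip(string.punctuation).lower()
        out.append(variants.get(key, key))
    return out

# Invented sample sentence, for illustration only.
raw = "Libertie is sweet, and libertye dear; liberty costs labovr.".split()
clean = normalise(raw)
print(raw.count("liberty"), clean.count("liberty"))
```

Keep the raw tokens alongside the normalised ones if the original spelling matters to your argument, since this mapping discards it.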

What is the cost of building one, and is it worth it?

KWIC is cheap to compute and cheap to read for a few hundred lines: minutes of work in AntConc, Voyant or a short Python function. The real cost is interpretive time, because someone has to read the lines. That cost scales linearly with hits, which is exactly why a 40,000-instance term defeats the method. Estimate hits first with a frequency count; if the total is more than you can read, aggregate or sample before building the concordance.
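The count-then-sample step can be sketched as follows. The `limit` of 500 is an assumed reading budget rather than a fixed rule, and both helper names are illustrative; the fixed seed only makes the sample reproducible.

```python
import random
import string

def hit_positions(tokens, term):
    """Return the index of every token that matches `term`."""
    target = term.lower()
    return [i for i, t in enumerate(tokens)
            if t.strip(string.punctuation).lower() == target]

def readable_sample(positions, limit=500, seed=0):
    """Return all positions if few enough, else a reproducible random sample."""
    if len(positions) <= limit:
        return positions
    rng = random.Random(seed)
    return sorted(rng.sample(positions, limit))

# Synthetic tokens standing in for a corpus; "the" plays the over-frequent term.
tokens = ("the trade of the town and the liberty of the port " * 100).split()
pos = hit_positions(tokens, "the")
chosen = readable_sample(pos, limit=50)
print(len(pos), len(chosen))
```

A simple random sample treats every hit alike. If usage over time is the question, sampling within each decade or source may serve better than sampling across the whole corpus.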

What does a concordance fundamentally miss?

It is term-driven, so it only ever shows you what you already thought to search for. A pattern carried by vocabulary you did not anticipate is invisible to KWIC. The standard remedy is to run a keyword or frequency pass first to discover the interesting terms, then concordance those — letting the data, not your prior assumptions, choose the search words.
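A discovery pass can start as a plain frequency list. The sketch below assumes raw counts with a tiny illustrative stopword list; a proper keyword analysis would compare against a reference corpus, which this example does not attempt.

```python
from collections import Counter
import string

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "was"}

def top_terms(tokens, n=10, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword terms as (word, count) pairs."""
    words = (t.strip(string.punctuation).lower() for t in tokens)
    counts = Counter(w for w in words if w and w not in stopwords)
    return counts.most_common(n)

# Invented sample sentence, for illustration only.
tokens = "The commerce of the port and the commerce of the town; trade is liberty.".split()
candidates = top_terms(tokens, n=3)
print(candidates)
```

Terms that surface here and that you had not planned to search are the ones worth concordancing next.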

Key Takeaways

  • Build a concordance for word-level questions on corpora too big to read whole.
  • Avoid it for document-level patterns and for terms with tens of thousands of hits.
  • A five-to-ten-word window suits most analyses; adjust to the question.
  • KWIC only finds the spellings you search, so normalise or search variants first.
  • Computation is cheap; the binding cost is human reading time, which scales with hits.
  • Pair KWIC with keyword and frequency methods so you study terms you would not have guessed.

Frequently Asked Questions

What is a KWIC concordance?

A keyword-in-context (KWIC) concordance lists every occurrence of a search term with a fixed span of words on each side, aligned on the term. It lets you read how a word is actually used across a corpus without reading every document in full.

When is a concordance the right tool?

Use it when your question is about how specific words or phrases are used in context — meaning, collocates, sense changes — and your corpus is large enough that reading every instance by hand is impractical but the term count is still human-readable.

When should I not build a concordance?

Skip it when your question is about whole-document patterns rather than specific words (use topic modelling or classification), or when a term appears tens of thousands of times so the concordance is too large to read. In those cases aggregate first.

How wide should the context window be?

Five to ten words each side is the usual range: enough to see grammatical context without overwhelming the eye. Widen it when you study argument structure, narrow it when you only need immediate collocates.

Do concordances work on dirty OCR text?

Partially. KWIC will only find the spellings you search for, so OCR errors and spelling variants hide instances and bias your view. Normalise spelling or search variant forms before drawing conclusions about frequency.

What is the main limitation of concordance analysis?

It is term-driven, so it only shows you what you already thought to search for and can miss patterns expressed in words you did not anticipate. Pair it with frequency and keyword methods that surface terms you would not have guessed.