Cultural Analytics

To measure novelty and influence in a corpus, represent each dated document as a feature distribution (usually a topic mixture), then compute how far it diverges from a backward window of earlier documents (novelty) and how far later documents move toward it (influence). Use Jensen-Shannon divergence, test several window sizes, and treat the scores as leads to investigate rather than proof of causation. Precise dating is the non-negotiable prerequisite — everything else is tuning.

Measuring novelty and influence turns a vague intuition ("this work was ahead of its time") into a reproducible number. The framework, developed by Barron, Klingenstein and others, defines novelty as a document's divergence from its past and resonance as whether its future converges toward it. It works on any dated, ordered corpus.

Step 1: How do I represent each document?

Reduce every document to a comparable probability distribution. A topic model is the usual choice: each document becomes a vector of topic proportions summing to one.

python
# Each row: document topic distribution from a fitted LDA model
import numpy as np
theta = np.load("doc_topic_matrix.npy")   # shape: (n_docs, n_topics)
dates = np.load("doc_dates.npy")           # ordered timestamps
order = np.argsort(dates)
theta, dates = theta[order], dates[order]

Whatever representation you pick — topics, word vectors, character n-grams — it must be consistent across the whole corpus and capture what you mean by "content".
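Before computing anything, it can help to confirm that the matrix really is a set of probability distributions in date order. The following is a minimal sketch under assumptions: `check_representation` and the toy arrays are hypothetical names introduced here for illustration, not part of any library.

```python
import numpy as np

def check_representation(theta, dates, tol=1e-6):
    """Validate a document-topic matrix and return it sorted by date."""
    theta = np.asarray(theta, dtype=float)
    dates = np.asarray(dates)
    if theta.ndim != 2 or len(theta) != len(dates):
        raise ValueError("theta must be 2-D with one row per dated document")
    if not np.all(np.isfinite(theta)) or np.any(theta < 0):
        raise ValueError("topic proportions must be finite and non-negative")
    row_sums = theta.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("at least one document has an all-zero row")
    if np.any(np.abs(row_sums - 1.0) > tol):
        theta = theta / row_sums  # renormalise so every row sums to one
    order = np.argsort(dates, kind="stable")  # stable sort: ties keep input order
    return theta[order], dates[order]

# Toy input (made up): three documents, two topics, out of date order
toy_theta = np.array([[2.0, 2.0], [0.9, 0.1], [0.3, 0.7]])
toy_dates = np.array([1791, 1789, 1790])
theta_ok, dates_ok = check_representation(toy_theta, toy_dates)
```

The stable sort is a deliberate choice: when two documents share a timestamp, their relative order stays whatever it was in the input rather than changing between runs.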

Step 2: How do I compute novelty?

Novelty is the average divergence between a document and the window of documents that precede it. Use Jensen-Shannon divergence, which is symmetric and bounded:

python
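import numpy as np
from scipy.spatial.distance import jensenshannon

def novelty(i, theta, w):
    """Mean JS divergence between document i and the w documents before it."""
    past = theta[max(0, i-w):i]
    if len(past) == 0:
        return np.nan   # the first document has no past to compare against
    # jensenshannon returns the JS distance (a square root); square it for the divergence
    return np.mean([jensenshannon(theta[i], p)**2 for p in past])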

A high novelty score means the document looks unlike everything in its recent past. KL divergence is the classic alternative but is asymmetric and breaks on zero probabilities, so JSD is the safer default.
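The contrast between the two measures is easy to see on a pair of toy distributions. This sketch uses hand-written NumPy versions of both divergences (the names `kl` and `jsd` are illustrative, and the vectors are made up) so that the zero-probability case is visible rather than hidden inside a library call.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; infinite when q is zero somewhere p is not."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p == 0 contribute nothing
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence in nats: mean KL to the midpoint mixture."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])   # never uses the third topic
q = np.array([0.1, 0.1, 0.8])   # mostly the third topic

print("KL(p||q):", kl(p, q))    # finite
print("KL(q||p):", kl(q, p))    # infinite: p gives zero mass to topic 3
print("JSD     :", jsd(p, q))   # finite and the same in both directions
```

With natural logarithms the Jensen-Shannon divergence never exceeds ln 2, which is what makes scores comparable across documents.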

Step 3: How do I compute influence?

Influence (resonance) compares novelty against transience — the divergence between the document and its future window. A work resonates if the future stays close to it after it appeared, rather than reverting.

Quantity | Window | High value means
Novelty | Past | Unlike what came before
Transience | Future | The future diverges away again
Resonance | Novelty − Transience | Novelty that sticks

A novel-but-transient work was a flash; a novel-and-resonant work shifted the corpus. Plot resonance against novelty: the regression slope summarises how strongly novelty translated into lasting influence in your collection.
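The three quantities and the summary slope can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the helper names are hypothetical, the divergence is a hand-written NumPy Jensen-Shannon, the window is counted in documents, and the corpus is random Dirichlet draws standing in for real topic mixtures.

```python
import numpy as np

def jsd_rows(p, Q):
    """Jensen-Shannon divergence (nats) between vector p and each row of Q."""
    def kl_rows(A, B):
        # Treat 0 * log(0) as 0; B is a mixture, so B > 0 wherever A > 0
        ratio = np.where(A > 0, A / np.where(B > 0, B, 1.0), 1.0)
        return np.sum(A * np.log(ratio), axis=1)
    P = np.broadcast_to(p, Q.shape)
    M = 0.5 * (P + Q)
    return 0.5 * kl_rows(P, M) + 0.5 * kl_rows(Q, M)

def novelty_transience_resonance(theta, w):
    """Per-document scores; NaN where a full window is missing on either side."""
    n = len(theta)
    nov = np.full(n, np.nan)
    tra = np.full(n, np.nan)
    for i in range(w, n - w):
        nov[i] = jsd_rows(theta[i], theta[i - w:i]).mean()          # past window
        tra[i] = jsd_rows(theta[i], theta[i + 1:i + 1 + w]).mean()  # future window
    return nov, tra, nov - tra

# Synthetic stand-in corpus: 200 documents, 5 topics, no real structure
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(5), size=200)

nov, tra, res = novelty_transience_resonance(theta, w=20)
valid = ~np.isnan(res)
slope, intercept = np.polyfit(nov[valid], res[valid], 1)  # resonance ~ novelty
print("slope of resonance on novelty:", round(float(slope), 3))
```

On random data like this the slope carries no historical meaning; the point is only the mechanics of computing it on the documents that have full windows.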

Step 4: How do I choose and test the window?

The window size silently determines the result, so never use a single arbitrary value. Sweep it and report sensitivity:

python
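# Sweep window sizes (in documents); skip edge documents without a full window
for w in [50, 100, 200, 400]:
    scores = [novelty(i, theta, w) for i in range(w, len(theta)-w)]
    print(w, round(np.nanmean(scores), 4))   # mean novelty at this window size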

If your conclusion flips between windows, you do not have a finding — you have an artefact of a tuning choice. Robustness across windows is what makes the result defensible.
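One way to make "flips between windows" concrete is to ask whether the same documents come out on top at each window size. The sketch below is an assumption-laden illustration: top-10 overlap is just one possible robustness check, the helper names are hypothetical, and the corpus is synthetic. It scores every window on the same range of documents so that the comparison is like for like.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence in nats between two distributions."""
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def novelty(i, theta, w):
    """Mean divergence from the w documents immediately before document i."""
    return np.mean([jsd(theta[i], past) for past in theta[i - w:i]])

rng = np.random.default_rng(1)
theta = rng.dirichlet(np.ones(4), size=120)   # synthetic stand-in corpus
windows = [5, 10, 20]

# Use only documents that have a full window at the largest size
lo, hi = max(windows), len(theta) - max(windows)
scores = {w: np.array([novelty(i, theta, w) for i in range(lo, hi)])
          for w in windows}

def top_k(x, k=10):
    """Positions of the k highest scores."""
    return set(np.argsort(x)[-k:].tolist())

for a, b in [(5, 10), (10, 20), (5, 20)]:
    overlap = len(top_k(scores[a]) & top_k(scores[b])) / 10
    print(f"windows {a} vs {b}: top-10 overlap = {overlap:.1f}")
```

Low overlap between neighbouring window sizes would suggest that a ranking of "most novel" documents depends heavily on the tuning choice.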

What pitfalls undermine novelty and influence work?

  • Coarse or wrong dates mix the past and future windows and destroy the measure.
  • Edge effects: the first and last window-widths of documents have no full comparison set — exclude them.
  • Confounding context: convergence may reflect a shared external event, not influence.
  • Over-reading a single representation — re-run with different features before claiming a result.
  • Causal language the design cannot support; these are correlational signals.
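The dating pitfall can be checked before any scores are computed. When many documents share one timestamp, their order within the tie is arbitrary, so a "past" window may in fact contain contemporaries. The sketch below, with a hypothetical helper `tie_report` and made-up years, summarises how severe that is for a given window size.

```python
import numpy as np

def tie_report(dates, w):
    """Summarise how often timestamps are shared, relative to window size w."""
    dates = np.sort(np.asarray(dates))
    _, counts = np.unique(dates, return_counts=True)
    return {
        "n_docs": int(len(dates)),
        "n_distinct_dates": int(len(counts)),
        "largest_tie": int(counts.max()),
        # Share of documents whose timestamp is shared with at least one other
        "share_in_ties": float(counts[counts > 1].sum() / len(dates)),
        # True if one timestamp alone holds more documents than the window
        "tie_exceeds_window": bool(counts.max() > w),
    }

# Toy dates (made up): year-level precision only
toy_dates = np.array([1789, 1789, 1789, 1790, 1791, 1791])
report = tie_report(toy_dates, w=2)
print(report)
```

If `tie_exceeds_window` is true, some windows consist entirely of documents with the same timestamp as the target, and the backward and forward comparisons for those documents are not meaningfully separated.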

How do I interpret the numbers responsibly?

Treat high-novelty, high-resonance documents as candidates for influential works and go read them in context. The score tells you where to look, not what happened. Pair every quantitative claim with archival evidence of actual transmission — citations, reprints, correspondence — before asserting that one work influenced another.
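Turning scores into a reading list can be as simple as a quantile filter. The following sketch assumes per-document novelty and resonance arrays with NaN at the edges; the function name `candidates`, the quantile threshold and the toy values are all illustrative choices rather than recommended settings.

```python
import numpy as np

def candidates(novelty, resonance, q=0.9):
    """Indices of documents in the top (1 - q) share on both scores,
    ordered from most to least resonant. NaN scores are ignored."""
    nov = np.asarray(novelty, float)
    res = np.asarray(resonance, float)
    valid = ~np.isnan(nov) & ~np.isnan(res)
    nov_cut = np.quantile(nov[valid], q)
    res_cut = np.quantile(res[valid], q)
    # Replace NaN with -inf so edge documents can never pass the threshold
    nov_f = np.where(valid, nov, -np.inf)
    res_f = np.where(valid, res, -np.inf)
    idx = np.flatnonzero((nov_f >= nov_cut) & (res_f >= res_cut))
    return idx[np.argsort(-res_f[idx])]

# Toy scores (made up); the first and last documents lack full windows
toy_nov = np.array([np.nan, 0.1, 0.5, 0.4, 0.2, np.nan])
toy_res = np.array([np.nan, 0.0, 0.3, -0.2, 0.1, np.nan])
to_read = candidates(toy_nov, toy_res, q=0.5)
print("documents to read in context:", to_read.tolist())
```

The output is a list of places to look, nothing more; the archival work of establishing transmission still has to follow.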

Key Takeaways

  • Novelty is divergence from the past; resonance is whether the future stays close.
  • Represent each dated document as a consistent feature distribution, usually topic mixtures.
  • Use Jensen-Shannon divergence; it is symmetric and stable where KL is not.
  • Resonance equals novelty minus transience — novelty that endures.
  • Sweep the window size and report sensitivity; a single window can manufacture a result.
  • Scores are leads to investigate, never proof of causal influence.

Frequently Asked Questions

How is novelty measured computationally?

Novelty is typically the distance between a document and the documents that preceded it: how unlike the past is this text? Using topic distributions, novelty is the divergence from a backward-looking window, and influence is the convergence of the future window toward it.

What is the difference between novelty and influence?

Novelty looks backward: how different a work is from what came before. Influence looks forward: whether later works stay close to the novel work (resonance) or drift away from it again (transience). A work can be novel without being influential.

Which divergence measure should I use?

Kullback-Leibler divergence is the classic choice for comparing probability distributions like topic mixtures, but it is asymmetric and unstable on zeros. Jensen-Shannon divergence is symmetric, bounded and a safer default for most corpora.

Why must my corpus be precisely dated and ordered?

Because novelty and influence are defined relative to temporal windows. If dates are wrong or coarse, the backward and forward windows mix, and you measure noise. Reliable, fine-grained dating is a hard prerequisite.

How large should the comparison window be?

It depends on your corpus density; common choices span one to several years on each side. Test several window sizes and report how sensitive your results are to the choice, since a single arbitrary window can manufacture a result.
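When the window is defined as a time span rather than a document count, each document's window boundaries can be looked up in the sorted date array. This is a sketch under assumptions: dates are sorted numeric values (here fractional years), `time_windows` is a hypothetical helper, and documents sharing the target's exact date are left out of both windows.

```python
import numpy as np

def time_windows(dates, span):
    """Index ranges of documents within `span` before and after each document.

    `dates` must be sorted ascending. Returns four arrays so that document i
    has past window [ps[i], pe[i]) and future window [fs[i], fe[i]).
    """
    dates = np.asarray(dates, dtype=float)
    ps = np.searchsorted(dates, dates - span, side="left")
    pe = np.searchsorted(dates, dates, side="left")    # stop before same-date docs
    fs = np.searchsorted(dates, dates, side="right")   # start after same-date docs
    fe = np.searchsorted(dates, dates + span, side="right")
    return ps, pe, fs, fe

# Toy dates (made up), in fractional years; two documents share 1790.0
toy = np.array([1789.0, 1789.5, 1790.0, 1790.0, 1791.2, 1793.0])
ps, pe, fs, fe = time_windows(toy, span=1.0)

print("past window sizes:  ", (pe - ps).tolist())
print("future window sizes:", (fe - fs).tolist())
```

Window sizes now vary with corpus density, so documents whose past or future window is empty or very small are worth excluding or reporting separately.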

Can this distinguish genuine influence from shared context?

Not on its own. Two works can converge because both respond to the same external event, not because one influenced the other. Computational scores are evidence to investigate, not proof of a causal chain.