When to Curate messy research spreadsheets

Curate a messy research spreadsheet when the underlying values are trustworthy but the layout is untidy; do not curate it when provenance is lost or incompatible record types are jammed into one sheet — in that case cleaning just makes bad data look authoritative. The decision hinges on a single question: after cleaning, will you be able to explain where every value came from? If not, the effort is misdirected, and you should rebuild or recollect instead.

What makes a spreadsheet "worth" curating?

A good candidate has sound data trapped in bad structure: consistent meaning per column, real values, recoverable provenance. A poor candidate has structural rot — colour as data, merged cells, several tables stacked vertically, free text where categories belong. Curating the second kind without first repairing structure produces a clean-looking file that silently encodes errors. Diagnose before deciding:

text

GOOD candidate                      POOR candidate (fix structure first)
- one variable per column           - meaning stored in cell colour
- one observation per row           - merged cells spanning data
- consistent units in a column      - units mixed into values ("3 km")
- a usable source/provenance note   - unknown or lost origin

When NOT to curate (yet)

Walk away from value-level cleaning when:

Provenance is gone. You cannot say which archive, page or run a value came from. Cleaning here launders uncertainty into false precision.
The sheet is a live collaborative document. Curating a moving target is irreproducible; freeze a snapshot first.
Recollection is cheaper. If untangling will take longer than re-extracting from the source, re-extract.
Multiple record types coexist. People, places and events on one sheet should become separate tidy tables, not one cleaned blob.

Cost and trade-off: a realistic budget

Mess level	Symptoms	Rough effort / 1k rows	Recommended tool
Cosmetic	Stray whitespace, case	15–30 min	Spreadsheet find/replace
Moderate	Inconsistent dates, categories	1–3 hrs	OpenRefine
Structural	Colour-as-data, merged cells	3–8 hrs	Manual restructure + script
Provenance-broken	Unknown origins	Often not worth it	Recollect / rebuild

If your estimate climbs toward "recollect" territory, that is itself the answer.

Should I clean in the spreadsheet or export it?

Export anything past cosmetic fixes. In-cell editing leaves no trail — you cannot answer "what did this value used to be?" OpenRefine, by contrast, records every operation as a reproducible recipe you can export as JSON and replay on the next batch:

json

[
  { "op": "core/text-transform", "columnName": "date", "expression": "value.toDate()" },
  { "op": "core/mass-edit", "columnName": "county", "edits": [
      { "from": ["Yorks", "Yorks."], "to": "Yorkshire" } ] }
]

That recipe is your documentation. A scripted pandas pipeline gives the same auditability if you prefer code.

How do I preserve meaning encoded in formatting?

Spreadsheets often smuggle data into appearance: red cells for "uncertain", bold for "checked", position implying a category. Before reshaping, lift that signal into explicit columns, or it vanishes:

python

import openpyxl
wb = openpyxl.load_workbook("survey.xlsx")
ws = wb.active
for row in ws.iter_rows(min_row=2):
    cell = row[3]
    if cell.fill.fgColor.rgb == "FFFF0000":   # red = uncertain
        row[8].value = "uncertain"            # new explicit flag column
wb.save("survey_flagged.xlsx")

Only after the implicit conventions are explicit is it safe to tidy and reshape.

A pragmatic decision rule

Use this order of operations: (1) snapshot and version the file; (2) check provenance — abort if absent; (3) extract formatting-as-data into columns; (4) split mixed record types; (5) only then normalise values in OpenRefine or a script. The first three steps are the actual "when to curate" gate. If a sheet fails them, more cleaning is wasted effort, not progress.

Key Takeaways

Curate when structure is bad but data is trustworthy; rebuild when provenance is lost.
The gating question is whether you can still explain every value's origin.
Freeze a versioned snapshot before curating — never clean a live collaborative sheet.
Move past cosmetic fixes into OpenRefine or a script so the work is auditable.
Extract meaning hidden in colour, position and formatting before reshaping.
Split mixed record types into separate tidy tables rather than one cleaned blob.
If cleaning costs more than recollecting, recollect.

Frequently Asked Questions

When is it worth curating a messy spreadsheet versus rebuilding it?

Curate when the data is sound but the structure is untidy, and rebuild when the spreadsheet mixes incompatible record types or has lost provenance. If you cannot trust where values came from, cleaning only launders bad data.

Should I clean in the spreadsheet or export to a tool like OpenRefine?

For anything beyond cosmetic fixes, export to OpenRefine or a scripted pipeline. In-cell editing is unauditable, whereas OpenRefine records every operation and lets you replay or undo the whole history.

What signals tell me a spreadsheet is not worth curating yet?

Colour-coded meaning, merged cells carrying data, multiple tables on one sheet, and values mixed with units are signs the structure must be fixed before curation pays off. Curate the layout first, then the values.

How long should curating a research spreadsheet take?

Budget roughly one to three hours per thousand rows for a moderately messy sheet, more if free text or dates need normalising. If your estimate exceeds the cost of recollecting the data, reconsider the approach.

Can I curate a live spreadsheet that collaborators keep editing?

Not reliably. Freeze a versioned snapshot, curate that, and feed corrections back through a documented process. Curating a moving target produces irreproducible results.

Does tidying a spreadsheet lose information?

It can if you discard formatting that encoded meaning, such as cell colour or position. Extract that signal into explicit columns before reshaping, so no human-readable convention is silently dropped.

What makes a spreadsheet "worth" curating? ​

When NOT to curate (yet) ​

Cost and trade-off: a realistic budget ​

Should I clean in the spreadsheet or export it? ​

How do I preserve meaning encoded in formatting? ​

A pragmatic decision rule ​

Key Takeaways ​

Frequently Asked Questions ​

When is it worth curating a messy spreadsheet versus rebuilding it? ​

Should I clean in the spreadsheet or export to a tool like OpenRefine? ​

What signals tell me a spreadsheet is not worth curating yet? ​

How long should curating a research spreadsheet take? ​

Can I curate a live spreadsheet that collaborators keep editing? ​

Does tidying a spreadsheet lose information? ​

Related reading ​