Export crowdsourced data when a logical unit — a collection, a volume, a defined batch — is complete and reviewed, and clean it as a separate, reproducible step that never touches the raw export. The cost of exporting too early is wasted effort cleaning data that will change; the cost of exporting too late is no backups and no ability to spot quality problems while you can still fix them. The right rhythm is small validation exports early, scheduled snapshots throughout, and a definitive bulk export at completion.
When is the right moment to export?
Think in three distinct moments, not one. The validation export happens in week one: pull 20–50 records to confirm your schema, encoding, and cleaning scripts work before thousands of volunteers contribute against a broken assumption. The snapshot export runs on a schedule for backup and trend monitoring. The definitive export comes when a unit is finished and reviewed — this is the one you clean fully and cite. Conflating these leads to either premature cleaning or dangerous data loss.
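As an illustration of the week-one validation export, the sketch below checks a tiny sample for missing columns, encoding damage, and non-ISO dates. The column names (`page_id`, `field`, `name`, `date`) and the `validate_sample` helper are assumptions made for this example, not any platform's export schema.

```python
import csv
import io
from datetime import date

# Hypothetical schema: replace with the columns your platform actually exports.
EXPECTED_COLUMNS = ["page_id", "field", "name", "date"]

def validate_sample(csv_text, expected_columns=EXPECTED_COLUMNS, max_rows=50):
    """Return a list of human-readable problems found in a small sample export."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in expected_columns if c not in (reader.fieldnames or [])]
    if missing:
        # Without the expected columns, row-level checks are meaningless.
        return [f"missing columns: {missing}"]
    for row_number, row in enumerate(reader, start=1):
        if row_number > max_rows:
            break
        # U+FFFD appears when bytes were decoded with the wrong encoding.
        if "\ufffd" in "".join(str(v) for v in row.values()):
            problems.append(f"row {row_number}: possible encoding damage")
        try:
            date.fromisoformat(row["date"])
        except (ValueError, TypeError):
            problems.append(f"row {row_number}: date {row['date']!r} is not ISO 8601")
    return problems

sample = (
    "page_id,field,name,date\n"
    "1,author,Ann Lee,1871-04-02\n"
    "2,author,J. Smith,2 April 1871\n"
)
print(validate_sample(sample))
```

An empty list means the sample passed these particular checks; it does not prove the schema is right, only that the assumptions encoded here hold for the rows inspected.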
When is exporting not worth it yet?
There are clear "not yet" signals. If only a few percent of items are complete, a full export mostly captures placeholders. If your guidelines are still changing weekly, anything you clean now will need redoing. If consensus rules are unsettled, the aggregated field is meaningless. In all three cases the better move is to stabilise the task, export a tiny diagnostic sample, and defer the full pull. Exporting is cheap; cleaning the wrong export is not.
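The three signals can be turned into a simple pre-export check. This is only a sketch: the function name and the thresholds (5% completion, 14 days of stable guidelines) are illustrative assumptions, and you should choose values that fit your own project.

```python
def full_export_blockers(completed, total, days_since_guideline_change,
                         consensus_rules_settled,
                         min_completion=0.05, min_stable_days=14):
    """Return the reasons a full export should wait; an empty list means go ahead."""
    blockers = []
    # Assumed threshold: below min_completion the export is mostly placeholders.
    if total == 0 or completed / total < min_completion:
        blockers.append("completion too low")
    # Assumed threshold: guidelines edited within the last min_stable_days.
    if days_since_guideline_change < min_stable_days:
        blockers.append("guidelines still changing")
    if not consensus_rules_settled:
        blockers.append("consensus rules unsettled")
    return blockers

print(full_export_blockers(completed=30, total=2000,
                           days_since_guideline_change=3,
                           consensus_rules_settled=False))
```

Returning the list of reasons, rather than a bare yes or no, makes it easier to record why a full pull was deferred.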
Should I clean inside the platform or after export?
Split the work by where the problem lives:
| Problem | Fix where | Why |
|---|---|---|
| Systemic guideline error | In the platform | Stops it recurring at source |
| Inconsistent date formats | After export | Reconciliation tooling is better outside |
| Duplicate / variant tags | After export | OpenRefine clustering is purpose-built |
| A confusing field volunteers misread | In the platform | UI fix prevents future bad data |
| One-off typos | After export | Cheaper than re-queuing items |
The principle: repair causes in the platform, repair symptoms downstream.
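To show what downstream repair of variant tags can look like, here is a simplified sketch loosely modelled on the key-collision "fingerprint" idea used in OpenRefine's clustering. It is an approximation written for illustration, not OpenRefine's implementation, and the sample tags are invented.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Normalise a tag to a comparison key: case, punctuation, and word order are ignored."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster_variants(values):
    """Group distinct raw values that share a fingerprint; singletons are omitted."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Invented example tags.
tags = ["Steam Ship", "steam ship", "Ship, Steam", "Railway", "railway ", "Canal"]
for key, variants in cluster_variants(tags).items():
    print(key, "<-", variants)
```

The function only proposes clusters. Whether two variants should actually be merged remains a human decision, and the raw values stay untouched.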
How do I keep cleaning reproducible?
Treat the raw export as immutable and write cleaning as a script that emits a separate file. A minimal pattern:
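```python
import pandas as pd

raw = pd.read_csv("exports/2025-03-08_raw.csv")  # never overwrite this

clean = (
    raw
    # Unparseable dates become NaT rather than raising; compare with raw to find them.
    .assign(date=lambda d: pd.to_datetime(d["date"], errors="coerce"))
    # Trim names and collapse internal runs of whitespace to a single space.
    .assign(name=lambda d: d["name"].str.strip().str.replace(r"\s+", " ", regex=True))
    # Keeps the first row for each page_id/field pair.
    .drop_duplicates(subset=["page_id", "field"])
)

# Write to a separate derived file so the raw export stays untouched.
clean.to_csv("derived/2025-03-08_clean.csv", index=False)
```

Because the transformation is code, you can re-run it against next month's export, diff the results, and prove exactly what changed, which is essential for a citable, defensible dataset.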
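The re-run-and-diff step might be sketched as follows, using only the standard library. The key columns `page_id` and `field` follow the pattern above; the helper names and the sample rows are assumptions for illustration.

```python
import csv
import io

def load_keyed(csv_text, key_columns=("page_id", "field")):
    """Index rows of a cleaned export by their key columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[c] for c in key_columns): row for row in reader}

def diff_exports(old_text, new_text):
    """Report which keys were added, removed, or changed between two cleaned exports."""
    old, new = load_keyed(old_text), load_keyed(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

# Invented sample data standing in for two monthly derived files.
march = "page_id,field,name\n1,author,Ann Lee\n2,author,J Smith\n"
april = "page_id,field,name\n1,author,Ann Lee\n2,author,John Smith\n3,author,M Okoro\n"
print(diff_exports(march, april))
```

In practice you would read the two derived files from disk instead of inline strings; the comparison logic stays the same.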
What should a scheduled export rhythm look like?
For an active project, automate snapshots and keep them dated and versioned:
- Weekly read-only snapshot for backup and quick quality checks.
- Monthly snapshot retained long-term for trend analysis.
- On milestone completion, a definitive export that you clean and publish.
- Every export filename carries an ISO date so chronology is unambiguous.
- Snapshots live in versioned storage, not a single overwritten file.
This gives you safety and history without nagging volunteers or destabilising the live task.
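One way to enforce dated, never-overwritten snapshots is sketched below. The directory layout and the `write_snapshot` helper are assumptions for this example; the part that matters is the ISO date in the filename and opening the file in exclusive-create mode.

```python
import tempfile
from datetime import date
from pathlib import Path

def snapshot_path(root, kind, on=None):
    """Build a path such as <root>/weekly/2025-03-08_weekly.csv."""
    on = on or date.today()
    return Path(root) / kind / f"{on.isoformat()}_{kind}.csv"

def write_snapshot(root, kind, csv_text, on=None):
    """Write a snapshot, refusing to overwrite one that already exists."""
    path = snapshot_path(root, kind, on)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError instead of silently replacing a snapshot.
    with open(path, "x", encoding="utf-8", newline="") as handle:
        handle.write(csv_text)
    return path

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    written = write_snapshot(root, "weekly", "page_id,field\n1,author\n",
                             on=date(2025, 3, 8))
    print(written.name)
```

Scheduling itself (cron, a CI job, or a platform feature) is outside this sketch; the function only guarantees that each run produces a new, dated file.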
What are the trade-offs of cleaning crowdsourced data at all?
Cleaning is not free of risk. Aggressive normalisation can erase meaningful variation — original spellings in a transcription are often the point, not noise. Over-eager deduplication can collapse genuinely distinct records. And every cleaning rule is an interpretive decision that should be documented. The trade-off is legibility versus fidelity: clean enough to be usable, but preserve the raw layer so a future researcher can reconstruct what volunteers actually wrote.
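A small sketch of cleaning that keeps the raw layer and records which rules fired is shown below. The rule names and the `clean_with_provenance` helper are hypothetical; the design choice is that every rule is named, so it can be documented as an interpretive decision, and that nothing here touches spelling.

```python
import re

# Each rule is (name, function). The names are what you would cite in documentation.
RULES = [
    ("strip_outer_whitespace", str.strip),
    ("collapse_inner_whitespace", lambda s: re.sub(r"\s+", " ", s)),
]

def clean_with_provenance(raw_value, rules=RULES):
    """Apply rules in order, keeping the raw value and the names of rules that changed it."""
    value = raw_value
    applied = []
    for name, rule in rules:
        updated = rule(value)
        if updated != value:
            applied.append(name)
        value = updated
    return {"raw": raw_value, "clean": value, "rules_applied": applied}

# Invented transcription: the abbreviation and spelling are preserved, only whitespace changes.
print(clean_with_provenance("  Jno.   Smyth "))
```

Storing the raw and cleaned values side by side lets a later researcher see both what the volunteer typed and what each rule did to it.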
Key Takeaways
- Export small validation samples early; reserve full exports for completed units.
- Don't export when completion is tiny or guidelines are still shifting.
- Fix systemic causes in the platform and symptoms downstream.
- Keep the raw export immutable and clean via a reproducible script.
- Run scheduled, dated snapshots for backup and trend data.
- Preserve original spellings and variation; clean for legibility, not erasure.
- Document every cleaning rule as an interpretive decision.
Frequently Asked Questions
When should I first export crowdsourced data?
Export a small sample early, within the first week, to validate the schema and your cleaning scripts. Hold the full bulk export until a collection or volume is genuinely complete and reviewed.
Should I clean inside the platform or after export?
Do structural reconciliation after export in tools like OpenRefine or pandas, but fix systemic guideline problems inside the platform so they stop recurring at the source.
How often should I run bulk exports?
For active projects, a scheduled weekly or monthly export gives you backups and trend data without disrupting volunteers. Avoid daily full exports unless you have an automated pipeline.
Is it safe to clean crowdsourced data destructively?
No. Always keep the raw export immutable and write cleaning as a reproducible script that produces a separate derived file, so you can re-run it when rules change.
What if a project is still ongoing?
Export read-only snapshots for analysis and backup, but treat them as provisional. Wait for completion before publishing a citable dataset, since late edits can shift results.
When is exporting not worth it yet?
If under a few percent of items are complete, or guidelines are still changing weekly, a full export mostly captures noise. Stabilise the task first, then export.