Data Cleaning with OpenRefine

If OpenRefine shows Ã© where you expected é, you have a character-encoding mismatch: the file was decoded with the wrong codec on import. The reliable fix is to re-import and pick the correct encoding in the import preview before clicking Create Project; the live preview lets you confirm that the accents render correctly. For projects already created, GREL's reinterpret() can often repair the damage, but a clean re-import is the safer route.

Why do accented characters turn into gibberish?

This corruption is called mojibake. It happens when bytes written in one encoding (commonly UTF-8) are read as another (commonly Windows-1252 or ISO-8859-1). The classic signatures:

text
é  -> Ã©
ñ  -> Ã±
“  -> â€œ
–  -> â€“

No characters are lost — they are merely mis-decoded. That is good news: the underlying bytes are usually recoverable if you identify the true encoding.
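
The round trip can be reproduced outside OpenRefine. The following is a minimal Python sketch, assuming the wrong codec was Windows-1252 (cp1252 in Python); the helper names garble and ungarble are illustrative and not part of any library.

```python
def garble(text: str) -> str:
    """Simulate mojibake: write text as UTF-8, then read it back as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(text: str) -> str:
    """Reverse the mistake: recover the original bytes, then decode them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


for original in ["é", "ñ", "“", "–"]:
    broken = garble(original)
    # The bytes survive the wrong decode, so the original can be restored.
    print(f"{original} -> {broken} -> {ungarble(broken)}")
```

Each line printed ends with the character it started from, which is what "mis-decoded, not lost" means in practice.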

How do I find out which encoding my file really uses?

Before importing, ask the bytes. Two quick commands:

bash
# Linux (on macOS use `file -I`; lowercase -i means something else there)
file -i records.csv
# -> records.csv: text/csv; charset=utf-8

# Cross-platform, via Python
python -m chardet records.csv
# -> records.csv: UTF-8-SIG with confidence 0.99

chardet (also installed as the chardetect command) reports a best guess and a confidence score. If it says windows-1252 with high confidence but you assumed UTF-8, you have found your culprit. A UTF-8-SIG result means the file is UTF-8 with a leading byte-order mark (BOM).
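
When neither tool is installed, a rough check is possible with the Python standard library alone. The sketch below is a heuristic, not a substitute for chardet: the function name guess_encoding is hypothetical, and the windows-1252 fallback is only an assumption, since bytes that fail strict UTF-8 decoding could belong to any of several legacy code pages.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a rough encoding guess for raw file bytes (heuristic only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with a byte-order mark
    try:
        data.decode("utf-8")  # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        return "windows-1252"  # assumed fallback; could be another legacy code page
    return "utf-8"


print(guess_encoding("café".encode("utf-8")))      # valid UTF-8
print(guess_encoding("café".encode("cp1252")))     # a lone 0xE9 byte is invalid UTF-8
print(guess_encoding("café".encode("utf-8-sig")))  # starts with EF BB BF
```

For a real file you would pass in the bytes read in binary mode, for example from open("records.csv", "rb").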

Step by step: setting encoding on import

  1. Create Project ▸ This Computer and select your file.
  2. On the preview screen, find the Character encoding box.
  3. Click it and choose the encoding file/chardet reported (e.g. UTF-8, ISO-8859-1, windows-1252).
  4. Watch the live preview — accented characters should render correctly.
  5. Only when the preview is clean, click Create Project.

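If a byte-order mark leaves stray characters (typically ï»¿) at the very start of the first header, the file is UTF-8 with a BOM that was decoded as a single-byte encoding. Pick UTF-8, which lets OpenRefine strip the BOM, rather than an encoding that treats those three bytes as text.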
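
To see why the encoding choice decides whether the BOM shows up as text, here is a small Python sketch. It uses Python's codec names (cp1252, utf-8, utf-8-sig) as stand-ins for the choices in the encoding box; it illustrates the byte-level behaviour only and does not reproduce OpenRefine's own import logic.

```python
import codecs

# A CSV header line as a BOM-prefixed UTF-8 file would store it.
raw = codecs.BOM_UTF8 + "name,city\n".encode("utf-8")

as_cp1252 = raw.decode("cp1252")       # BOM bytes EF BB BF become three visible characters
as_utf8 = raw.decode("utf-8")          # BOM becomes an invisible U+FEFF at the start
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is recognised and dropped

print(repr(as_cp1252))
print(repr(as_utf8))
print(repr(as_utf8_sig))
```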

Can I fix encoding after the project already exists?

Yes, partially, with GREL. If you imported as Windows-1252 but the data was really UTF-8, add a column or transform with:

grel
reinterpret(value, "utf-8", "windows-1252")

The signature is reinterpret(value, outputEncoding, inputEncoding) — it re-decodes the string as though read with inputEncoding and re-encodes as outputEncoding. Test on one column first; if é becomes é, apply across the affected columns. If it produces question marks or replacement characters (), the original bytes were already destroyed and only a re-import will help.

Which encoding should I choose? A quick guide

Symptom | Likely cause | Action
Ã©, â€œ everywhere | UTF-8 read as windows-1252 | Re-import as UTF-8
? or boxes for accents | windows-1252 read as UTF-8 | Re-import as windows-1252 / ISO-8859-1
Stray char before first header | BOM | Re-import as UTF-8
Cyrillic/Greek garbled | Legacy code page (e.g. windows-1251) | Identify with chardet, set explicitly

Pitfalls to avoid

  • Do not double-fix. Running reinterpret() twice re-corrupts clean text.
  • Do not trust the spreadsheet that created the file — Excel often writes windows-1252 even when you intended UTF-8.
  • Re-export deliberately. OpenRefine exports UTF-8 by default, so a correct import yields a clean export with no extra steps.
  • Keep the raw original. Never overwrite the source file until you have confirmed the cleaned export round-trips correctly.
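
The double-fix pitfall suggests a guard: attempt the repair strictly and keep the original value whenever the round trip fails. A minimal Python sketch follows, assuming the damage is UTF-8 decoded as Windows-1252; fix_once is a hypothetical helper name, not an OpenRefine or library function.

```python
def fix_once(value: str) -> str:
    """Repair UTF-8-read-as-cp1252 mojibake; return value unchanged if it is not that."""
    try:
        return value.encode("cp1252").decode("utf-8")  # strict on both steps
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value  # clean text or a different problem: leave it alone


once = fix_once("SeÃ±or")
twice = fix_once(once)
print(once, twice)
```

Because clean accented text fails the strict decode, a second pass leaves it untouched. Text that legitimately contains a sequence such as Ã© would still be altered, so review a sample before applying a repair like this broadly.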

Key Takeaways

  • Mojibake (é for é) means the file was decoded with the wrong codec, not that data is lost.
  • The cleanest fix is re-importing with the correct encoding chosen in the live preview.
  • Use file -i or Python's chardet to detect the true encoding before importing.
  • reinterpret(value, "utf-8", "windows-1252") repairs many cases inside an existing project.
  • Replacement characters () mean the bytes are gone — re-import is the only cure.
  • OpenRefine exports UTF-8 by default, so a correct import gives a clean export automatically.

Frequently Asked Questions

Why does my OpenRefine import show characters like Ã© instead of é?

That is mojibake: a UTF-8 file was read as Windows-1252 (or vice versa). The fix is to re-import the file and set the correct character encoding in the import preview before clicking Create Project.

Where do I set the character encoding in OpenRefine?

On the import preview screen there is a Character encoding box. Click it, choose the correct encoding such as UTF-8, ISO-8859-1, or windows-1252, and the preview updates live so you can confirm before creating the project.

Can I fix encoding after the project is already created?

Partially. You can repair some mojibake with GREL functions like reinterpret(value, 'UTF-8', 'windows-1252'), but it is cleaner and safer to delete the project and re-import with the correct encoding chosen up front.

What does the GREL reinterpret function do for encoding?

reinterpret(value, output, input) re-decodes a string as if it had been read with a different encoding. For example reinterpret(value, 'utf-8', 'windows-1252') often repairs accented characters that were imported with the wrong codec.

How do I find which encoding my source file actually uses?

Run a tool such as file -i on Linux (file -I on macOS), or chardetect from Python's chardet package. Both inspect the bytes and report a best-guess encoding before you import; chardetect also gives a confidence score.

Does OpenRefine export in UTF-8 by default?

Yes. OpenRefine exports CSV, TSV, and most text formats as UTF-8, so as long as your import was decoded correctly, the export will carry clean, well-formed characters.