If OpenRefine shows Ã© where you expected é, you have a character-encoding mismatch: the file was decoded with the wrong codec on import. The reliable fix is to re-import and pick the correct encoding in the import preview before clicking Create Project; the live preview lets you confirm that the accents render correctly. For projects already created, GREL's `reinterpret()` can repair the damage, but a clean re-import is usually safer.
Why do accented characters turn into gibberish?
This corruption is called mojibake. It happens when bytes written in one encoding (commonly UTF-8) are read as another (commonly Windows-1252 or ISO-8859-1). The classic signatures:
```text
Ã© -> é
Ã± -> ñ
â€œ -> “
â€“ -> –
```

No characters are lost; they are merely mis-decoded. That is good news: the underlying bytes are usually recoverable if you identify the true encoding.
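The mechanism is easy to reproduce outside OpenRefine. The following is a minimal Python illustration, not anything OpenRefine itself runs: it encodes clean text as UTF-8, decodes those bytes as Windows-1252 to produce the signatures above, then reverses the mistake. The helper names `garble` and `ungarble` are made up for the example.

```python
# Reproduce mojibake: UTF-8 bytes decoded with the wrong codec (cp1252).
def garble(text: str) -> str:
    """Encode as UTF-8, then mis-decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(text: str) -> str:
    """Reverse the mistake: recover the bytes, then decode them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


for clean in ["é", "ñ", "“", "–"]:
    broken = garble(clean)
    # Each line shows the corrupted form and the recovered original.
    print(f"{broken} -> {ungarble(broken)}")
```

Because the wrong decoding maps each byte to exactly one character, the step is reversible as long as every byte had a mapping in the wrong codec.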
How do I find out which encoding my file really uses?
Before importing, ask the bytes. Two quick commands:
```bash
# Linux (on macOS use `file -I`; `file --mime` works on both)
file -i records.csv
# -> records.csv: text/csv; charset=utf-8

# Cross-platform, via Python's chardet package (pip install chardet)
chardetect records.csv
# -> records.csv: UTF-8-SIG with confidence 0.99
```

`chardetect` reports a best guess and a confidence score. If it says windows-1252 with high confidence but you assumed UTF-8, you have found your culprit. A UTF-8-SIG result means the file has a byte-order mark (BOM).
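If neither tool is at hand, a rough first check is possible with the Python standard library alone. The sketch below is a heuristic under stated assumptions, not a replacement for a real detector: the function name `guess_encoding` and the `windows-1252` fallback are illustrative choices. It looks for a UTF-8 BOM, then tests whether the bytes are valid strict UTF-8, and otherwise falls back to a legacy single-byte guess.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Very rough guess: BOM check, then strict UTF-8 validation."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # UTF-8 with a byte-order mark
    try:
        data.decode("utf-8")    # strict mode raises on invalid sequences
        return "utf-8"          # note: pure ASCII also lands here
    except UnicodeDecodeError:
        # Assumption: Western European legacy data. Other code pages
        # (e.g. windows-1251) need a statistical detector instead.
        return "windows-1252"


# Sample byte strings standing in for the first bytes of a file.
samples = {
    "utf-8 bytes": "café".encode("utf-8"),
    "utf-8 with BOM": codecs.BOM_UTF8 + "café".encode("utf-8"),
    "cp1252 bytes": "café".encode("cp1252"),
}
for label, data in samples.items():
    print(f"{label}: {guess_encoding(data)}")
```

For a real file, read it in binary mode (`open(path, "rb")`) and pass the bytes to the function. Strict UTF-8 validation is a strong signal because text in a legacy encoding almost never happens to form valid multi-byte UTF-8 sequences.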
Step by step: setting encoding on import
- Create Project ▸ This Computer and select your file.
- On the preview screen, find the Character encoding box.
- Click it and choose the encoding `file`/`chardet` reported (e.g. `UTF-8`, `ISO-8859-1`, `windows-1252`).
- Watch the live preview; accented characters should render correctly.
- Only when the preview is clean, click Create Project.
If the BOM leaves a stray character at the very start of the first header, pick UTF-8 (OpenRefine strips the BOM) rather than a single-byte encoding such as windows-1252, which displays the BOM bytes as ï»¿.
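To see what that stray character actually is, here is a small Python illustration (independent of OpenRefine) of how the same BOM-prefixed header decodes under three codecs. The sample header `name,city` is invented for the example.

```python
import codecs

# A header line as it sits on disk in a UTF-8 file that starts with a BOM.
raw = codecs.BOM_UTF8 + "name,city\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # BOM survives as the invisible U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is recognised and stripped
as_cp1252 = raw.decode("cp1252")       # BOM bytes become three visible chars

print(repr(as_utf8))
print(repr(as_utf8_sig))
print(repr(as_cp1252))
```

Python's plain `utf-8` codec keeps the BOM, which is why it has a separate `utf-8-sig` codec; a tool that strips the BOM under its UTF-8 option behaves like the second line.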
Can I fix encoding after the project already exists?
Yes, partially, with GREL. If you imported as Windows-1252 but the data was really UTF-8, add a column or transform with:
```grel
reinterpret(value, "utf-8", "windows-1252")
```

The signature is `reinterpret(value, outputEncoding, inputEncoding)`: it converts the string back to bytes using inputEncoding, then decodes those bytes as outputEncoding. Test on one column first; if Ã© becomes é, apply across the affected columns. If it produces question marks or replacement characters (�), the original bytes were already destroyed and only a re-import will help.
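For intuition, the repair can be mimicked in Python. This is a rough analogue, assuming the function turns the string back into bytes with the source encoding and decodes those bytes with the target encoding; the helper name `reinterpret_like` is hypothetical and this is not OpenRefine's implementation.

```python
def reinterpret_like(value: str, target: str, source: str) -> str:
    """Encode with the (wrong) source codec, decode with the target codec.

    Bytes that cannot be decoded become U+FFFD, so a failed repair is
    visible in the result instead of raising an exception.
    """
    raw = value.encode(source, errors="replace")
    return raw.decode(target, errors="replace")


# Mojibake produced by reading UTF-8 bytes as Windows-1252.
print(reinterpret_like("CÃ´te dâ€™Ivoire", "utf-8", "windows-1252"))

# Applying the same repair to text that is already clean damages it:
# the single byte for "é" in Windows-1252 is not valid UTF-8.
print(repr(reinterpret_like("é", "utf-8", "windows-1252")))
```

The second call shows why the repair should only be applied to cells that are actually corrupted: clean accented text comes out as a replacement character.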
Which encoding should I choose? A quick guide
| Symptom | Likely cause | Action |
|---|---|---|
| Ã©, â€“ everywhere | UTF-8 read as windows-1252 | Re-import as UTF-8 |
| ? or boxes for accents | windows-1252 read as UTF-8 | Re-import as windows-1252 / ISO-8859-1 |
| Stray char before first header | BOM | Re-import as UTF-8 |
| Cyrillic/Greek garbled | Legacy code page (e.g. windows-1251) | Identify with chardet, set explicitly |
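The first three rows of the table can be turned into a rough diagnostic. The Python sketch below is heuristic, and every name in it (`diagnose` and the labels it returns) is made up for illustration. It inspects text that has already been decoded, for example a cell value copied out of a project.

```python
def diagnose(text: str) -> str:
    """Map already-decoded text to a row of the symptom table."""
    # A BOM shows up as U+FEFF, or as three visible chars under cp1252.
    if text.startswith("\ufeff") or text.startswith("\u00ef\u00bb\u00bf"):
        return "bom"
    # U+FFFD means invalid bytes were already replaced during decoding.
    if "\ufffd" in text:
        return "lossy"
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "ok-or-unknown"  # does not fit the UTF-8-as-1252 pattern
    # Pure ASCII round-trips unchanged, so only a changed result counts.
    return "utf8-read-as-1252" if repaired != text else "ok-or-unknown"


for cell in ["cafÃ©", "café", "caf\ufffd", "\ufeffname", "plain"]:
    print(f"{cell!r}: {diagnose(cell)}")
```

The check cannot identify legacy code pages such as windows-1251, which is where a statistical detector is still needed.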
Pitfalls to avoid
- Do not double-fix. Running `reinterpret()` twice re-corrupts clean text.
- Do not trust the spreadsheet that created the file; Excel often writes `windows-1252` even when you intended UTF-8.
- Re-export deliberately. OpenRefine exports UTF-8 by default, so a correct import yields a clean export with no extra steps.
- Keep the raw original. Never overwrite the source file until you have confirmed the cleaned export round-trips correctly.
Key Takeaways
- Mojibake (`Ã©` for `é`) means the file was decoded with the wrong codec, not that data is lost.
- The cleanest fix is re-importing with the correct encoding chosen in the live preview.
- Use `file -i` (`file -I` on macOS) or Python's `chardet` to detect the true encoding before importing.
- `reinterpret(value, "utf-8", "windows-1252")` repairs many cases inside an existing project.
- Replacement characters (`�`) mean the bytes are gone; re-import is the only cure.
- OpenRefine exports UTF-8 by default, so a correct import gives a clean export automatically.
Frequently Asked Questions
Why does my OpenRefine import show characters like Ã© instead of é?
That is mojibake: a UTF-8 file was read as Windows-1252 (or vice versa). The fix is to re-import the file and set the correct character encoding in the import preview before clicking Create Project.
Where do I set the character encoding in OpenRefine?
On the import preview screen there is a Character encoding box. Click it, choose the correct encoding such as UTF-8, ISO-8859-1, or windows-1252, and the preview updates live so you can confirm before creating the project.
Can I fix encoding after the project is already created?
Partially. You can repair some mojibake with GREL functions like reinterpret(value, 'UTF-8', 'windows-1252'), but it is cleaner and safer to delete the project and re-import with the correct encoding chosen up front.
What does the GREL reinterpret function do for encoding?
reinterpret(value, output, input) re-decodes a string as if it had been read with a different encoding. For example reinterpret(value, 'utf-8', 'windows-1252') often repairs accented characters that were imported with the wrong codec.
How do I find which encoding my source file actually uses?
Run a tool such as `file -i` on Linux (`file -I` on macOS), or `chardetect` from Python's chardet package. Both inspect the bytes and report a best-guess encoding before you import; `chardetect` also reports a confidence score.
Does OpenRefine export in UTF-8 by default?
Yes. OpenRefine exports CSV, TSV, and most text formats as UTF-8, so as long as your import was decoded correctly, the export will carry clean, well-formed characters.