How to Fix Character Encoding in OpenRefine

If OpenRefine shows Ã© where you expected é, you have a character-encoding mismatch — the file was decoded with the wrong codec on import. The reliable fix is to re-import and pick the correct encoding in the import preview before clicking Create Project; the live preview lets you confirm the accents render correctly. For projects already created, GREL's reinterpret() can repair the damage, but a clean re-import is always safer.

Why do accented characters turn into gibberish?

This corruption is called mojibake. It happens when bytes written in one encoding (commonly UTF-8) are read as another (commonly Windows-1252 or ISO-8859-1). The classic signatures:

text

é  -> Ã©
ñ  -> Ã±
“  -> â€œ
–  -> â€“

No characters are lost — they are merely mis-decoded. That is good news: the underlying bytes are usually recoverable if you identify the true encoding.
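
To see why nothing is lost, the round trip can be reproduced outside OpenRefine. The following is a minimal Python sketch; the sample string is illustrative, and `cp1252` is Python's codec name for Windows-1252.

```python
# Encode text as UTF-8, then decode the same bytes with the wrong codec.
original = "café señor"
raw = original.encode("utf-8")      # the bytes actually stored in the file
garbled = raw.decode("cp1252")      # what a Windows-1252 reader displays

print(garbled)                      # cafÃ© seÃ±or

# The bytes never changed, so reversing the two steps restores the text.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == original)         # True
```

The second half is the same idea that a repair inside an existing project relies on: re-encode with the wrong codec, then decode with the right one.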

How do I find out which encoding my file really uses?

Before importing, ask the bytes. Two quick commands:

bash

# Linux (on macOS the equivalent flag is uppercase: file -I records.csv)
file -i records.csv
# -> records.csv: text/csv; charset=utf-8

# Cross-platform, via Python's chardet package (pip install chardet)
chardetect records.csv
# -> records.csv: UTF-8-SIG with confidence 0.99

chardetect reports a best guess and a confidence score. If it says windows-1252 with high confidence but you assumed UTF-8, you have found your culprit. A UTF-8-SIG result means the file has a byte-order mark (BOM).
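
When neither tool is available, a rough check can be done with the Python standard library alone. The sketch below is a crude heuristic, not a replacement for chardet: `sniff_encoding` is a hypothetical helper, and the fallback to windows-1252 is an assumption that only suits Western European data.

```python
import codecs

def sniff_encoding(raw: bytes) -> str:
    """Crude guess: BOM first, then strict UTF-8, else a single-byte fallback."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        raw.decode("utf-8")          # strict mode: any invalid sequence raises
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Western European legacy text.
        return "windows-1252"

print(sniff_encoding("é".encode("utf-8")))             # utf-8
print(sniff_encoding("é".encode("cp1252")))            # windows-1252
print(sniff_encoding(codecs.BOM_UTF8 + b"id,name"))    # utf-8-sig
```

For a real file, pass in the bytes read in binary mode, for example `sniff_encoding(open("records.csv", "rb").read())`. Pure ASCII content decodes as valid UTF-8, so it is reported as utf-8.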

Step by step: setting encoding on import

Open Create Project ▸ This Computer and select your file.
On the preview screen, find the Character encoding box.
Click it and choose the encoding that file or chardetect reported (e.g. UTF-8, ISO-8859-1, windows-1252).
Watch the live preview — accented characters should render correctly.
Only when the preview is clean, click Create Project.

If a stray character appears at the very start of the first header, the file probably begins with a BOM. Pick UTF-8 (OpenRefine strips the BOM) rather than a single-byte encoding such as windows-1252, which displays the BOM bytes as ï»¿.
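
The effect of a BOM on the first header can be illustrated with Python's own codecs. This is a sketch of the byte-level behaviour only, not of OpenRefine's importer, and the CSV content is made up for the example.

```python
import codecs

# A small UTF-8 file that starts with a byte-order mark.
data = codecs.BOM_UTF8 + "id,name\n1,José\n".encode("utf-8")

header_plain = data.decode("utf-8").splitlines()[0]      # BOM kept as U+FEFF
header_sig = data.decode("utf-8-sig").splitlines()[0]    # BOM stripped
header_1252 = data.decode("cp1252").splitlines()[0]      # BOM shown as three characters

print(repr(header_plain))   # '\ufeffid,name'
print(repr(header_sig))     # 'id,name'
print(repr(header_1252))    # 'ï»¿id,name'
```

A column named `\ufeffid` looks identical to `id` on screen but will not match it in lookups, which is why the stray character is worth removing at import time.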

Can I fix encoding after the project already exists?

Yes, partially, with GREL. If you imported as Windows-1252 but the data was really UTF-8, add a column based on the damaged one, or transform its cells in place, with:

grel

reinterpret(value, "utf-8", "windows-1252")

The signature is reinterpret(value, outputEncoding, inputEncoding). It turns the string back into bytes using inputEncoding (the codec wrongly applied on import) and then decodes those bytes as outputEncoding (the file's true encoding). Test on one column first; if Ã© becomes é, apply across the affected columns. If it produces question marks or replacement characters (�), the original bytes were already destroyed and only a re-import will help.
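
The same transformation can be modelled in a few lines of Python, which may help when reasoning about argument order. This is a rough analogue written for illustration, not OpenRefine's implementation; the function simply borrows the GREL name and argument order described above.

```python
def reinterpret(value: str, output_encoding: str, input_encoding: str) -> str:
    """Rough Python analogue of the GREL call (an assumption, not OpenRefine code)."""
    # Step 1: recover the bytes by encoding with the codec wrongly used on import.
    raw = value.encode(input_encoding, errors="replace")
    # Step 2: decode those bytes with the codec the file was really written in.
    return raw.decode(output_encoding, errors="replace")

print(reinterpret("cafÃ©", "utf-8", "cp1252"))       # café
print(reinterpret("caf\ufffd", "utf-8", "cp1252"))   # caf?
```

The second call shows the unrecoverable case: once a cell holds a replacement character, the original byte is gone and the result is only a question mark.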

Which encoding should I choose? A quick guide

| Symptom | Likely cause | Action |
| --- | --- | --- |
| `Ã©`, `â€œ` everywhere | UTF-8 read as windows-1252 | Re-import as UTF-8 |
| `?`, `�`, or boxes in place of accents | windows-1252 read as UTF-8 | Re-import as windows-1252 / ISO-8859-1 |
| Stray character before first header | BOM | Re-import as UTF-8 |
| Cyrillic/Greek garbled | Legacy code page (e.g. windows-1251) | Identify with chardet, set explicitly |
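
The symptoms in the guide above can also be checked mechanically on an exported sample. The sketch below is a hypothetical helper (`diagnose` and the `SIGNATURES` table are illustrative names) and it can give false positives, since a character such as Ã is legitimate in some languages.

```python
# Hypothetical lookup of common mojibake markers and what they usually indicate.
SIGNATURES = {
    "Ã": "UTF-8 read as windows-1252 (accented Latin letters)",
    "â€": "UTF-8 read as windows-1252 (curly quotes or dashes)",
    "ï»¿": "UTF-8 BOM read as windows-1252",
    "\ufffd": "bytes that were invalid in the chosen encoding",
}

def diagnose(text: str) -> list[str]:
    """Return the description of every signature found in text."""
    return [why for marker, why in SIGNATURES.items() if marker in text]

print(diagnose("cafÃ©"))   # ['UTF-8 read as windows-1252 (accented Latin letters)']
print(diagnose("café"))    # []
```

Treat a hit as a prompt to inspect the column, not as proof; the import preview remains the final check.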

Pitfalls to avoid

Do not double-fix. Running reinterpret() a second time on text that is already repaired corrupts it again.
Do not trust the spreadsheet that created the file — Excel often writes windows-1252 even when you intended UTF-8.
Re-export deliberately. OpenRefine exports UTF-8 by default, so a correct import yields a clean export with no extra steps.
Keep the raw original. Never overwrite the source file until you have confirmed the cleaned export round-trips correctly.
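
The double-fix pitfall can be guarded against when repairing text in a script outside OpenRefine. The following is a minimal sketch, assuming the damage is specifically UTF-8 read as windows-1252; `repair_once` is a hypothetical helper that leaves a value alone whenever the strict round trip fails.

```python
def repair_once(value: str) -> str:
    """Undo UTF-8-read-as-windows-1252 damage; return other text unchanged."""
    try:
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The strict round trip failed, so this was not mojibake of that kind.
        return value

once = repair_once("cafÃ©")
twice = repair_once(once)    # a second pass is harmless here
print(once, twice)           # café café
```

Strict decoding is what makes the second pass safe: clean accented text encodes to bytes that are not valid UTF-8, so the function falls through and returns the input. Text that legitimately contains a sequence like Ã© would still be altered, so spot-check results against the raw original.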

Key Takeaways

Mojibake (Ã© for é) means the file was decoded with the wrong codec, not that data is lost.
The cleanest fix is re-importing with the correct encoding chosen in the live preview.
Use file -i or Python's chardet to detect the true encoding before importing.
reinterpret(value, "utf-8", "windows-1252") repairs many cases inside an existing project.
Replacement characters (�) mean the bytes are gone — re-import is the only cure.
OpenRefine exports UTF-8 by default, so a correct import gives a clean export automatically.

Frequently Asked Questions

Why does my OpenRefine import show characters like Ã© instead of é?

That is mojibake: a UTF-8 file was read as Windows-1252 (or vice versa). The fix is to re-import the file and set the correct character encoding in the import preview before clicking Create Project.

Where do I set the character encoding in OpenRefine?

On the import preview screen there is a Character encoding box. Click it, choose the correct encoding such as UTF-8, ISO-8859-1, or windows-1252, and the preview updates live so you can confirm before creating the project.

Can I fix encoding after the project is already created?

Partially. You can repair some mojibake with GREL functions like reinterpret(value, 'UTF-8', 'windows-1252'), but it is cleaner and safer to delete the project and re-import with the correct encoding chosen up front.

What does the GREL reinterpret function do for encoding?

reinterpret(value, output, input) re-decodes a string as if it had been read with a different encoding. For example reinterpret(value, 'utf-8', 'windows-1252') often repairs accented characters that were imported with the wrong codec.

How do I find which encoding my source file actually uses?

Run a tool such as file -i on Linux (file -I on macOS), or chardetect from Python's chardet package. Both inspect the bytes and report a best-guess encoding before you import; chardetect also reports a confidence score.

Does OpenRefine export in UTF-8 by default?

Yes. OpenRefine exports CSV, TSV, and most text formats as UTF-8, so as long as your import was decoded correctly, the export will carry clean, well-formed characters.
