I'm striving to convert badly encoded data from my table. For instance, I have a field with Nadà ̈ge
which should be Nadège
.
I tried using Postgres's functions convert
, convert_from
, convert_to
without much success.
db=# SHOW client_encoding;
client_encoding
-----------------
UTF8
(1 row)
db=# SHOW server_encoding;
server_encoding
-----------------
UTF8
(1 row)
db=# SELECT "firstName", encode("firstName"::bytea, 'hex') FROM contact;
firstName | encode
-----------+--------------------
Nadège | 4e6164c3a86765
Nadà ̈ge | 4e6164c383c2a86765
(2 rows)
db=# SELECT "firstName", convert_from("firstName"::bytea, 'latin1') FROM contact WHERE "lastName" ILIKE 'crochard';
firstName | convert_from
-----------+----------------
Nadège | Nadà ̈ge
Nadà ̈ge | NadÃ\u0083 ̈ge
(2 rows)
db=# SELECT "firstName", convert("firstName"::bytea, 'utf8', 'latin1') FROM contact;
firstName | convert
-----------+------------------
Nadège | \x4e6164e86765
Nadà ̈ge | \x4e6164c3a86765
(2 rows)
Using python I'm able to get the correct encoding with:
data.encode('latin1').decode('utf8')
Any hint on how to convert these wrongly encoded data in postgres ?
1 Answer 1
As you have correctly identified, Nadà ̈ge
is the UTF-8 representation of Nadège
incorrectly decoded as ISO-8859-1 ("latin-1"). Then, in your case, re-encoded to UTF-8 for storage in the DB.
To fix it you need to:
- Take the current representation and decode the UTF-8 to latin-1 as a byte string
- re-interpret the byte string, decoding it as utf-8
So:
test=> SELECT convert_from(convert_to('Nadà ̈ge', 'latin-1'), 'utf-8');
convert_from
--------------
Nadège
(1 row)
The Python equivalent would be close to what you wrote, but starts with a unicode representation to illustrate that PostgreSQL stores everything in the database encoding. Something like:
>>> print u"Nadà ̈ge".encode("latin-1").decode("utf-8")
Nadège
The problem with all your attempted solutions is that the cast from text
to bytea
uses the database encoding. So you're starting with the bytes for the utf-8 representation of utf-8 mis-decoded as latin-1. With the cast you'd have to write:
test=> SELECT convert_from(convert_to(convert_from((TEXT 'Nadà ̈ge')::bytea, 'utf-8'), 'latin-1'), 'utf-8');
convert_from
--------------
Nadège
(1 row)
because you have to explicitly decode the utf-8 representation produced by the cast before re-interpreting as latin-1 and decoding again.
You just needed to use convert_to(mycol, 'latin-1')
instead of mycol::bytea
-
We have solved our encoding issue with perl yesterday, so I cannot try out in a real environment. But your explanation about
bytea
casting and why we failed to convert seems to be the exact point. I accept your answer, thank you :)Cyrbil– Cyrbil2015年12月17日 10:15:01 +00:00Commented Dec 17, 2015 at 10:15 -
1The second solution with the
::bytea
cast, has one unexpected caveat: Backslashes in the text string need to be escaped (i.e. doubled), sincebytea
's input function interprets them as special!cstork– cstork2020年11月28日 17:23:42 +00:00Commented Nov 28, 2020 at 17:23