Postgres database encoding problem

Question

I'm trying to convert badly encoded data in my table. For instance, I have a field containing NadÃ¨ge, which should be Nadège.

I tried Postgres's convert, convert_from, and convert_to functions without much success.

db=# SHOW client_encoding;
 client_encoding 
-----------------
 UTF8
(1 row)
db=# SHOW server_encoding;
 server_encoding 
-----------------
 UTF8
(1 row)
db=# SELECT "firstName", encode("firstName"::bytea, 'hex') FROM contact; 
 firstName | encode 
-----------+--------------------
 Nadège    | 4e6164c3a86765
 NadÃ¨ge   | 4e6164c383c2a86765
(2 rows)
db=# SELECT "firstName", convert_from("firstName"::bytea, 'latin1') FROM contact WHERE "lastName" ILIKE 'crochard';
 firstName | convert_from 
-----------+----------------
 Nadège    | NadÃ¨ge
 NadÃ¨ge   | NadÃ\u0083Â¨ge
(2 rows)
db=# SELECT "firstName", convert("firstName"::bytea, 'utf8', 'latin1') FROM contact; 
 firstName | convert 
-----------+------------------
 Nadège    | \x4e6164e86765
 NadÃ¨ge   | \x4e6164c3a86765
(2 rows)

Using Python, I'm able to get the correct string with:

data.encode('latin1').decode('utf8')

Any hints on how to convert this wrongly encoded data in Postgres?

Answer

As you have correctly identified, NadÃ¨ge is the UTF-8 representation of Nadège incorrectly decoded as ISO-8859-1 ("latin-1") and then, in your case, re-encoded to UTF-8 for storage in the DB.

To fix it you need to:

1. Take the current text and encode it as latin-1, which gives you back the original byte string
2. Re-interpret that byte string, decoding it as UTF-8

So:

test=> SELECT convert_from(convert_to('NadÃ¨ge', 'latin-1'), 'utf-8');
 convert_from 
--------------
 Nadège
(1 row)

The Python equivalent would be close to what you wrote, but starts with a unicode representation to illustrate that PostgreSQL stores everything in the database encoding. Something like:

>>> print(u"NadÃ¨ge".encode("latin-1").decode("utf-8"))
Nadège
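To tie the two steps back to the hex dumps in the question, here is a minimal sketch that prints the bytes at each stage. The variable names are illustrative only, and the strings are written with escapes so that the source file's own encoding cannot interfere.

```python
# "NadÃ¨ge" (U+00C3 followed by U+00A8), written with escapes.
mojibake = "Nad\u00c3\u00a8ge"

# What a UTF8 database stores for the mojibake text
# (the second row of the hex dump in the question).
stored = mojibake.encode("utf-8")
print(stored.hex())  # 4e6164c383c2a86765

# Step 1: encode the text as latin-1 to recover the original byte string.
raw = mojibake.encode("latin-1")
print(raw.hex())  # 4e6164c3a86765

# Step 2: decode that byte string as UTF-8.
fixed = raw.decode("utf-8")
print(fixed == "Nad\u00e8ge")  # True ("Nadège")
```

The bytes after step 1 match the hex of the correctly stored row, which is why decoding them as UTF-8 yields the intended name.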

The problem with all your attempted solutions is that the cast from text to bytea yields the bytes in the database encoding. So you're starting with the UTF-8 bytes of the mojibake text (UTF-8 that was mis-decoded as latin-1 and then re-encoded as UTF-8), not with the latin-1 bytes you need. With the cast you'd have to write:

test=> SELECT convert_from(convert_to(convert_from((TEXT 'NadÃ¨ge')::bytea, 'utf-8'), 'latin-1'), 'utf-8');
 convert_from 
--------------
 Nadège
(1 row)

because you have to explicitly decode the UTF-8 bytes produced by the cast back to text before encoding that text as latin-1 and decoding the result as UTF-8.
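The effect of decoding the cast's output as latin-1, as in the question's convert_from("firstName"::bytea, 'latin1') attempt, can be reproduced in Python. This is only a sketch that models the cast as a UTF-8 encode, assuming a UTF8 database as shown in the question; the variable names are made up for illustration.

```python
mojibake = "Nad\u00c3\u00a8ge"  # "NadÃ¨ge", as stored in the table

# mycol::bytea: the text's bytes in the database encoding (UTF8 here).
cast_bytes = mojibake.encode("utf-8")

# convert_from(mycol::bytea, 'latin1'): decoding those bytes as latin-1
# adds a second layer of mojibake instead of removing the first.
worse = cast_bytes.decode("latin-1")
print(ascii(worse))  # 'Nad\xc3\x83\xc2\xa8ge'

# Getting back from there takes two latin-1/UTF-8 round trips.
one_layer = worse.encode("latin-1").decode("utf-8")
fixed = one_layer.encode("latin-1").decode("utf-8")
print(one_layer == mojibake, fixed == "Nad\u00e8ge")  # True True
```

The doubly mangled value corresponds to the NadÃ\u0083Â¨ge row in the question's output.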

In short, you just needed to use convert_to(mycol, 'latin-1') instead of mycol::bytea, and then decode the result with convert_from(..., 'utf-8').
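One thing to watch for: the table in the question holds both correct and mis-encoded rows. A correctly stored Nadège encodes to latin-1 bytes that are not valid UTF-8, so applying the conversion to every row blindly would be expected to fail on the good rows. The following sketch shows a guarded version of the same round trip in Python; the helper name repair is hypothetical, and the "leave it alone on failure" policy is an assumption about what you want.

```python
def repair(value: str) -> str:
    """Undo one layer of UTF-8-read-as-latin-1 mojibake, if there is one."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin-1, or the bytes are not valid UTF-8:
        # treat the value as already correct and leave it alone.
        return value


print(repair("Nad\u00c3\u00a8ge") == "Nad\u00e8ge")  # True: mojibake is fixed
print(repair("Nad\u00e8ge") == "Nad\u00e8ge")  # True: good row is unchanged
```

Pure ASCII values pass through unchanged. A value that genuinely was meant to contain a sequence such as Ã¨ would be altered, so it is worth reviewing the affected rows before writing anything back.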

Comment 1

We solved our encoding issue with Perl yesterday, so I cannot try this out in a real environment. But your explanation of the bytea cast and why our conversions failed seems to be exactly the point. I accept your answer, thank you :)

Comment 2

The second solution, with the ::bytea cast, has one unexpected caveat: backslashes in the text string need to be escaped (i.e. doubled), since bytea's input function treats them as escape characters!

Craig Ringer · Accepted Answer · 2015-12-17 01:39:55Z
