I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.
The original database is in UTF8, like so:
-- Database: favela
-- DROP DATABASE favela;
CREATE DATABASE favela
WITH OWNER = favela
ENCODING = 'UTF8'
TABLESPACE = favela
CONNECTION LIMIT = -1;
I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:
pg_restore: restoring data for table "arena"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xe3a709
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY arena, line 62
When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them - since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.
But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?
I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.
Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?
Thanks, in advance.
6 Answers 6
Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.
Here is more information about that.
-
use iconv to convert to UTF-32 discarding invalid symbols and then back to UTF-8, a UTF-8 to UTF-8 conversion won't catch all bad code points. (eg orphan surrogates)Jasen– Jasen2015年05月25日 23:04:55 +00:00Commented May 25, 2015 at 23:04
"I don't know how they got there in the first place"
It could have happened as described here - although this generates an error on 8.4:
If you create a table with any text type (i.e. text, varchar(10), etc.), then you can insert an invalid byte sequence into that field using octal escapes.
For instance, if you have a UTF8-encoded database, you can do:
=> CREATE TABLE foo(t TEXT);
=> INSERT INTO foo VALUES(E'377円');
Now, if you COPY the table out, you can't COPY the resulting file back in. That means your pg_dump backups won't be able to restore. The only way to get your data back in is to re-escape that value.
There is a good post on this excellent blog about the general issues and some ways to deal with them
I don't recommend blindly running iconv on the plain text dump because it may convert valid characters(eg: Chinese characters ) to some other characters. It is better to find the invalid UTF8 character by running below command.
grep -naxv '.*' plain_text_dump.sql
and then run iconv on the particular data. Check this doc for detailed step by step explanation.
Its likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:
$ echo $LANG
en_US
In this case, we can clearly see it is not an UTF-8 encoding, the one which the copy command relies on.
So to fix this, we just set the LANG variable in example to the following:
$ export LANG=en_US.UTF-8
Note: This will only be available for the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.
I referenced the following link which gave me clues to determine the source encoding and then convert it into the desired UTF-8 encoding. Linux Check and Change Encoding
$ file -bi cabot.sql
text/plain; charset=utf-16le
$ iconv -f utf-16le -t utf-8 -o converted.sql cabot.sql
$ file -bi converted.sql
text/plain; charset=utf-8
First, I exported a plain file in WIN1252 encoding:
$ sudo -u postgres pg_dump "OLD-DB" -F p -E WIN1252 -f /tmp/db.dumplain
I dropped wrong characters to UTF-8:
$ sudo -u postgres iconv -c -t utf-8 /tmp/db.dumplain > /tmp/db.utf8
Then imported with no problems:
$ sudo -u postgres psql "NEW-DB" < /tmp/db.utf8