How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

Question 1

I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.

The original database is in UTF8, like so:

-- Database: favela
-- DROP DATABASE favela;
CREATE DATABASE favela
 WITH OWNER = favela
 ENCODING = 'UTF8'
 TABLESPACE = favela
 CONNECTION LIMIT = -1;

I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:

pg_restore: restoring data for table "arena"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xe3a709
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY arena, line 62

When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them - since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.

But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?

I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.

Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?

Thanks, in advance.

Question 2

Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.

Here is more information about that.

Question 3

use iconv to convert to UTF-32 discarding invalid symbols and then back to UTF-8, a UTF-8 to UTF-8 conversion won't catch all bad code points. (eg orphan surrogates)

Question 4

"I don't know how they got there in the first place"

It could have happened as described here - although this generates an error on 8.4:

If you create a table with any text type (i.e. text, varchar(10), etc.), then you can insert an invalid byte sequence into that field using octal escapes.

For instance, if you have a UTF8-encoded database, you can do:

=> CREATE TABLE foo(t TEXT);

=> INSERT INTO foo VALUES(E'377円');

Now, if you COPY the table out, you can't COPY the resulting file back in. That means your pg_dump backups won't be able to restore. The only way to get your data back in is to re-escape that value.

There is a good post on this excellent blog about the general issues and some ways to deal with them

Question 5

I don't recommend blindly running iconv on the plain text dump because it may convert valid characters(eg: Chinese characters ) to some other characters. It is better to find the invalid UTF8 character by running below command.

grep -naxv '.*' plain_text_dump.sql

and then run iconv on the particular data. Check this doc for detailed step by step explanation.

Question 6

Its likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:

$ echo $LANG
en_US

In this case, we can clearly see it is not an UTF-8 encoding, the one which the copy command relies on.

So to fix this, we just set the LANG variable in example to the following:

$ export LANG=en_US.UTF-8

Note: This will only be available for the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.

Reference

Question 7

I referenced the following link which gave me clues to determine the source encoding and then convert it into the desired UTF-8 encoding. Linux Check and Change Encoding

$ file -bi cabot.sql
text/plain; charset=utf-16le
$ iconv -f utf-16le -t utf-8 -o converted.sql cabot.sql
$ file -bi converted.sql
text/plain; charset=utf-8

Question 8

First, I exported a plain file in WIN1252 encoding:

$ sudo -u postgres pg_dump "OLD-DB" -F p -E WIN1252 -f /tmp/db.dumplain

I dropped wrong characters to UTF-8:

$ sudo -u postgres iconv -c -t utf-8 /tmp/db.dumplain > /tmp/db.utf8

Then imported with no problems:

$ sudo -u postgres psql "NEW-DB" < /tmp/db.utf8

Richard Richard 1 · Accepted Answer · 2011-08-18 19:09:00Z

9

Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.

Here is more information about that.

Share

Improve this answer

edited Jun 1, 2018 at 12:14

Colin 't Hart's user avatar

Colin 't Hart

9,48515 gold badges37 silver badges44 bronze badges

answered Aug 18, 2011 at 19:09

Richard's user avatar

Richard Richard

1

use iconv to convert to UTF-32 discarding invalid symbols and then back to UTF-8, a UTF-8 to UTF-8 conversion won't catch all bad code points. (eg orphan surrogates)

Jasen
– Jasen

2015年05月25日 23:04:55 +00:00
Commented May 25, 2015 at 23:04

Add a comment |

Stack Exchange Network

How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

6 Answers 6

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

6 Answers 6

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions