21

I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.

The original database is in UTF8, like so:

-- Database: favela
-- DROP DATABASE favela;
CREATE DATABASE favela
 WITH OWNER = favela
 ENCODING = 'UTF8'
 TABLESPACE = favela
 CONNECTION LIMIT = -1;

I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:

pg_restore: restoring data for table "arena"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xe3a709
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY arena, line 62

When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them - since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.

But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?

I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.

Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?

Thanks, in advance.

asked Aug 18, 2011 at 17:38

6 Answers 6

9

Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.

Here is more information about that.

Colin 't Hart
9,48515 gold badges37 silver badges44 bronze badges
answered Aug 18, 2011 at 19:09
1
  • use iconv to convert to UTF-32 discarding invalid symbols and then back to UTF-8, a UTF-8 to UTF-8 conversion won't catch all bad code points. (eg orphan surrogates) Commented May 25, 2015 at 23:04
7

"I don't know how they got there in the first place"

It could have happened as described here - although this generates an error on 8.4:

If you create a table with any text type (i.e. text, varchar(10), etc.), then you can insert an invalid byte sequence into that field using octal escapes.

For instance, if you have a UTF8-encoded database, you can do:

=> CREATE TABLE foo(t TEXT);

=> INSERT INTO foo VALUES(E'377円');

Now, if you COPY the table out, you can't COPY the resulting file back in. That means your pg_dump backups won't be able to restore. The only way to get your data back in is to re-escape that value.

There is a good post on this excellent blog about the general issues and some ways to deal with them

answered Aug 18, 2011 at 21:59
0
3

I don't recommend blindly running iconv on the plain text dump because it may convert valid characters(eg: Chinese characters ) to some other characters. It is better to find the invalid UTF8 character by running below command.

grep -naxv '.*' plain_text_dump.sql

and then run iconv on the particular data. Check this doc for detailed step by step explanation.

answered Jun 5, 2019 at 6:07
1

Its likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:

$ echo $LANG
en_US

In this case, we can clearly see it is not an UTF-8 encoding, the one which the copy command relies on.

So to fix this, we just set the LANG variable in example to the following:

$ export LANG=en_US.UTF-8

Note: This will only be available for the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.

Reference

answered Aug 14, 2014 at 0:59
0

I referenced the following link which gave me clues to determine the source encoding and then convert it into the desired UTF-8 encoding. Linux Check and Change Encoding

$ file -bi cabot.sql
text/plain; charset=utf-16le
$ iconv -f utf-16le -t utf-8 -o converted.sql cabot.sql
$ file -bi converted.sql
text/plain; charset=utf-8
answered Feb 6, 2020 at 13:58
0

First, I exported a plain file in WIN1252 encoding:

$ sudo -u postgres pg_dump "OLD-DB" -F p -E WIN1252 -f /tmp/db.dumplain

I dropped wrong characters to UTF-8:

$ sudo -u postgres iconv -c -t utf-8 /tmp/db.dumplain > /tmp/db.utf8

Then imported with no problems:

$ sudo -u postgres psql "NEW-DB" < /tmp/db.utf8
answered Feb 29, 2024 at 18:39

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.