No NULLs, yet invalid byte sequence for encoding "UTF8": 0x00

Question 1

I've spent the last 8 hours trying to import the output of 'mysqldump --compatible=postgresql' into PostgreSQL 8.4.9, and I've read at least 20 different threads here and elesewhere already about this specific problem, but found no real usable answer that works.

MySQL 5.1.52 data dumped:

mysqldump -u root -p --compatible=postgresql --no-create-info --no-create-db --default-character-set=utf8 --skip-lock-tables rt3 > foo

PostgreSQL 8.4.9 server as destination

Loading the data with 'psql -U rt_user -f foo' is reporting (many of these, here's one example):

psql:foo:29: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

According the following, there are no NULL (0x00) characters in the input file.

database-dumps:rcf-temp1# sed 's/\x0/ /g' < foo > nonulls
database-dumps:rcf-temp1# sum foo nonulls
04730 2545610 foo
04730 2545610 nonulls
database-dumps:rcf-temp1# rm nonulls

Likewise, another check with Perl shows no NULLs:

database-dumps:rcf-temp1# perl -ne '/000円/ and print;' foo
database-dumps:rcf-temp1#

As the "HINT" in the error mentions, I have tried every possible way to set 'client_encoding' to 'UTF8', and I succeed but it has no effect toward solving my problem.

database-dumps:rcf-temp1# psql -U rt_user --variable=client_encoding=utf-8 -c "SHOW client_encoding;" rt3
 client_encoding
-----------------
 UTF8
(1 row)
database-dumps:rcf-temp1#

Perfect, yet:

database-dumps:rcf-temp1# psql -U rt_user -f foo --variable=client_encoding=utf-8 rt3
...
psql:foo:29: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
...

Barring the "According to Hoyle" correct answer, which would be fantastic to hear, and knowing that I really don't care about preserving any non-ASCII characters for this seldom-referenced data, what suggestions do you have?

Update: I get the same error with an ASCII-only version of the same dump file at import time. Truly mind-boggling:

database-dumps:rcf-temp1# # convert any non-ASCII character to a space
database-dumps:rcf-temp1# perl -i.bk -pe 's/[^[:ascii:]]/ /g;' mysql5-dump.sql
database-dumps:rcf-temp1# sum mysql5-dump.sql mysql5-dump.sql.bk
41053 2545611 mysql5-dump.sql
50145 2545611 mysql5-dump.sql.bk
database-dumps:rcf-temp1# cmp mysql5-dump.sql mysql5-dump.sql.bk
mysql5-dump.sql mysql5-dump.sql.bk differ: byte 1304850, line 30
database-dumps:rcf-temp1# # GOOD!
database-dumps:rcf-temp1# psql -U postgres -f mysql5-dump.sql --variable=client_encoding=utf-8 rt3
...
INSERT 0 416
psql:mysql5-dump.sql:30: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 455
INSERT 0 424
INSERT 0 483
INSERT 0 447
INSERT 0 503
psql:mysql5-dump.sql:36: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 502
INSERT 0 507
INSERT 0 318
INSERT 0 284
psql:mysql5-dump.sql:41: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 382
INSERT 0 419
INSERT 0 247
psql:mysql5-dump.sql:45: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 267
INSERT 0 348
^C

One of the tables in question is defined as:

 Table "public.attachments"
 Column | Type | Modifie
-----------------+-----------------------------+--------------------------------
 id | integer | not null default nextval('atta)
 transactionid | integer | not null
 parent | integer | not null default 0
 messageid | character varying(160) |
 subject | character varying(255) |
 filename | character varying(255) |
 contenttype | character varying(80) |
 contentencoding | character varying(80) |
 content | text |
 headers | text |
 creator | integer | not null default 0
 created | timestamp without time zone |
Indexes:
 "attachments_pkey" PRIMARY KEY, btree (id)
 "attachments1" btree (parent)
 "attachments2" btree (transactionid)
 "attachments3" btree (parent, transactionid)

I do not have the liberty to change the type for any part of the DB schema. Doing so would likely break future upgrades of the software, etc.

The likely problem column is 'content' of type 'text' (perhaps others in other tables as well). As I already know from previous research, PostgreSQL will not allow NULL in 'text' values. However, please see above where both sed and Perl show no NULL characters, and then further down where I strip all non-ASCII characters from the entire dump file but it still barfs.

Question 2

What does line 29 of your dump file look like? Something like head -29 foo | tail -1 | cat -v might be of use.

Question 3

What's the definition of the affected table and what does the offending line look like?

Question 4

It's ~1MB of company data. I understand where you're headed of course, though. Here's the end of that line of thought (please pardon my french at the end of the gist/paste): gist.github.com/1525788

Question 5

tscho: As indicated, that example error line is one of hundreds of these errors.

Question 6

One or more of those character/text fields MAY have 0x00 for its content.

Try the following:

SELECT * FROM rt3 where some_text_field = 0x00 LIMIT 1;

If this returns any single row then try updating those character/text fields with:

UPDATE rt3 SET some_text_field = '' WHERE some_text_field = 0x00;

Afterwards, try another MYSQLDUMP ... ( and PostgreSQL import method ).

Question 7

This helped me find my stray null characters, though I needed to use colname LIKE concat('%', 0x00, '%'). Found them in fields containing serialized PHP arrays.

Question 8

I had the same problem using MySQL version 5.0.51 and Postgres version 9.3.4.0. I solved the "invalid byte sequence for encoding "UTF8": 0x00" issue after seeing Daniel Vérité's comment that "mysqldump in postgresql mode will dump null bytes as 0円 in strings, so you probably want to search for that sequence of characters."

Sure enough a grep finally revealed the NULL chars.

grep \\\0円 dump.sql

I replaced the NULL chars using the following command

sed -i BAK 's/\0円//g' dump.sql

Postgres was then able to successfully load dump.sql

Question 9

Won't this insert empty strings instead of actual NULL values?

Question 10

You can get this error without any NULL byte or any non-ascii character in the file. Example in an utf8 database:

select E'ab0円cd';

will yield:

ERROR: invalid byte sequence for encoding "UTF8": 0x00 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

mysqldump in postgresql mode will dump null bytes as 0円 in strings, so you probably want to search for that sequence of characters.

Question 11

I half remember a problem like this. I think I ended up migrating the schema then dumping the data as csv and loading the data from the csv file. I remember having to update the csv file (using unix tools like sed or unixtodos) or using open office calc (excell) to fix up some items that were errors on the import step - it could be as simple as opening and re-saving the file.

Farley Inglis Farley Inglis 561 bronze badge · Accepted Answer · 2013-10-10 21:31:47Z

One or more of those character/text fields MAY have 0x00 for its content.

Try the following:

SELECT * FROM rt3 where some_text_field = 0x00 LIMIT 1;

If this returns any single row then try updating those character/text fields with:

UPDATE rt3 SET some_text_field = '' WHERE some_text_field = 0x00;

Afterwards, try another MYSQLDUMP ... ( and PostgreSQL import method ).

This helped me find my stray null characters, though I needed to use colname LIKE concat('%', 0x00, '%'). Found them in fields containing serialized PHP arrays.

Stack Exchange Network

No NULLs, yet invalid byte sequence for encoding "UTF8": 0x00

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

No NULLs, yet invalid byte sequence for encoding "UTF8": 0x00

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions