When running a \copy (either pgAdmin or aws_s3.table_import_from_s3) of a 1.6 GB file into an AWS Aurora PostgreSQL-compatible database, I'm getting the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc 0x36
CONTEXT: COPY staging, line 99779: "L24000403170365 ACTIVEZONE LLC ..."
EDIT: Here's what I could pull for the table definition (but let me know if you want more):
column_name | data_type | character_maximum_length | is_nullable | column_default
---|---|---|---|---
raw | text | [null] | YES | [null]
EDIT: I also tried changing the column to bytea with no effect.
The source is supposed to be ASCII, but I get the same error with explicit encodings like utf8, latin1, win1251, and win1252.
EDIT: As requested in a reply, here's more information about the import commands. In pgAdmin 4, I'm right-click importing into the table, which shows the following under the covers:
--command " "\\copy public.staging (\"raw\") FROM 'C:/data.txt' DELIMITER '|' ENCODING 'UTF8';""
I also use pgAdmin 4 to trigger the S3 table import by calling the query:
SELECT aws_s3.table_import_from_s3(
'staging',
'',
'(DELIMITER ''|'')',
aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
Under the covers, table_import_from_s3 calls the command:
copy staging from '/rdsdbdata/extensions/aws_s3/{{internal filename}}' with (DELIMITER '|')
The answer to similar questions is to clean up the source data, so I pulled up Python and tried to find the offending character. I couldn't find any evidence of an unusual character at or around the referenced line. For the sake of argument, I believe the following will scan the entire file (and you can see the results inline):
>>> def charinfile(filename, bytechar):
...     with open(filename, 'rb') as file:
...         byteline = file.readline()
...         while byteline:  # readline returns b'' at EOF
...             if byteline.find(bytechar) != -1:
...                 print("found!")
...                 return byteline
...             byteline = file.readline()
...         else:
...             print("not found")
...
>>> charinfile(filename, b'\xdc')
not found
>>> charinfile(filename, b'\xdc\x36')
not found
>>> charinfile(filename, b'6')  # make sure the code is working
found!
I've also tried versions that use strings instead of bytes, with the same results. I can confirm that there are no blank lines before EOF (I've used line counters to verify that I reach ~1M rows).
What am I missing?
3 Answers
When using the TEXT copy format, the backslash character has a special meaning. The line in your file that produces the failure contains a sequence like this:
FL33463\3346
\334 is interpreted as an octal number because of the following rule:
\digits Backslash followed by one to three octal digits specifies the byte with that numeric code
334 in octal is 0xdc in hexadecimal, and 6 in ASCII is 0x36 in hexadecimal, and that makes the 0xdc 0x36 that the UTF8 decoder complains about.
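To see the mechanism outside the database, here is a minimal Python sketch. It is not PostgreSQL's actual parser, just the same \digits rule applied by hand to the suspect substring, and it produces the exact byte pair from the error:
import re

raw = rb"FL33463\3346"  # the suspect substring from the data file

# COPY's TEXT-format rule, applied by hand: a backslash followed by
# one to three octal digits becomes the byte with that numeric code.
unescaped = re.sub(
    rb"\\([0-7]{1,3})",
    lambda m: bytes([int(m.group(1), 8) & 0xFF]),
    raw,
)
print(unescaped)  # b'FL33463\xdc6' -- i.e. 0xdc followed by 0x36

try:
    unescaped.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 0xdc starts a 2-byte sequence, and 0x36 is not a valid continuation byte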
The principled solution to that problem is to quote all backslashes in the input file by doubling them. Alternatively, you can use CSV with quotes and delimiters that don't exist in the file, as you did, but that's hazardous if the data is not clean.
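If you go the doubling route, a minimal preprocessing sketch (assuming the file is simply line-oriented; data.txt and data_escaped.txt are illustrative names):
with open("data.txt", "rb") as src, open("data_escaped.txt", "wb") as dst:
    for line in src:
        # double every backslash so COPY's TEXT format reads each one literally
        dst.write(line.replace(b"\\", b"\\\\"))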
The suggested code from @laurenz-albe worked, but it also worked with UTF8, so the ENCODING wasn't actually the issue. After some testing, switching from FORMAT 'text' to FORMAT 'csv' was the critical change. I continued to use the recommended code points (they seem like a best practice to prevent character collisions), resulting in a query like:
SELECT aws_s3.table_import_from_s3(
'staging',
'',
'(FORMAT ''csv'', DELIMITER E''\u0007'', QUOTE E''\u0004'')',
aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
... or in psql:
\copy public.staging FROM 'C:\\data.txt' (FORMAT 'csv', DELIMITER E'\u0007', QUOTE E'\u0004')
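As a quick sanity check after the CSV-format load (an illustrative query, assuming the single-column staging table from the question), you can count the rows that still contain a literal backslash; with FORMAT 'text' those would have been consumed as escape introducers instead of loaded verbatim:
SELECT count(*) FROM public.staging WHERE strpos(raw, E'\\') > 0;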
Your data must have some different encoding. Connect to the database with psql and try:
\copy public.staging FROM 'C:\\data.txt' (FORMAT 'csv', DELIMITER E'\u0007', QUOTE E'\u0004', ENCODING 'WIN1252')
This assumes that the ASCII characters with the code points 4 and 7 don't occur anywhere in your file.
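To verify that assumption before loading, a rough Python sketch (streamed in chunks so the 1.6 GB file isn't held in memory; the filename is illustrative):
with open("data.txt", "rb") as f:
    counts = {0x04: 0, 0x07: 0}
    for chunk in iter(lambda: f.read(1 << 20), b""):
        counts[0x04] += chunk.count(b"\x04")  # EOT, the proposed QUOTE
        counts[0x07] += chunk.count(b"\x07")  # BEL, the proposed DELIMITER
print(counts)  # both counts should be 0 for the trick to be safe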
- Your query worked, but it also works with UTF8 and DELIMITER '|'. That helped me narrow the issue down to the FORMAT, where (the default) text fails. Once I was using csv, the pgAdmin 4 dialog was probably doomed since you can't use the code points (which was wise). Had to use psql or aws_s3.table_import_from_s3. Do you want to edit your answer to blame the FORMAT so I can accept your answer? – claytond, Mar 14, 2025 at 17:03
- Hmm, that's odd, and I'd like to understand it first. That would mean that there really is no byte \xdc in the data. Then where does the error message come from? Mysterious. Just changing the FORMAT doesn't make input characters disappear... – Laurenz Albe, Mar 14, 2025 at 20:30
- I had already tested win1252 (noted in the original post), and my Python code should have found the offending character. That's why I tested variants after your code (unexpectedly) worked, e.g. \copy public.staging FROM 'c:\\data.txt' (FORMAT 'csv', DELIMITER E'\u0007', QUOTE E'\u0004'), which returned COPY 1173598. I agree that it's odd (why did it never occur to me to test csv before posting?) but I can reliably reproduce. – claytond, Mar 14, 2025 at 21:01
- I believe you. Still, I'd have to understand where that byte comes from before I can confidently edit the answer. – Laurenz Albe, Mar 14, 2025 at 21:15
- If you want to do a deeper dive, I extracted 20 lines around the one with the issue and can reproduce it. You can get the file at gist.github.com/claytondaley/01a99fece514c43bfe38f361ed64dca2, and I verified that I can reproduce locally with the downloaded gist. – claytond, Mar 16, 2025 at 16:38