When running a \copy (either pgAdmin or aws_s3.table_import_from_s3) of a 1.6 GB file into an AWS Aurora PostgreSQL-compatible database, I'm getting the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc 0x36
CONTEXT: COPY staging, line 99779: "L24000403170365 ACTIVEZONE LLC ..."
EDIT: Here's what I could pull for the table definition (but let me know if you want more):
column_name | data_type | character_maximum_length | is_nullable | column_default
---|---|---|---|---
raw | text | [null] | YES | [null]
EDIT: I also tried changing the column to bytea with no effect.
The source is supposed to be ASCII, but I get the same error with explicit encodings like utf8, latin1, win1251, and win1252.
EDIT: As requested in a reply, here's more information about the import commands. In pgAdmin 4, I'm right-click importing into the table, which shows the following under the covers:
--command " "\\copy public.staging (\"raw\") FROM 'C:/data.txt' DELIMITER '|' ENCODING 'UTF8';""
I also use pgAdmin 4 to trigger the S3 table import by calling the query:
SELECT aws_s3.table_import_from_s3(
'staging',
'',
'(DELIMITER ''|'')',
aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
Under the covers, table_import_from_s3 calls the command:
copy staging from '/rdsdbdata/extensions/aws_s3/{{internal filename}}' with (DELIMITER '|')
The answer to similar questions is to clean up the source data, so I pulled up Python and tried to find the offending character. I couldn't find any evidence of an unusual character at or around the referenced line. For the sake of argument, I believe the following will scan the entire file (and you can see the results inline):
>>> def charinfile(filename, bytechar):
...     with open(filename, 'rb') as file:
...         byteline = file.readline()
...         while byteline:  # readline returns b'' at EOF
...             if byteline.find(bytechar) != -1:
...                 print("found!")
...                 return byteline
...             byteline = file.readline()
...         else:
...             print("not found")
...
>>> charinfile(filename, b'\xdc')
not found
>>> charinfile(filename, b'\xdc\x36')
not found
>>> charinfile(filename, b'6')  # make sure the code is working
found!
I've also tried versions that use strings instead of bytes, with the same results. I can confirm that there are no blank lines before EOF (I've used line counters to verify that I reach ~1M rows).
What am I missing?
3 Answers
When using the TEXT copy format, the backslash character has a special meaning. The line in your file that produces the failure contains a sequence like this:
FL33463\3346
\334 is interpreted as an octal number because of the following rule:
\digits Backslash followed by one to three octal digits specifies the byte with that numeric code
334 in octal is 0xdc in hexadecimal, and 6 in ASCII is 0x36 in hexadecimal, and that makes the 0xdc 0x36 that the UTF8 decoder complains about.
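To see the mechanism outside the database, here is a minimal Python sketch. It is not PostgreSQL's actual parser, just the same \digits rule applied by hand to the suspect substring, and it produces the exact byte pair from the error:
import re

raw = rb"FL33463\3346"  # the suspect substring from the data file

# COPY's TEXT-format rule, applied by hand: a backslash followed by
# one to three octal digits becomes the byte with that numeric code.
unescaped = re.sub(
    rb"\\([0-7]{1,3})",
    lambda m: bytes([int(m.group(1), 8) & 0xFF]),
    raw,
)
print(unescaped)  # b'FL33463\xdc6' -- i.e. 0xdc followed by 0x36

try:
    unescaped.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 0xdc starts a 2-byte sequence, and 0x36 is not a valid continuation byte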
The principled solution to that problem is to quote all backslashes in the input file by doubling them. Alternatively, you can use CSV with quotes and delimiters that don't exist in the file, as you did, but that's hazardous if the data is not clean.
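If you go the doubling route, a minimal preprocessing sketch (assuming the file is simply line-oriented; data.txt and data_escaped.txt are illustrative names):
with open("data.txt", "rb") as src, open("data_escaped.txt", "wb") as dst:
    for line in src:
        # double every backslash so COPY's TEXT format reads each one literally
        dst.write(line.replace(b"\\", b"\\\\"))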
The suggested code from @laurenz-albe worked, but it also worked with UTF8, so the ENCODING wasn't actually the issue. After some testing, switching from FORMAT 'text' to FORMAT 'csv' was the critical change. I continued to use the recommended code points (they seem like a best practice to prevent character collisions), resulting in a query like:
SELECT aws_s3.table_import_from_s3(
'staging',
'',
'(FORMAT ''csv'', DELIMITER E''\u0007'', QUOTE E''\u0004'')',
aws_commons.create_s3_uri('data', 'data.txt', 'us-east-1')
);
... or in psql:
\copy public.staging FROM 'C:\\data.txt' (FORMAT 'csv', DELIMITER E'\u0007', QUOTE E'\u0004')
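As a quick sanity check after the CSV-format load (an illustrative query, assuming the single-column staging table from the question), you can count the rows that still contain a literal backslash; with FORMAT 'text' those would have been consumed as escape introducers instead of loaded verbatim:
SELECT count(*) FROM public.staging WHERE strpos(raw, E'\\') > 0;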
Your data must have some different encoding. Connect to the database with psql and try:
\copy public.staging FROM 'C:\\data.txt' (FORMAT 'csv', DELIMITER E'\u0007', QUOTE E'\u0004', ENCODING 'WIN1252')
This assumes that the ASCII characters with the code points 4 and 7 don't occur anywhere in your file.
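To verify that assumption before loading, a rough Python sketch (streamed in chunks so the 1.6 GB file isn't held in memory; the filename is illustrative):
with open("data.txt", "rb") as f:
    counts = {0x04: 0, 0x07: 0}
    for chunk in iter(lambda: f.read(1 << 20), b""):
        counts[0x04] += chunk.count(b"\x04")  # EOT, the proposed QUOTE
        counts[0x07] += chunk.count(b"\x07")  # BEL, the proposed DELIMITER
print(counts)  # both counts should be 0 for the trick to be safe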
- Your query worked, but it also works with UTF8 and DELIMITER '|'. That helped me narrow the issue down to the FORMAT, where (the default) text fails. Once I was using csv, the pgAdmin 4 dialog was probably doomed since you can't use the code points (which was wise). Had to use psql or aws_s3.table_import_from_s3. Do you want to edit your answer to blame the FORMAT so I can accept your answer? – claytond, Mar 14, 2025 at 17:03
- Hmm, that's odd, and I'd like to understand it first. That would mean that there really is no byte \xdc in the data. Then where does the error message come from? Mysterious. Just changing the FORMAT doesn't make input characters disappear... – Laurenz Albe, Mar 14, 2025 at 20:30
- I had already tested win1252 (noted in the original post), and my Python code should have found the offending character. That's why I tested variants after your code (unexpectedly) worked, e.g. \copy public.staging FROM 'c:\\data.txt' (FORMAT 'csv', DELIMITER E'\u0007', QUOTE E'\u0004'), which returned COPY 1173598. I agree that it's odd (why did it never occur to me to test csv before posting?) but I can reliably reproduce. – claytond, Mar 14, 2025 at 21:01
- I believe you. Still, I'd have to understand where that byte comes from before I can confidently edit the answer. – Laurenz Albe, Mar 14, 2025 at 21:15
- If you want to do a deeper dive, I extracted 20 lines around the one with the issue and can reproduce it. You can get the file at gist.github.com/claytondaley/01a99fece514c43bfe38f361ed64dca2, and I verified that I can reproduce locally with the downloaded gist. – claytond, Mar 16, 2025 at 16:38