I'm working on a script that will ETL data from an Oracle database to PostgreSQL. I'm using jaydebeapi to connect to Oracle and psycopg2 for PSQL. I load the data into PSQL by streaming it into the copy_from function; this worked well for my ETL from a MySQL database. I'm having an issue with one string, but I'm sure there may be others. I have a function that evaluates every field in the result set from Oracle and cleans it up if it's a string. In the source database, Doña Ana is stored in the county table, but it's stored as Do\xf1a Ana, so when I try to load it into PSQL, it throws:
invalid byte sequence for encoding "UTF8": 0xf1 0x61 0x20 0x41
import six
import unicodedata
def prepdata(value):
    encodedvalue = bytearray(value, 'utf-8')
    print(encodedvalue)
    decodedvalue = encodedvalue.decode('utf-8')
    print(decodedvalue)
    cleanedvalue = unicodedata.normalize(u'NFD', decodedvalue).encode('ASCII', 'ignore').decode('utf-8')
    print(cleanedvalue)
    return cleanedvalue
OUTPUT:
b'Do\\xf1a Ana'
Do\xf1a Ana
Do\xf1a Ana
It looks like when I try to encode Do\xf1a Ana, it's just escaping the backslash rather than converting it.
When I try normalizing the string using the interpreter:
>>> x = 'Do\xf1a Ana'
>>> x
'Doña Ana'
>>> p = bytearray(x,'utf-8')
>>> p
bytearray(b'Do\xc3\xb1a Ana')
>>> a = p.decode('utf-8')
>>> a
'Doña Ana'
>>> normal = unicodedata.normalize('NFKD', a).encode('ASCII', 'ignore').decode('utf-8')
>>> normal
'Dona Ana'
Can anyone explain what's going on? Obviously the value coming from the database has something going on with it even though it's coming across as a str.
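A quick check that separates the two cases is to compare `repr()` and `len()` of the value. The sketch below assumes the database value holds a literal backslash followed by `xf1` (four separate characters), which is what the doubled backslash in the output above suggests; the variable names `from_db` and `typed` are made up for illustration.

```python
# Assumption: this is what the script receives from Oracle, a literal
# backslash followed by the characters x, f, 1.
from_db = 'Do\\xf1a Ana'

# What the interpreter session above creates: the escape is resolved by
# the Python parser, so the string holds the single character U+00F1.
typed = 'Do\xf1a Ana'

print(repr(from_db), len(from_db))  # 'Do\\xf1a Ana' 11
print(repr(typed), len(typed))      # 'Doña Ana' 8

# Encoding to UTF-8 leaves the literal backslash sequence untouched ...
print(from_db.encode('utf-8'))      # b'Do\\xf1a Ana'
# ... while the real character becomes the two-byte sequence C3 B1.
print(typed.encode('utf-8'))        # b'Do\xc3\xb1a Ana'
```

If that assumption holds, it would also be consistent with the PostgreSQL error: the text format used by COPY interprets a backslash followed by `x` and hex digits as a raw byte, so `\xf1` would arrive as the single byte 0xF1, which is not valid UTF-8 when followed by `a`.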
-
So your code works just fine in the interpreter? At what line do you get the error? I ran all of your code and it worked in my interpreter. – bart cubrich, Aug 1, 2019 at 17:48
-
Yes, it works fine in the interpreter, but when I run the script with actual data from the database, it fails on the 'Do\xf1a Ana' value. In this case it fails when it attempts to load the data into PSQL; the database is encoded as UTF-8. I don't fully understand the encoding/decoding stuff, but I believe the database should accept the letter 'n' with a tilde. – jlllllll, Aug 1, 2019 at 18:08
1 Answer
I was able to get this to work by decoding with `unicode_escape` after first encoding the string to bytes.
def prepdata(value):
    encodedvalue = value.encode()
    decodedvalue = encodedvalue.decode('unicode_escape')
    cleanedvalue = decodedvalue.replace("\r", " ")
    # a number of other things also happen below,
    # cleaning the string of things that may cause issues, like '\n'.
    return cleanedvalue
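One caveat with the `unicode_escape` approach above: the codec treats every non-escape byte as Latin-1, so a value that already contains a real `ñ` gets mangled when it is first encoded as UTF-8. A possible variant, sketched here with a hypothetical helper name `unescape` that is not part of the original script, is to encode with `latin-1` and the `backslashreplace` error handler instead, so that existing non-ASCII characters survive the round trip.

```python
def unescape(value):
    """Turn literal escapes such as '\\xf1' into the characters they name.

    Hypothetical helper for illustration only.
    """
    # latin-1 keeps U+0000..U+00FF as single bytes; backslashreplace
    # rewrites anything outside that range as a \uXXXX or \UXXXXXXXX
    # escape, which unicode_escape then decodes back to the same character.
    raw = value.encode('latin-1', 'backslashreplace')
    return raw.decode('unicode_escape')


# Literal backslash sequence, as assumed to come from the database.
print(unescape('Do\\xf1a Ana'))    # Doña Ana

# A value that already holds the real character is left unchanged.
print(unescape('Doña Ana'))        # Doña Ana

# For comparison: UTF-8 encoding followed by unicode_escape mangles it.
print('Doña Ana'.encode().decode('unicode_escape'))  # DoÃ±a Ana
```

Either way, any genuine backslash in the data is also treated as the start of an escape (a literal `\n` becomes a newline, for example), so this is only safe if backslashes in the source always mean escapes.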