String Encoding/Decoding Issues with DB Data

Question 1

I'm working on a script that will ETL data from an Oracle database to PostgreSQL. I'm using jaydebeapi to connect to Oracle and psycopg2 for PSQL. I load the data into PSQL by streaming it into the copy_from function, which worked well for my ETL from a MySQL database. I'm having an issue with one string, but I'm sure there may be others. I have a function that evaluates every field in the result set from Oracle and cleans it up if it's a string. In the source database, Doña Ana is stored in the county table, but it comes across as Do\xf1a Ana, so when I try to load it into PSQL, it throws:

invalid byte sequence for encoding "UTF8": 0xf1 0x61 0x20 0x41

import six
import unicodedata
def prepdata(value): 
 encodedvalue = bytearray(value, 'utf-8')
 print(encodedvalue)
 decodedvalue = encodedvalue.decode('utf-8')
 print(decodedvalue)
 cleanedvalue = unicodedata.normalize(u'NFD', decodedvalue).encode('ASCII', 'ignore').decode('utf-8')
 print(cleanedvalue)
 return cleanedvalue

OUTPUT:

b'Do\\xf1a Ana' 
Do\xf1a Ana
Do\xf1a Ana

It looks like when I try to encode Do\xf1a Ana, it just escapes the backslash rather than converting it.

When I try normalizing the string using the interpreter:

>>> x = 'Do\xf1a Ana'
>>> x
'Doña Ana'
>>> p = bytearray(x,'utf-8')
>>> p
bytearray(b'Do\xc3\xb1a Ana')
>>> a = p.decode('utf-8')
>>> a
'Doña Ana'
>>> normal = unicodedata.normalize('NFKD', a).encode('ASCII', 'ignore').decode('utf-8')
>>> normal
'Dona Ana'

Can anyone explain what's going on? Obviously the value coming from the database has something going on with it even though it's coming across as a str.
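
One possible reading of the difference between the interpreter session and the script, sketched below: in the interpreter, Python parses `\xf1` inside the string literal into the single character ñ, whereas the doubled backslash in the script output suggests the value from Oracle holds a real backslash followed by the characters x, f, 1. The `from_db` value here is an assumption about what the driver returns, not something verified against the actual database. PostgreSQL's COPY text format also treats backslash sequences specially, which may be how that literal text ends up as a raw 0xf1 byte on load, but that too is an assumption.

```python
# What the interpreter session tested: Python parses \xf1 in the literal into ñ.
literal = 'Do\xf1a Ana'

# Assumed shape of the value arriving from Oracle: a real backslash, then x, f, 1.
from_db = 'Do\\xf1a Ana'

print(len(literal), len(from_db))   # 8 11
print(literal.encode('utf-8'))      # b'Do\xc3\xb1a Ana'  (ñ becomes two UTF-8 bytes)
print(from_db.encode('utf-8'))      # b'Do\\xf1a Ana'     (backslash stays a backslash)
```

If the lengths differ like this for the real data, encoding to UTF-8 cannot turn the four characters into ñ, because there is no ñ in the string to begin with.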

Question 2

So your code works just fine in the interpreter? At what line do you get the error? I ran all of your code and it worked in my interpreter.

Question 3

Yes, it works fine in the interpreter, but when I run the script with actual data from the database, it fails on the 'Do\xf1a Ana' value. In this case it fails when attempting to load the data into PSQL; the database is encoded as UTF-8. I don't fully understand encoding and decoding, but I believe the database should accept the letter 'n' with a tilde.

jlllllll 235 bronze badges · Accepted Answer · 2019-08-01 19:16:44Z

I was able to get this to work by first encoding the string to bytes and then decoding it with 'unicode_escape'.

def prepdata(value): 
 encodedvalue = value.encode()
 decodedvalue = encodedvalue.decode('unicode_escape')
 cleanedvalue = decodedvalue.replace("\r"," ")
 # Other cleanup steps follow here, removing things that may cause
 # issues on load, such as '\n'.
 return cleanedvalue
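
A small sketch of what the 'unicode_escape' step does, along with one caveat. It assumes the database value contains a literal backslash sequence, as the question's output suggests. The `unescape_latin1` helper is a hypothetical variant, not part of the answer above; it is shown only because decoding UTF-8 bytes with 'unicode_escape' treats every non-escape byte as Latin-1, which garbles values that already contain a real ñ.

```python
def unescape(value: str) -> str:
    # Same approach as the answer above: encode to bytes (UTF-8 by default),
    # then let 'unicode_escape' interpret sequences such as \xf1.
    return value.encode().decode('unicode_escape')


def unescape_latin1(value: str) -> str:
    # Hypothetical variant: encode as Latin-1 so existing non-ASCII characters
    # round-trip; characters outside Latin-1 become \uXXXX escapes that
    # 'unicode_escape' decodes back again.
    return value.encode('latin-1', 'backslashreplace').decode('unicode_escape')


from_db = 'Do\\xf1a Ana'      # assumed: literal backslash, x, f, 1
already_ok = 'Do\xf1a Ana'    # a value that already contains ñ

print(unescape(from_db))            # Doña Ana
print(unescape(already_ok))         # DoÃ±a Ana  (mojibake)
print(unescape_latin1(from_db))     # Doña Ana
print(unescape_latin1(already_ok))  # Doña Ana
```

Either version can still raise an error on input that ends in a lone backslash, so it may be worth checking how common backslashes are in the source data before applying this to every string field.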
