I have a CSV file that appears to be UTF-16, dumped from SQL Server. The file contains properly encoded Spanish accents, but some of the rows are encoded differently. Like this:
0xd83d0xde1b0xd83d0xde1b0xd83d0xde1b
This seems to be a strange encoding for
\ud83d\ude1b\ud83d\ude1b\ud83d\ude1b
\ud83d\ude1b is a surrogate pair for an emoji
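(For reference, the standard UTF-16 surrogate-pair arithmetic confirms which code point this pair encodes; this check is mine, not part of the dump:)

```python
# Combine a UTF-16 high/low surrogate pair into a single code point.
high, low = 0xD83D, 0xDE1B
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f61b, i.e. U+1F61B FACE WITH STUCK-OUT TONGUE
```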
I need to convert everything to a nice, neat UTF-8 file. I tried endless combinations of bytearray(), encode(), decode(), and so on.
How can I convert this file of mixed UTF-16 and escaped UTF-16 into proper Python 3 strings, and finally save them to a new UTF-8 file?
You can convert the hex data like this:
>>> import binascii
>>> s = '0xd83d0xde1b0xd83d0xde1b0xd83d0xde1b'
>>> # Remove the leading '0x'
>>> hs = s.replace('0x', '')
>>> # Convert from hex to bytes
>>> bs = binascii.unhexlify(hs)
>>> bs
b'\xd8=\xde\x1b\xd8=\xde\x1b\xd8=\xde\x1b'
>>> # Decode to str
>>> bs.decode('utf-16be')
'😛😛😛'
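To handle the whole file, you can apply the same idea to every escaped run while leaving the correctly encoded text alone. Here is a minimal sketch (the regex, the helper names, and the file paths are my own assumptions, not from your dump):

```python
import re
import binascii

# Matches one or more hex-escaped UTF-16 code units like '0xd83d0xde1b'.
HEX_RUN = re.compile(r'(?:0x[0-9a-fA-F]{4})+')

def decode_hex_run(match):
    # Strip the '0x' prefixes, turn the hex digits into bytes,
    # then decode as big-endian UTF-16 (surrogate pairs included).
    hex_digits = match.group(0).replace('0x', '')
    return binascii.unhexlify(hex_digits).decode('utf-16be')

def convert_file(in_path, out_path):
    # Read the UTF-16 dump, fix the escaped runs, write UTF-8.
    with open(in_path, encoding='utf-16') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(HEX_RUN.sub(decode_hex_run, line))

print(HEX_RUN.sub(decode_hex_run, 'hola 0xd83d0xde1b0xd83d0xde1b'))
```

Note that `decode('utf-16be')` joins valid surrogate pairs into a single code point automatically; a lone (unpaired) surrogate in the data would raise a `UnicodeDecodeError` instead.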