I'm loading a string from a file. When I print out the string with:
print my_string
print binascii.hexlify(my_string)
I get:
2DF5
0032004400460035
Meaning this string is UTF-16. I would like to convert this string to UTF-8 so that the above code produces this output:
2DF5
32444635
I've tried:
my_string.decode('utf-8')
Which output:
32004400460035
EDIT:
Here's a quick sample:
hello = 'hello'.encode('utf-16')
print hello
print binascii.hexlify(hello)
hello = hello[2:].decode('utf-8')
print hello
print binascii.hexlify(hello)
Which produces this output:
��hello
fffe680065006c006c006f00
hello
680065006c006c006f00
Expected output would be:
��hello
fffe680065006c006c006f00
hello
68656c6c6f
asked Jul 3, 2015 at 12:47
Juicy
1 Answer
Your string appears to have been encoded using utf-16be:
In [9]: s = "2DF5".encode("utf-16be")
In [11]: print binascii.hexlify(s)
0032004400460035
So, in order to convert it to utf-8, you first need to decode it, then encode it:
In [14]: uni = s.decode("utf-16be")
In [15]: uni
Out[15]: u'2DF5'
In [16]: utf = uni.encode("utf-8")
In [17]: utf
Out[17]: '2DF5'
or, in one step:
In [13]: s.decode("utf-16be").encode("utf-8")
Out[13]: '2DF5'
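To check the result against the hex dump in the question, here is a minimal sketch of the same decode-then-encode round trip. The variable names (`raw`, `text`, `utf8`, `hex_out`) are only illustrative, and the snippet is written so it should also run on Python 3, where `print` is a function and `hexlify` returns bytes:

```python
import binascii

# The bytes shown in the question's hex dump: UTF-16, big-endian, no BOM.
raw = binascii.unhexlify(b"0032004400460035")

# Step 1: decode the UTF-16 bytes into a unicode string.
text = raw.decode("utf-16be")

# Step 2: encode that unicode string as UTF-8 bytes.
utf8 = text.encode("utf-8")

# hexlify returns bytes on Python 3, so decode to get a plain hex string.
hex_out = binascii.hexlify(utf8).decode("ascii")
print(hex_out)  # 32444635
```

Calling `.decode('utf-8')` directly on the original bytes does not help, because the NUL bytes are valid UTF-8 and simply stay in the string.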
answered Jul 3, 2015 at 12:55
Tim Pietzcker
4 Comments
Martijn Pieters
But take into account there could be a BOM in the actual, real-world data.
Juicy
Thank you, I was not aware of UTF-16be and that was the issue!
Martijn Pieters
@Juicy: Note that you have a BOM in your actual data; there is no need to pick be or le when you have a BOM, just decode as UTF-16 and the BOM is then not part of the decoded value.
Juicy
@MartijnPieters Thanks, TBH I don't script things like this very often and didn't even know what a BOM is. I'll read up on it for the future!
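The BOM point from the comments can be sketched like this, using the `fffe...` bytes from the sample in the question's edit. This is only an illustration with made-up variable names; it assumes the data really does start with a BOM, as that sample does:

```python
import binascii
import codecs

# Sample bytes from the question's edit: a BOM (ff fe) followed by
# "hello" in little-endian UTF-16.
data = binascii.unhexlify(b"fffe680065006c006c006f00")

# Confirm the assumption that the data starts with a little-endian BOM.
has_bom = data.startswith(codecs.BOM_UTF16_LE)

# The plain "utf-16" codec reads the BOM to pick the byte order and
# leaves it out of the decoded text, so no manual [2:] slicing is needed.
text = data.decode("utf-16")
utf8 = text.encode("utf-8")

# hexlify returns bytes on Python 3, so decode to get a plain hex string.
hex_out = binascii.hexlify(utf8).decode("ascii")
print(hex_out)  # 68656c6c6f
```

If the data has no BOM, as with the `0032...` bytes at the top of the question, the byte order has to be named explicitly (`utf-16be` or `utf-16le`).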