
I'm loading a string from a file. When I print out the string with:

print my_string
print binascii.hexlify(my_string)

I get:

2DF5
0032004400460035

Meaning this string is UTF-16. I would like to convert this string to UTF-8 so that the above code produces this output:

2DF5
32444635

I've tried:

my_string.decode('utf-8')

Which output:

32004400460035

EDIT:

Here's a quick sample:

 hello = 'hello'.encode('utf-16')
 print hello
 print binascii.hexlify(hello)
 hello = hello[2:].decode('utf-8')
 print hello
 print binascii.hexlify(hello)

Which produces this output:

��hello
fffe680065006c006c006f00
hello
680065006c006c006f00

Expected output would be:

��hello
fffe680065006c006c006f00
hello
68656c6c6f
asked Jul 3, 2015 at 12:47
  • That's not exactly a difficult task. What have you tried and where did you get stuck? Commented Jul 3, 2015 at 12:50
  • Also, you have UTF-16 data without a BOM. Judging by the leading nulls you have big-endian UTF-16, but this is probably only partial data? Where does the data come from? Commented Jul 3, 2015 at 12:52
  • @MartijnPieters Updated with what I tried. The output is loaded from a file generated by a program on Windows. Commented Jul 3, 2015 at 12:53
  • So if the data is encoded to UTF-16, why are you decoding it as UTF-8? Decoding takes bytes and produces a unicode object. Commented Jul 3, 2015 at 12:53
  • Your output also doesn't make sense in that you are now missing a 00 byte. Commented Jul 3, 2015 at 12:53

1 Answer


Your string appears to have been encoded using utf-16be:

In [9]: s = "2DF5".encode("utf-16be")
In [11]: print binascii.hexlify(s)
0032004400460035

So, in order to convert it to utf-8, you first need to decode it, then encode it:

In [14]: uni = s.decode("utf-16be")
In [15]: uni
Out[15]: u'2DF5'
In [16]: utf = uni.encode("utf-8")
In [17]: utf
Out[17]: '2DF5'

or, in one step:

In [13]: s.decode("utf-16be").encode("utf-8")
Out[13]: '2DF5'
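
To check this against the hex dump from the question, here is a minimal sketch. It is written so that it should run on both Python 2 and Python 3; the variable names `raw`, `text` and `utf8_bytes` are illustrative and not taken from the question.

```python
import binascii

# Bytes exactly as shown in the question's hex dump
# (big-endian UTF-16 with no BOM).
raw = binascii.unhexlify(b"0032004400460035")

# Step 1: decode the UTF-16BE bytes into a unicode string.
text = raw.decode("utf-16be")

# Step 2: encode that unicode string as UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Hex dump of the result, one byte per character for this ASCII-only input.
print(binascii.hexlify(utf8_bytes))
```

The hex dump of `utf8_bytes` is `32444635`, which matches the output the question asks for (Python 3 prints it with a `b'...'` wrapper because `hexlify` returns bytes).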
answered Jul 3, 2015 at 12:55

4 Comments

But take into account there could be a BOM in the actual, real-world data.
Thank you, I was not aware of UTF-16be and that was the issue!
@Juicy: Note that you have a BOM in your actual data; there is no need to pick be or le when you have a BOM, just decode as UTF-16 and the BOM is then not part of the decoded value.
@MartijnPieters Thanks, TBH I don't script things like this very often and didn't even know what a BOM is. I'll read up on it for the future!
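
As a sketch of the BOM point raised in the comments: when the data starts with a byte order mark, as in the `hello` sample from the question, decoding with plain `utf-16` lets the codec pick the byte order and drop the BOM. The bytes below are copied from the question's hex dump; the variable names are illustrative, and the snippet is meant to run on both Python 2 and Python 3.

```python
import binascii

# Bytes from the question's sample: BOM (ff fe) followed by
# little-endian UTF-16 for "hello".
raw_with_bom = binascii.unhexlify(b"fffe680065006c006c006f00")

# The generic "utf-16" codec reads the BOM to choose the byte order
# and does not include it in the decoded value.
decoded = raw_with_bom.decode("utf-16")

# Re-encode the unicode string as UTF-8.
as_utf8 = decoded.encode("utf-8")

print(binascii.hexlify(as_utf8))
```

Here the hex dump of `as_utf8` is `68656c6c6f`, the expected output from the question, and there is no need to slice off the first two bytes by hand.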
