Converting UTF-16 to UTF-8

Question 1

I'm loading a string from a file. When I print the string with:

print my_string
print binascii.hexlify(my_string)

I get:

2DF5
0032004400460035

Meaning this string is UTF-16. I would like to convert this string to UTF-8 so that the above code produces this output:

2DF5
32444635

I've tried:

my_string.decode('utf-8')

Which outputs:

32004400460035

EDIT:

Here's a quick sample:

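 import binascii

 hello = 'hello'.encode('utf-16')   # UTF-16 with a leading 2-byte BOM
 print hello
 print binascii.hexlify(hello)
 hello = hello[2:].decode('utf-8')  # drop the BOM, then decode as UTF-8 (the step in question)
 print hello
 print binascii.hexlify(hello)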

Which produces this output:

��hello
fffe680065006c006c006f00
hello
680065006c006c006f00

Expected output would be:

��hello
fffe680065006c006c006f00
hello
68656c6c6f

Question 2

That's not exactly a difficult task. What have you tried and where did you get stuck?

Question 3

Also, you have UTF-16 data without a BOM. Judging by the leading nulls you have big-endian UTF-16, but this is probably only partial data? Where does the data come from?

Question 4

@MartijnPieters Updated with what I tried. The output is loaded from a file generated by a program on Windows.

Question 5

So if the data is encoded to UTF-16, why are you decoding it as UTF-8? Decoding takes bytes and produces a unicode object.
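To illustrate why the UTF-8 decode in the question appears to work, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2), where `bytes` and `str` are distinct types. For ASCII text, UTF-16 contains only ASCII bytes and NUL bytes, and NUL is a valid UTF-8 character, so the decode raises no error but leaves the NULs in the result.

```python
import binascii

# "hello" encoded as little-endian UTF-16, without a BOM.
raw = "hello".encode("utf-16-le")

# Decoding as UTF-8 does not fail: every byte here is ASCII or NUL,
# and NUL (U+0000) is a valid one-byte UTF-8 sequence.
wrong = raw.decode("utf-8")

# The NUL characters survive, so the text has 10 characters instead of 5.
print(len(wrong))
print(binascii.hexlify(wrong.encode("utf-8")).decode("ascii"))
```

The hex dump is unchanged from the UTF-16 input, which matches what the question observed.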

Question 6

Your output also doesn't make sense in that you are now missing a 00 byte.


Tim Pietzcker · Accepted Answer · 2015-07-03 12:55:03Z

6

Your string appears to have been encoded using utf-16be:

In [9]: s = "2DF5".encode("utf-16be")
In [11]: print binascii.hexlify(s)
0032004400460035

So, in order to convert it to utf-8, you first need to decode it, then encode it:

In [14]: uni = s.decode("utf-16be")
In [15]: uni
Out[15]: u'2DF5'
In [16]: utf = uni.encode("utf-8")
In [17]: utf
Out[17]: '2DF5'

or, in one step:

In [13]: s.decode("utf-16be").encode("utf-8")
Out[13]: '2DF5'
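For readers on Python 3, the same decode-then-encode round trip might look like the sketch below. This is an assumption about a modern interpreter, not part of the original answer, and the helper name `utf16be_to_utf8` is hypothetical. It starts from the exact bytes shown in the question.

```python
import binascii

def utf16be_to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: re-encode BOM-less big-endian UTF-16 as UTF-8."""
    text = data.decode("utf-16-be")  # bytes -> str
    return text.encode("utf-8")      # str -> bytes

# The bytes from the question's hex dump.
raw = binascii.unhexlify("0032004400460035")
converted = utf16be_to_utf8(raw)

print(converted)
print(binascii.hexlify(converted).decode("ascii"))
```

This only applies when the input really is big-endian and carries no BOM; data with a different byte order would decode to the wrong characters.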


edited Jul 3, 2015 at 12:56


4 Comments

Martijn Pieters · Comment · 2015-07-03 12:55:45.157Z

But take into account there could be a BOM in the actual, real-world data.

Juicy · Comment · 2015-07-03 12:57:47.84Z

Thank you, I was not aware of UTF-16be and that was the issue!

Martijn Pieters · Comment · 2015-07-03 13:05:48.483Z

@Juicy: Note that you have a BOM in your actual data; there is no need to pick be or le when you have a BOM, just decode as UTF-16 and the BOM is then not part of the decoded value.
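A short sketch of the BOM behaviour described in this comment, again assuming Python 3 rather than the Python 2 used in the thread. The plain `utf-16` codec reads the BOM to choose the byte order and drops it, while an explicit `utf-16-le` or `utf-16-be` decoder keeps it as the character U+FEFF.

```python
import binascii

# Python's "utf-16" encoder prepends a BOM in the platform's native byte order.
with_bom = "hello".encode("utf-16")

# The "utf-16" decoder uses the BOM to pick the byte order and removes it.
text = with_bom.decode("utf-16")
utf8 = text.encode("utf-8")
print(binascii.hexlify(utf8).decode("ascii"))

# An explicit little-endian decoder does not interpret the BOM,
# so it stays in the decoded text as U+FEFF.
le_with_bom = b"\xff\xfe" + "hello".encode("utf-16-le")
kept = le_with_bom.decode("utf-16-le")
print(repr(kept))
```

So slicing off the first two bytes by hand, as in the question's sample, is unnecessary when the data starts with a BOM.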

Juicy · Comment · 2015-07-03 13:19:07.02Z

@MartijnPieters Thanks, TBH I don't script things like this very often and didn't even know what a BOM is. I'll read up on it for the future!
