Why does this work:
a = 'a'.encode('utf-8')
print unicode(a)
>>> u'a'
And this gives me an error:
b = 'b'.encode('utf-8_sig')
print unicode(b)
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
2 Answers
Because you haven't told unicode what encoding to use:
>>> a = 'a'.encode('utf-8')
>>> print unicode(a)
a
>>> b = 'b'.encode('utf-8_sig')
>>> print unicode(b)
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print unicode(b)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
>>> print unicode(b, 'utf-8_sig')
b
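The session above is Python 2. The same mechanics can be sketched in Python 3 (where decoding is always explicit via bytes.decode(), and failing to name the codec is a TypeError rather than a silent ascii default):

```python
# Python 3 sketch of the same situation: decoding must name a codec.
a = 'a'.encode('utf-8')        # b'a' -- plain ASCII bytes
b = 'b'.encode('utf-8-sig')    # b'\xef\xbb\xbfb' -- BOM prepended

# Decoding with the matching codec strips the BOM again.
print(a.decode('utf-8'))       # -> a
print(b.decode('utf-8-sig'))   # -> b

# Decoding b as plain ascii fails on the first BOM byte, mirroring the
# Python 2 UnicodeDecodeError in the question.
try:
    b.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)
```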
2 Comments
'a'.encode('utf-8') is just 'a', so you don't need to tell unicode how to deal with it, whereas 'b'.encode('utf-8_sig') is '\xef\xbb\xbfb'.
The error 'ascii' codec can't decode byte 0xef says two things: unicode(b) uses the ascii codec (sys.getdefaultencoding()) to decode the bytestring, and the \xef byte is not in the ascii range. It is the first byte of the BOM that the 'utf-8-sig' encoding prepends (used on Windows).
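To make the comment concrete, a small Python 3 sketch shows that the extra bytes are exactly the three-byte UTF-8 byte-order mark, which the stdlib exposes as codecs.BOM_UTF8:

```python
import codecs

encoded = 'b'.encode('utf-8-sig')
print(repr(encoded))                    # b'\xef\xbb\xbfb'

# codecs.BOM_UTF8 is the three-byte UTF-8 byte-order mark,
# and 'utf-8-sig' prepends it on encoding.
assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'
assert encoded.startswith(codecs.BOM_UTF8)
```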
The first example works because the bytestring 'a' is ascii. 'a'.encode('utf-8') is equivalent to 'a'.decode(sys.getdefaultencoding()).encode('utf-8'), and in this case it is equal to 'a' itself.
In general: bytestring.decode(character_encoding) gives a unicode_string, and unicode_string.encode(character_encoding) gives a bytestring. A bytestring is a sequence of bytes; a Unicode string is a sequence of Unicode codepoints.
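The decode/encode round trip can be sketched in Python 3, where str plays the role of Python 2's unicode and bytes plays the role of the bytestring:

```python
# Round trip: Unicode string -> bytes -> Unicode string.
text = 'héllo'                        # str: a sequence of Unicode codepoints
raw = text.encode('utf-8')            # bytes: a sequence of bytes
assert isinstance(raw, bytes)
assert raw.decode('utf-8') == text    # decoding reverses encoding

# The same codepoints can be serialized by different codecs.
assert text.encode('utf-16').decode('utf-16') == text
```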
Do not call .encode() on bytestrings. 'a' is a bytestring literal in Python 2. u'a' is a Unicode literal.
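Python 3 removed this trap entirely: bytes objects have no .encode() method, so the Python 2 implicit ascii-decode-then-encode path cannot occur. A quick sketch:

```python
data = b'a'    # bytes literal (Python 2's 'a')
text = u'a'    # str literal (Python 2's u'a'; the u prefix is optional)

# In Python 3, only str has .encode() and only bytes has .decode(),
# so calling .encode() on bytes raises AttributeError immediately.
assert not hasattr(data, 'encode')
assert hasattr(text, 'encode')
assert hasattr(data, 'decode')
```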