Why does this work:
a = 'a'.encode('utf-8')
print unicode(a)
>>> u'a'
And this gives me an error:
b = 'b'.encode('utf-8_sig')
print unicode(b)
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
2 Answers
Because you haven't told unicode what encoding to use:
>>> a = 'a'.encode('utf-8')
>>> print unicode(a)
a
>>> b = 'b'.encode('utf-8_sig')
>>> print unicode(b)
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print unicode(b)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
>>> print unicode(b, 'utf-8_sig')
b
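The session above is Python 2. The same mechanics can be sketched in Python 3 (where decoding is always explicit via bytes.decode(), and failing to name the codec is a TypeError rather than a silent ascii default):

```python
# Python 3 sketch of the same situation: decoding must name a codec.
a = 'a'.encode('utf-8')        # b'a' -- plain ASCII bytes
b = 'b'.encode('utf-8-sig')    # b'\xef\xbb\xbfb' -- BOM prepended

# Decoding with the matching codec strips the BOM again.
print(a.decode('utf-8'))       # -> a
print(b.decode('utf-8-sig'))   # -> b

# Decoding b as plain ascii fails on the first BOM byte, mirroring the
# Python 2 UnicodeDecodeError in the question.
try:
    b.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)
```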
2 Comments
'a'.encode('utf-8') is just 'a', so you don't need to tell unicode how to deal with it, whereas 'b'.encode('utf-8_sig') is '\xef\xbb\xbfb'.
The error 'ascii' codec can't decode byte 0xef says two things: unicode(b) uses the ascii codec (sys.getdefaultencoding()) to decode the bytestring, and the \xef byte is not in the ascii range. It is the first byte of the BOM that the 'utf-8-sig' encoding prepends (used on Windows).
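To make the comment concrete, a small Python 3 sketch shows that the extra bytes are exactly the three-byte UTF-8 byte-order mark, which the stdlib exposes as codecs.BOM_UTF8:

```python
import codecs

encoded = 'b'.encode('utf-8-sig')
print(repr(encoded))                    # b'\xef\xbb\xbfb'

# codecs.BOM_UTF8 is the three-byte UTF-8 byte-order mark,
# and 'utf-8-sig' prepends it on encoding.
assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'
assert encoded.startswith(codecs.BOM_UTF8)
```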
The first example works because the bytestring 'a' is ascii. 'a'.encode('utf-8') is equivalent to 'a'.decode(sys.getdefaultencoding()).encode('utf-8'), and in this case it is equal to 'a' itself.
In general: bytestring.decode(character_encoding) gives a unicode_string, and unicode_string.encode(character_encoding) gives a bytestring. A bytestring is a sequence of bytes; a Unicode string is a sequence of Unicode codepoints.
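The decode/encode round trip can be sketched in Python 3, where str plays the role of Python 2's unicode and bytes plays the role of the bytestring:

```python
# Round trip: Unicode string -> bytes -> Unicode string.
text = 'héllo'                        # str: a sequence of Unicode codepoints
raw = text.encode('utf-8')            # bytes: a sequence of bytes
assert isinstance(raw, bytes)
assert raw.decode('utf-8') == text    # decoding reverses encoding

# The same codepoints can be serialized by different codecs.
assert text.encode('utf-16').decode('utf-16') == text
```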
Do not call .encode() on bytestrings. 'a' is a bytestring literal in Python 2. u'a' is a Unicode literal.
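Python 3 removed this trap entirely: bytes objects have no .encode() method, so the Python 2 implicit ascii-decode-then-encode path cannot occur. A quick sketch:

```python
data = b'a'    # bytes literal (Python 2's 'a')
text = u'a'    # str literal (Python 2's u'a'; the u prefix is optional)

# In Python 3, only str has .encode() and only bytes has .decode(),
# so calling .encode() on bytes raises AttributeError immediately.
assert not hasattr(data, 'encode')
assert hasattr(text, 'encode')
assert hasattr(data, 'decode')
```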