Python string encoding issues

Question 1

Is there a function in Python that is equivalent to prefixing a string with 'u'?

Let's say I have a string:

a = 'C\xc3\xa9dric Roger'

and I want to convert it to:

b = u'C\xc3\xa9dric Roger'

so that I can compare it to other unicode objects. How can I do this? My first instinct was to try:

>>> b = unicode(a)
Traceback (most recent call last):
  File "<string>", line 1, in <fragment>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

But that seems to be trying to decode the string. Is there a function for casting to unicode without doing any kind of decoding? (Is that what the 'u' prefix does or have I misunderstood?)

Answer

You need to specify an encoding:

unicode(a, 'utf8')

or, using str.decode():

a.decode('utf8')

but do pick the right codec for your input; you clearly have UTF-8 data here but that may not always be the case.

To understand what this does, I urge you to read:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
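
A minimal sketch of the decoding step described above. It is written for Python 3 as an assumption (the question uses Python 2, where unicode(a, 'utf8') and a.decode('utf8') do the same job), so the byte string is a bytes literal and the result is a str. The variable names follow the question.

```python
# Python 3 sketch: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
a = b'C\xc3\xa9dric Roger'  # UTF-8 encoded bytes, as in the question

# Decoding with the right codec turns the two bytes C3 A9 into one character.
b = a.decode('utf8')
print(ascii(b))         # 'C\xe9dric Roger'
print(len(a), len(b))   # 13 12

# Decoding as ASCII fails; this is what the bare unicode(a) call attempted.
try:
    a.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc.reason)   # ordinal not in range(128)
```

The length drops from 13 to 12 because the two-byte sequence becomes a single code point.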

Comment by John Greenall, 2013-12-19 17:01:55Z

Sorry if I'm being stupid here but unicode('C\xc3\xa9dric Roger','utf8') doesn't yield u'C\xc3\xa9dric Roger'...

Comment by Martijn Pieters, 2013-12-19 17:04:48Z

@JohnGreenall: No, because you now have a Unicode value; C3 A9 is the UTF-8 encoding for the U+00E9 codepoint in the Unicode standard, a.k.a. LATIN SMALL LETTER E WITH ACUTE. Python will display that as u'\xe9' when representing the unicode string.
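
To make the relationship between the code point and its UTF-8 bytes concrete, here is a small sketch in Python 3 (an assumption; the thread itself is Python 2). It uses only the standard unicodedata module.

```python
import unicodedata

ch = '\xe9'  # the single code point U+00E9

print(unicodedata.name(ch))               # LATIN SMALL LETTER E WITH ACUTE
print(ch.encode('utf8'))                  # b'\xc3\xa9'
print(ascii(b'\xc3\xa9'.decode('utf8')))  # '\xe9'
```

The \xe9 in the last line is an escape for one character, not a byte; the bytes C3 A9 exist only in the encoded form.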

Comment by Martijn Pieters, 2013-12-19 17:05:23Z

@JohnGreenall: Again, please do read the links included in my answer, there are some fundamental concepts you need to understand here.

Comment by bobince, 2013-12-19 17:10:43Z

If you really want to get u'C\xc3\xa9dric Roger' then the encoding would be iso-8859-1, but as Martijn says that seems unlikely to be the right thing, unless the guy's name really is CÃ©dric (I'm glad I'm not called that).
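
A short sketch of what decoding with iso-8859-1 does to these bytes, again assuming Python 3. The name raw is illustrative.

```python
raw = b'C\xc3\xa9dric Roger'  # UTF-8 bytes from the question

# ISO-8859-1 maps every byte to the code point with the same number,
# so nothing is rejected and the two UTF-8 bytes become two characters.
mojibake = raw.decode('iso-8859-1')

print(ascii(mojibake))                     # 'C\xc3\xa9dric Roger'
print(mojibake == 'C\xc3\xa9dric Roger')   # True
print(len(mojibake))                       # 13
```

This produces the value asked for in the question, but the text now contains U+00C3 and U+00A9 rather than U+00E9, which is why it is unlikely to be the right result.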

Comment by Martijn Pieters, 2013-12-19 17:37:28Z

@JohnGreenall: Yes, if the Mongo driver is returning a Unicode value with UTF-8 bytes in it, then that is a bug in that driver, or someone inserted the value that way. You can encode to Latin-1 (which maps the first 256 Unicode codepoints one-to-one to bytes), then decode from UTF-8: mongovalue.encode('latin1').decode('utf8'). That'll 'repair' the value.
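
The repair can be sketched as below, assuming Python 3. The helper name repair_mojibake and the fallback of returning the value unchanged are assumptions added for illustration; the comment itself only gives the encode/decode chain.

```python
def repair_mojibake(value):
    """Hypothetical helper: undo UTF-8 bytes that were decoded as Latin-1."""
    try:
        return value.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8: leave unchanged.
        return value


mongovalue = 'C\xc3\xa9dric Roger'  # stand-in for the value from the driver

print(ascii(repair_mojibake(mongovalue)))   # 'C\xe9dric Roger'
print(ascii(repair_mojibake('C\xe9dric')))  # 'C\xe9dric'
```

The second call shows why the fallback is there: a value that is already correct does not survive the round trip as valid UTF-8, so it is returned as is.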

Martijn Pieters · Accepted Answer · 2013-12-19 16:54:47Z
