Is there a function in python that is equivalent to prefixing a string by 'u'?
Let's say I have a string:
a = 'C\xc3\xa9dric Roger'
and I want to convert it to:
b = u'C\xc3\xa9dric Roger'
so that I can compare it to other unicode objects. How can I do this? My first instinct was to try:
>>>> b = unicode(a)
Traceback (most recent call last):
File "<string>", line 1, in <fragment>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
But that seems to be trying to decode the string. Is there a function for casting to unicode without doing any kind of decoding? (Is that what the 'u' prefix does or have I misunderstood?)
1 Answer 1
You need to specify an encoding:
unicode(a, 'utf8')
or, using str.decode():
a.decode('utf8')
but do pick the right codec for your input; you clearly have UTF-8 data here but that may not always be the case.
To understand what this does, I urge you to read:
12 Comments
u'\xe9' when representing the unicode string.u'C\xc3\xa9dric Roger' then the encoding would be iso-8859-1, but as Martijn says that seems unlikely to be the right thing, unless the guy's name really is Cédric (I'm glad I'm not called that).mongovalue.encode('latin1').decode('utf8'). That'll 'repair' the value.