Bytes in a unicode Python string

In Python 2, a Unicode string may end up containing both real Unicode characters and raw bytes, smuggled in as code points below 256:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in one's own code, but it is a string that I have to deal with.

The bytes in the string above are the UTF-8 encoding of ек (Unicode \u0435\u043a).

My objective is to get a unicode string where everything is a real character, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
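
As a quick check of that claim (nothing beyond the standard codecs is assumed here): Latin-1 maps code points below 256 one-to-one onto bytes, so the smuggled bytes can be recovered and then decoded as UTF-8:

>>> u'\xd0\xb5\xd0\xba'.encode('latin-1')
'\xd0\xb5\xd0\xba'
>>> u'\xd0\xb5\xd0\xba'.encode('latin-1').decode('utf-8')
u'\u0435\u043a'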

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 simply round-trips back to the initial string, bytes and all, which is no good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
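
For comparison, here is a sketch of the same hack without the eval (this is my own rewrite, not from any posted answer): the eval(repr(a)[1:]) step appears to be equivalent to encoding with the raw_unicode_escape codec, which passes code points below U+0100 through as raw bytes and writes higher ones as literal \uXXXX text:

>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> s = a.encode('raw_unicode_escape').decode('utf-8')  # same result as the eval/repr steps above
>>> re.sub(u'\\\\u([0-9a-f]{4})', lambda m: unichr(int(m.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'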

Comments:
  • I guess I do have to make that assumption in order to get the unicode string to behave. I understand that it may break if the assumption fails and the string legitimately holds a code point below 256, but at this point I think that trusting the assumption will do more good than harm. In retrospect, kev's answer does exactly that, but I'd rather accept your answer because it explains why doing this is a bad idea in general. Thanks~ Commented Mar 23, 2012 at 21:22
  • You can isolate the high-order chars (\x80-\xFF) and then try converting them from UTF-8. If this succeeds, it is most probably correct, because normal text is unlikely to contain UTF-8 sequences (î anyone?); otherwise leave them as is (see the sketch after these comments). Commented Mar 23, 2012 at 21:38
  • @thg435 That’s exactly what my easy Perl solution does, but for some reason in Python you have to go through a lot more hassle; see @Kev’s answers and the comments. I’m surprised the accepted answer hasn’t shown exactly how to do it. Commented Mar 23, 2012 at 21:56
  • @tchrist: I posted an example of what I meant; it's more verbose than your Perl snippet, but still concise. Commented Mar 23, 2012 at 22:13
  • > "It just happens that the repr for Unicode strings in [Python] prefers to represent characters with \x escapes where possible" — Indeed, and this seems to be the relevant code (as of today) in the CPython source which decides how to escape characters. Or you can just try something like: for n in range(300): print hex(n), repr(unichr(n)) or (in Python 3) for n in range(900): print(hex(n), repr(chr(n)), ascii(chr(n))). Commented Nov 28, 2016 at 19:10
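
For completeness, a minimal sketch of the heuristic described in the comments above (Python 2; the name recover_utf8 is illustrative, not from any posted answer). It isolates runs of code points in the \x80-\xFF range, recovers the underlying bytes via Latin-1, and re-decodes them as UTF-8, leaving runs that are not valid UTF-8 untouched:

import re

def recover_utf8(s):
    # Runs of code points U+0080..U+00FF are candidate smuggled UTF-8 bytes;
    # anything above U+00FF is already a real character and is left alone.
    def fix(m):
        raw = m.group(0).encode('latin-1')   # back to the original bytes
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return m.group(0)                # not valid UTF-8: leave as is
    return re.sub(u'[\x80-\xff]+', fix, s)

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
print repr(recover_utf8(a))  # u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'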
