Bytes in a unicode Python string

In Python 2, a Unicode string may end up containing both real Unicode characters and raw bytes, smuggled in as code points below 256:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in one's own code, but it is a string that I have to deal with.

The bytes in the string above are the UTF-8 encoding of ек (Unicode \u0435\u043a).

My objective is to get a unicode string where everything is a real character, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
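
As a quick check of that claim (nothing beyond the standard codecs is assumed here): Latin-1 maps code points below 256 one-to-one onto bytes, so the smuggled bytes can be recovered and then decoded as UTF-8:

>>> u'\xd0\xb5\xd0\xba'.encode('latin-1')
'\xd0\xb5\xd0\xba'
>>> u'\xd0\xb5\xd0\xba'.encode('latin-1').decode('utf-8')
u'\u0435\u043a'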

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 simply round-trips back to the initial string, bytes and all, which is no good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
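
For comparison, here is a sketch of the same hack without the eval (this is my own rewrite, not from any posted answer): the eval(repr(a)[1:]) step appears to be equivalent to encoding with the raw_unicode_escape codec, which passes code points below U+0100 through as raw bytes and writes higher ones as literal \uXXXX text:

>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> s = a.encode('raw_unicode_escape').decode('utf-8')  # same result as the eval/repr steps above
>>> re.sub(u'\\\\u([0-9a-f]{4})', lambda m: unichr(int(m.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'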

Comments:
  • I guess I do have to make that assumption in order to get the unicode string to behave. I understand that it may break if the assumption fails and the string legitimately holds a code point below 256, but at this point I think that trusting the assumption will do more good than harm. In retrospect, kev's answer does exactly that, but I'd rather accept your answer because it explains why doing this is a bad idea in general. Thanks~ Commented Mar 23, 2012 at 21:22
  • You can isolate the high-order chars (\x80-\xFF) and then try converting them from UTF-8. If this succeeds, it is most probably correct, because normal text is unlikely to contain UTF-8 sequences (î anyone?); otherwise leave them as is (see the sketch after these comments). Commented Mar 23, 2012 at 21:38
  • @thg435 That’s exactly what my easy Perl solution does, but for some reason in Python you have to go through a lot more hassle; see @Kev’s answers and the comments. I’m surprised the accepted answer hasn’t shown exactly how to do it. Commented Mar 23, 2012 at 21:56
  • @tchrist: I posted an example of what I meant; it's more verbose than your Perl snippet, but still concise. Commented Mar 23, 2012 at 22:13
  • > "It just happens that the repr for Unicode strings in [Python] prefers to represent characters with \x escapes where possible" — Indeed, and this seems to be the relevant code (as of today) in the CPython source which decides how to escape characters. Or you can just try something like: for n in range(300): print hex(n), repr(unichr(n)) or (in Python 3) for n in range(900): print(hex(n), repr(chr(n)), ascii(chr(n))). Commented Nov 28, 2016 at 19:10
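
For completeness, a minimal sketch of the heuristic described in the comments above (Python 2; the name recover_utf8 is illustrative, not from any posted answer). It isolates runs of code points in the \x80-\xFF range, recovers the underlying bytes via Latin-1, and re-decodes them as UTF-8, leaving runs that are not valid UTF-8 untouched:

import re

def recover_utf8(s):
    # Runs of code points U+0080..U+00FF are candidate smuggled UTF-8 bytes;
    # anything above U+00FF is already a real character and is left alone.
    def fix(m):
        raw = m.group(0).encode('latin-1')   # back to the original bytes
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return m.group(0)                # not valid UTF-8: leave as is
    return re.sub(u'[\x80-\xff]+', fix, s)

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
print repr(recover_utf8(a))  # u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'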
