Confusion about python unicode

Asked 12 years, 5 months ago

Viewed 218 times

I wrote a Python file on Windows 7 using Sublime Text. The file contains some Chinese characters, and when I run it those characters are printed as garbage (the same happens in cmd and Git Bash):

# -*- coding: utf-8 -*- 
str = "测试"
print str
arr = []
arr.append(str)
print arr

The result is:

娴嬭瘯
['\xe6\xb5\x8b\xe8\xaf\x95']

How can I solve this problem, and what causes it? Also, shouldn't the printed arr show Unicode escapes like \uXXXX?

By the way, without the # -*- coding: utf-8 -*- line, I can't even run it:

$ python test.py
 File "test.py", line 2
SyntaxError: Non-ASCII character '\xe6' in file test.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

I only found that declaration by googling. Why can't the code run without it?

edited Aug 4, 2013 at 11:09

Codie CodeMonkey

asked Aug 4, 2013 at 10:57

hh54188

What version of Python are you running this on?

– Codie CodeMonkey, commented Aug 4, 2013 at 11:01

@CodieCodeMonkey: 2.7.5

– hh54188, commented Aug 4, 2013 at 11:04

Since Unicode handling is different in Python 3, I'll add Python 2.7 as a tag.

– Codie CodeMonkey, commented Aug 4, 2013 at 11:08

Which encoding is used by your terminal?

– Tim Pietzcker, commented Aug 4, 2013 at 11:12

The output has no \uXXXX escapes because the string isn't a Unicode string but a byte sequence. Prepend a u to "测试" to get a Unicode string.

– Wessie, commented Aug 4, 2013 at 11:13


2 Answers

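The # -*- coding: utf-8 -*- line is needed to tell Python which encoding the source file uses. Without it, Python 2 assumes the source is ASCII (see PEP 263) and rejects any byte outside that range, such as '\xe6'.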
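To make the SyntaxError from the question more concrete, here is a minimal sketch of what goes wrong when the bytes of the source line are read as ASCII. The helper name `decodes_as` is made up for this illustration, and the snippet only imitates the check with `bytes.decode`; it is not how the interpreter itself reads source files.

```python
# UTF-8 bytes of the two characters from the question, written as escapes
# so that this snippet itself stays pure ASCII.
raw = b"\xe6\xb5\x8b\xe8\xaf\x95"


def decodes_as(data, encoding):
    """Hypothetical helper: True if `data` is valid in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


ascii_ok = decodes_as(raw, "ascii")  # 0xE6 is outside the ASCII range
utf8_ok = decodes_as(raw, "utf-8")   # the same bytes are valid UTF-8

print(ascii_ok)
print(utf8_ok)
```

The first byte, 0xE6, is the same one named in the error message, which is why declaring the real encoding makes the file loadable.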

You're getting ['\xe6\xb5\x8b\xe8\xaf\x95'] as output because your string is a byte string, not a Unicode string, and printing a list shows the repr() of each element rather than the characters themselves. Add a u prefix to the literal to make it a Unicode string:

>>> strs = u"测试"
>>> lis = [strs]
>>> print lis
[u'\u6d4b\u8bd5']
>>> print lis[0]
测试
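The garbled 娴嬭瘯 in the question can be reproduced as well. The sketch below assumes the console was using the GBK code page (cp936, the usual default on Chinese-language Windows); the question does not state this, so treat it as an assumption. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u"测试"                 # Unicode string: 2 code points
data = text.encode("utf-8")   # byte string: 6 bytes, what Python 2 stores for "测试"

# Assumption: the console interprets the bytes as GBK instead of UTF-8.
# Six UTF-8 bytes then pair up into three unrelated GBK characters.
garbled = data.decode("gbk")

print(len(text))     # number of characters
print(len(data))     # number of bytes
print(len(garbled))  # number of wrongly decoded characters
```

Here `garbled` is the same three-character string that appeared in the question's output, so the bytes were never corrupted; they were only interpreted with the wrong codec.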

edited Aug 4, 2013 at 11:19

answered Aug 4, 2013 at 11:02

Ashwini Chaudhary


You're seeing the UTF-8-encoded version of your string (which you shouldn't name str, by the way). By adding the # -*- coding: utf-8 -*- line at the start of your script, you're telling Python that that's the encoding your script is using. Are you sure that it is in fact using that encoding?

If that's not the case (check your editor!) or if your terminal window (where you're printing the string) happens to be using a different encoding, you'll get gibberish (or errors if the encoded string can't be interpreted in that encoding).

You only get a Unicode object once you decode your byte string.

So first you need to know your terminal's character encoding. Then you should be converting all strings to Unicode as soon as possible and manipulate only Unicode objects in your program until it's time to output them - at which point you need to encode them to the correct encoding.

For example:

# -*- coding: utf-8 -*- 
s = u"测试"
s = s + u"娴嬭瘯"
print s.encode("somecodepage")
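As a rough sketch of the "decode early, encode late" advice, the snippet below decodes bytes once on input, works only with Unicode, and encodes right before output. The helper names `to_unicode` and `to_output_bytes` are invented for this example, and falling back to UTF-8 when the stream reports no encoding is an assumption rather than a rule. `sys.stdout.encoding` is one way to find out what "somecodepage" should be on a given terminal.

```python
import sys


def to_unicode(raw, encoding="utf-8"):
    """Hypothetical helper: decode incoming bytes once, at the program boundary."""
    return raw.decode(encoding)


def to_output_bytes(text, encoding=None):
    """Hypothetical helper: encode just before output.

    Uses the given encoding, else the terminal's reported encoding,
    else UTF-8 (an assumed fallback).
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes "?" for characters the target encoding lacks
    return text.encode(encoding, "replace")


s = to_unicode(b"\xe6\xb5\x8b\xe8\xaf\x95")  # UTF-8 bytes from the question
s = s + u"!"                                  # manipulate Unicode objects only

out_gbk = to_output_bytes(s, "gbk")      # bytes suitable for a GBK console
out_ascii = to_output_bytes(s, "ascii")  # characters ASCII cannot represent become "?"
```

With this split, the only places that need to know about a concrete encoding are the two boundary helpers; everything in between sees Unicode.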

edited Aug 4, 2013 at 11:18

answered Aug 4, 2013 at 11:02

Tim Pietzcker
