
I have written a Python file on Windows 7 using Sublime Text. The file contains some Chinese characters, and when I run it they come out garbled (the same happens in cmd and Git Bash):

# -*- coding: utf-8 -*- 
str = "测试"
print str
arr = []
arr.append(str)
print arr

The result is:

娴嬭瘯
['\xe6\xb5\x8b\xe8\xaf\x95']

How can I solve this problem, and what causes it? Also, shouldn't the printed list show the string as Unicode escapes like \uXXXX?

By the way, without the # -*- coding: utf-8 -*- line I can't even run it:

$ python test.py
 File "test.py", line 2
SyntaxError: Non-ASCII character '\xe6' in file test.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

I only found that declaration by googling. Why can't the code run without it?

Codie CodeMonkey
asked Aug 4, 2013 at 10:57
  • What version of Python are you running this on? Commented Aug 4, 2013 at 11:01
  • @CodieCodeMonkey: 2.7.5 Commented Aug 4, 2013 at 11:04
  • Since unicode handling is different in 3, I'll add Python 2.7 as a tag. Commented Aug 4, 2013 at 11:08
  • Which encoding is used by your terminal? Commented Aug 4, 2013 at 11:12
  • The lack of a \uXXXX escape is because it isn't a unicode string, but a byte sequence. Prepend a u to "测试" to get a unicode string. Commented Aug 4, 2013 at 11:13

2 Answers


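The # -*- coding: utf-8 -*- line is needed to tell Python which encoding the source file uses. Python 2 assumes ASCII source by default, so without the declaration any non-ASCII byte in the file raises the SyntaxError you saw (see PEP 263).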

You're getting ['\xe6\xb5\x8b\xe8\xaf\x95'] as output because your string is a byte string, not a unicode string, and printing a list shows the repr of each element. Add a u prefix to the literal to make it a unicode string.

>>> strs = u"测试"
>>> lis = [strs]
>>> print lis
[u'\u6d4b\u8bd5']
>>> print lis[0]
测试
answered Aug 4, 2013 at 11:02

You're seeing the UTF-8-encoded version of your string (which you shouldn't name str, by the way). By adding the # -*- coding: utf-8 -*- line at the start of your script, you're telling Python that that's the encoding your script is using. Are you sure that it is in fact using that encoding?

If that's not the case (check your editor!) or if your terminal window (where you're printing the string) happens to be using a different encoding, you'll get gibberish (or errors if the encoded string can't be interpreted in that encoding).
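The mismatch described above can be reproduced without a terminal. The following is a minimal sketch, assuming the console uses GBK (code page 936), which is common on Chinese-language Windows but is not stated in the question. It takes the UTF-8 bytes of the string and decodes them as GBK, which is effectively what the console does when it displays them. The `__future__` imports are only there so the same file runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: "gbk" stands in for the console code page, which is an assumption.
from __future__ import print_function, unicode_literals

text = "测试"                        # a unicode string: two code points
utf8_bytes = text.encode("utf-8")    # the six bytes stored in a UTF-8 source file
garbled = utf8_bytes.decode("gbk")   # the same bytes read as three GBK characters

print(repr(utf8_bytes))              # the escaped bytes shown in the list output
print(len(text), len(utf8_bytes), len(garbled))
```

Six UTF-8 bytes pair up into three two-byte GBK characters, which is consistent with the question showing three garbled characters where two were expected.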

You only get a Unicode object once you decode your byte string.

So first you need to know your terminal's character encoding. Then you should be converting all strings to Unicode as soon as possible and manipulate only Unicode objects in your program until it's time to output them - at which point you need to encode them to the correct encoding.

For example

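# -*- coding: utf-8 -*- 
s = u"测试"           # unicode literal, not a byte string
s = s + u"娴嬭瘯"     # manipulate unicode objects only
# Encode only on output; "somecodepage" is a placeholder for your terminal's encoding.
print s.encode("somecodepage")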
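The "decode early, encode late" advice can be sketched a little more fully. The helpers `to_unicode` and `to_output_bytes` below are hypothetical names introduced for illustration, and the `"gbk"` target is an assumption about the terminal; the fallback to `sys.stdout.encoding` is one way to avoid hard-coding a code page, not the only one. The sketch runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import sys


def to_unicode(raw, encoding="utf-8"):
    """Decode bytes once, at the input boundary; pass unicode through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def to_output_bytes(text, encoding=None):
    """Encode at the last moment, preferring the console encoding when known."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes "?" for characters the target encoding cannot represent.
    return text.encode(encoding, "replace")


s = to_unicode(b"\xe6\xb5\x8b\xe8\xaf\x95")  # UTF-8 bytes of the two characters
s = s + "!"                                  # work only with unicode in between
encoded = to_output_bytes(s, "gbk")          # assumed terminal encoding

print(repr(encoded))
```

On Python 2, printing a unicode object directly also encodes it with `sys.stdout.encoding` when output goes to a console, so the explicit encode step matters most when output is piped or redirected.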
answered Aug 4, 2013 at 11:02
