I am using Python 2.6.6
item = {u'snippet': {u'title': u'How to Pronounce Canap\xe9'}}
title = item['snippet']['title']
print title
Result:
How to Pronounce Canapé
Desired result:
How to Pronounce Canapé
This looks like a Unicode issue. I tried encoding and decoding with UTF-8, but the result is still the same. Any ideas?
-
That code sample works fine in my terminal. I have to assume this is an issue with your OS or terminal. What OS/terminal software are you using? (Ben Echols, Mar 19, 2014 at 4:21)
-
How are you running this code? (Burhan Khalid, Mar 19, 2014 at 4:29)
-
@BenEchols, the OS is CentOS 6.4 and the terminal is SecureCRT 4.0. (davidjhp, Mar 19, 2014 at 4:31)
-
@BurhanKhalid, on the command line I type python, which puts me into the Python shell. (davidjhp, Mar 19, 2014 at 4:32)
-
Check the encoding of your SecureCRT session and make sure it's UTF-8 and not latin-1 or similar. (Burhan Khalid, Mar 19, 2014 at 4:35)
5 Answers
Your terminal expects UTF-8:
$ locale charmap
UTF-8
Python prints using UTF-8:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
Change the SecureCRT session settings so that it accepts UTF-8.
Comments
This is quite possibly due to mismatch of the default encoding that Python is using versus the console's encoding. It looks like Python is assuming that the encoding is UTF-8 but then the console is interpreting that as latin-1.
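The mismatch described above can be reproduced without a terminal. The following is a minimal sketch, assuming Python writes UTF-8 and the terminal decodes as latin-1, as this answer guesses; the variable names are only illustrative.

```python
# The title from the question, as a unicode string.
title = u'How to Pronounce Canap\xe9'

# What Python sends when sys.stdout.encoding is UTF-8:
# the single character e-acute becomes two bytes.
utf8_bytes = title.encode('utf-8')

# What a terminal configured for latin-1 would display:
# each of those two bytes is rendered as its own character.
misread = utf8_bytes.decode('latin-1')

print(repr(utf8_bytes))
print(repr(misread))
```

If the garbled text on screen matches `misread`, the bytes Python emits are fine and only the terminal's interpretation needs to change.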
Comments
Instead of \xe9, use \u00e9 if possible. Then pick an appropriate encoding when outputting the unicode string:
print title.encode('latin1')
What encoding is sensible depends on where you are outputting to. Generally, you have to infer it from the environment variables, or maybe let your users make a choice in a configuration file.
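One way to infer the encoding from the environment is sketched below. The helper names `pick_output_encoding` and `encode_for_output` are hypothetical, and the fallback order (stream encoding, then locale, then UTF-8) is an assumption, not something this answer prescribes.

```python
import locale
import sys


def pick_output_encoding(stream=None, fallback='utf-8'):
    """Guess an output encoding: stream first, then locale, then fallback."""
    stream = sys.stdout if stream is None else stream
    # stream.encoding can be missing or None, e.g. when output is piped.
    stream_encoding = getattr(stream, 'encoding', None)
    return stream_encoding or locale.getpreferredencoding() or fallback


def encode_for_output(text, encoding=None):
    """Encode text, replacing characters the target encoding cannot represent."""
    return text.encode(encoding or pick_output_encoding(), 'replace')


# An explicit choice, such as a value read from a configuration file, wins.
latin1_bytes = encode_for_output(u'Canap\xe9', 'latin-1')
ascii_bytes = encode_for_output(u'Canap\xe9', 'ascii')
```

Passing `'replace'` trades fidelity for robustness: an unrepresentable character becomes `?` instead of raising `UnicodeEncodeError`.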
PS: If you deal with Unicode strings a lot, I'd recommend switching to Python 3 (e.g. 3.3), if at all possible. Unicode handling is a lot clearer, more explicit, and saner there.
5 Comments
-
'latin1' might be the correct encoding in your case.
-
u'\xe9' == u'\u00e9', so changing it won't help. Instead of .encode('latin1'), change SecureCRT to match the terminal settings on CentOS.
-
If sys.stdout.encoding is correct (it matches $LC_CTYPE, $LANG), then using Python 3 won't help.
-
PYTHONIOENCODING.
-
Can you update your question with the output of print(repr(open("your_output_file", "rb").read()))?
I am getting your expected output in my terminal (using Python 2.7.7). The format you are expecting depends on the encoding set in the terminal. For me, it is set to 'cp437':
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> sys.stdout.encoding
'cp437'
You can verify that you are getting the correct output by running:
print title.encode('cp437')
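To see why the terminal's encoding matters, it may help to compare the bytes that the same character produces under each encoding mentioned in this thread. This is only an illustrative sketch; `encoded_forms` is a made-up name.

```python
title = u'Canap\xe9'

# Encode the same text with each encoding discussed in this thread.
encoded_forms = dict(
    (name, title.encode(name)) for name in ('utf-8', 'latin-1', 'cp437')
)

for name in sorted(encoded_forms):
    print('%-8s %r' % (name, encoded_forms[name]))
```

The final byte (or bytes) differs in every case, so the terminal shows the accented character only when it decodes with the same encoding that Python used to encode.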
Comments
Set your default encoding to iso-8859-1 in your sitecustomize.py file in ${pythondir}/lib/site-packages/ as follows:
import sys
sys.setdefaultencoding('iso-8859-1')
For me, it worked with \xe9.
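A narrower alternative, which is not what this answer does, is to leave the interpreter-wide default alone and encode at the stream boundary with `codecs.getwriter`. The sketch below writes into an in-memory buffer so the resulting bytes can be inspected; `encode_through_writer` is a hypothetical helper name.

```python
import codecs
import io


def encode_through_writer(text, encoding):
    """Write text through an encoding wrapper and return the raw bytes."""
    raw = io.BytesIO()
    # The writer encodes every unicode string before it reaches the byte stream.
    writer = codecs.getwriter(encoding)(raw)
    writer.write(text)
    return raw.getvalue()


latin1_bytes = encode_through_writer(u'Canap\xe9', 'iso-8859-1')
utf8_bytes = encode_through_writer(u'Canap\xe9', 'utf-8')
print(repr(latin1_bytes))
print(repr(utf8_bytes))
```

In Python 2 the same wrapper can be placed around `sys.stdout`, which confines the encoding choice to output and avoids changing how every implicit str/unicode conversion in the process behaves.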