Python: how to handle Unicode text

I am using Python 2.6.6

item = {u'snippet': {u'title': u'How to Pronounce Canap\xe9'}}
title = item['snippet']['title']
print title

Result:

How to Pronounce CanapÃ©

Desired result:

How to Pronounce Canapé

This looks like a Unicode issue. I tried encoding to and decoding from UTF-8, but the result is still the same. Any ideas?

asked Mar 19, 2014 at 4:18

davidjhp

That code sample works fine in my terminal. I have to assume this is an issue with your OS or terminal. What OS/Terminal software are you using?

– Ben Echols, Mar 19, 2014 at 4:21

How are you running this code?

– Burhan Khalid, Mar 19, 2014 at 4:29

@BenEchols, OS is CentOS 6.4, Terminal is SecureCRT 4.0

– davidjhp, Mar 19, 2014 at 4:31

@BurhanKhalid, on the command line I type python, which puts me into the Python shell.

– davidjhp, Mar 19, 2014 at 4:32

Check the encoding of your SecureCRT session and make sure it's UTF-8 and not latin-1 or similar.

– Burhan Khalid, Mar 19, 2014 at 4:35

5 Answers

Your terminal expects UTF-8:

$ locale charmap
UTF-8

Python prints using UTF-8:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'

Change the SecureCRT settings so that it uses UTF-8 as well.

answered Mar 19, 2014 at 5:29

jfs

This is quite possibly due to a mismatch between the encoding Python uses for output and the console's encoding. It looks like Python is writing UTF-8, but the console is interpreting those bytes as latin-1.
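The symptom in the question can be reproduced without a terminal. The following is a minimal sketch, written for Python 3 rather than the asker's Python 2.6, and it assumes that the terminal decodes Python's UTF-8 output as latin-1. That assumption is inferred from the output shown in the question, not confirmed by the asker.

```python
# Python 3 sketch of the suspected encoding mismatch.
title = "How to Pronounce Canap\xe9"

# What Python writes when stdout is UTF-8: é becomes the two bytes C3 A9.
utf8_bytes = title.encode("utf-8")

# What a terminal set to latin-1 would show for those bytes (assumed setting).
seen_by_latin1_terminal = utf8_bytes.decode("latin-1")

# What a terminal set to UTF-8 would show for the same bytes.
seen_by_utf8_terminal = utf8_bytes.decode("utf-8")

# ascii() keeps this output readable even on a misconfigured terminal.
print(ascii(seen_by_latin1_terminal))
print(ascii(seen_by_utf8_terminal))
```

The latin-1 reading ends in the two characters U+00C3 and U+00A9, which render as "Ã©", matching the garbled result in the question.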

answered Mar 19, 2014 at 4:21

metatoaster

Instead of \xe9, use \u00e9 if possible. Then pick an appropriate encoding when outputting the unicode string:

print title.encode('latin1')

What encoding is sensible depends on where you are outputting to. Generally, you have to infer it from the environment variables, or maybe let your users make a choice in a configuration file.
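One way to infer the encoding from the environment is sketched below. It is written for Python 3, and `pick_output_encoding` and its `override` parameter are hypothetical names, not part of any library. The override stands in for a value a user might set in a configuration file.

```python
import locale
import sys


def pick_output_encoding(stream=None, override=None):
    """Hypothetical helper: choose an encoding for text written to `stream`."""
    # An explicit user choice (e.g. from a configuration file) wins.
    if override:
        return override
    stream = sys.stdout if stream is None else stream
    # Text streams usually carry an encoding; it can be None or missing.
    encoding = getattr(stream, "encoding", None)
    if encoding:
        return encoding
    # Fall back to the locale, which on POSIX comes from LC_ALL, LC_CTYPE, or LANG.
    return locale.getpreferredencoding(False) or "utf-8"


title = "How to Pronounce Canap\xe9"
encoding = pick_output_encoding()
# errors="replace" avoids a UnicodeEncodeError if the target cannot represent é.
data = title.encode(encoding, errors="replace")
```

Passing `override="latin-1"` would force the encoding suggested in this answer, while leaving it unset defers to whatever the environment reports.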

PS: If you deal with Unicode strings a lot, I'd recommend switching to Python 3 (e.g. 3.3), if at all possible. Unicode handling is much clearer and more explicit there.

edited Mar 19, 2014 at 4:45

answered Mar 19, 2014 at 4:23

Christian Aichinger

5 Comments

I am not able to change \xe9 to \u00e9; the \xe9 is raw data from the YouTube API.

– davidjhp, Mar 19, 2014 at 4:37

OK, that shouldn't matter for Python 2.7. From the output you've shown, I think 'latin1' might be the correct encoding in your case.

– Christian Aichinger, Mar 19, 2014 at 4:45

@ChristianAichinger: u'\xe9' == u'\u00e9', so changing it won't help. Instead of using .encode('latin1'), change SecureCRT to match the terminal settings on CentOS. If sys.stdout.encoding is correct (it matches $LC_CTYPE, $LANG), then using Python 3 won't help either.

– jfs, Mar 19, 2014 at 5:19
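The equivalence of the two escapes is easy to check. This small sketch uses Python 3, where plain string literals behave like the u'' literals of Python 2.

```python
# Both escapes denote the same code point, U+00E9.
short_escape = "\xe9"
long_escape = "\u00e9"

print(short_escape == long_escape)  # True
print(hex(ord(short_escape)))       # 0xe9
```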

@J.F.Sebastian, I am getting the same error when I write the values to a file on the file system. Would that indicate the problem is not SecureCRT?

– davidjhp, Mar 19, 2014 at 5:27

@davidjhp: Writing to a file is different from writing to a terminal. If the output is redirected to a file, you can control the stdout encoding using PYTHONIOENCODING. Could you update your question with the output of print(repr(open("your_output_file", "rb").read()))?

– jfs, Mar 19, 2014 at 5:41
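The file check suggested in this comment could look roughly like the sketch below. It is written for Python 3, writes to a temporary file whose name is made up for illustration, and passes the encoding explicitly instead of relying on PYTHONIOENCODING or locale defaults.

```python
import io
import os
import tempfile

title = "How to Pronounce Canap\xe9"

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "title.txt")

# Write text with an explicit encoding so no default is involved.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(title)

# Read the raw bytes back to see exactly what is stored on disk.
with open(path, "rb") as f:
    raw = f.read()

print(repr(raw))
```

Correct UTF-8 ends in the bytes \xc3\xa9. If the file instead ended in \xc3\x83\xc2\xa9, the text would have been encoded twice, which would point at the program and not at the terminal.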

I am getting your expected output in my terminal (using Python 2.7.7). The output you see depends on the encoding set in the terminal. For me, it is set to 'cp437'.

>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> sys.stdout.encoding
'cp437'

You can verify that you are getting the correct output by running:

print title.encode('cp437')
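To see why the explicit codec matters, the following Python 3 sketch prints the bytes that each of the encodings mentioned in this thread produces for the final é. The list of codecs is chosen only for illustration.

```python
title = "How to Pronounce Canap\xe9"
prefix_length = len("How to Pronounce Canap")

for codec in ("cp437", "latin-1", "utf-8"):
    encoded = title.encode(codec)
    # Everything before é is ASCII, so the remaining bytes are é alone.
    print(codec, repr(encoded[prefix_length:]))
```

The same character becomes a different byte sequence under each codec, so the output only looks right when the terminal decodes with the codec that was used to encode.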

answered Mar 19, 2014 at 4:35

venpa

Set your default encoding to iso-8859-1 in the sitecustomize.py file in ${pythondir}/lib/site-packages/ as follows:

import sys
sys.setdefaultencoding('iso-8859-1')

For me, it worked with \xe9.

answered Mar 19, 2014 at 4:59

c0d3

2 Comments

AttributeError: 'module' object has no attribute 'setdefaultencodi

– davidjhp, Mar 19, 2014 at 5:19

@davidjhp: don't do it. Changing sys.getdefaultencoding() from 'ascii' might break other Python scripts on your system in a subtle way.

– jfs, Mar 19, 2014 at 5:22
