Encoding Cyrillic in Python

Asked 9 years, 9 months ago

Viewed 2k times


After fetching an XML document (with Cyrillic symbols) from the site, I parse it into a list of dictionaries using this code:

from google.appengine.api import urlfetch
import xml.etree.ElementTree as ET

def get_class_list():
    url = 'http://samplehtml.com'
    result = urlfetch.fetch(url=url, method=urlfetch.POST)
    root = ET.fromstring(result.content)
    class_list = []
    for child in root[0]:
        class_voc = {}
        class_voc['classid'] = child.find('classid').text
        class_voc['classname'] = child.find('classname').text.encode("utf-8")
        class_list.append(class_voc)
    return class_list

I print it to a web page (I use Google App Engine) and get this:

[{'classid': '75242', 'classname': '1\xc3\xa0'},
 {'classid': '75244', 'classname': '1\xc3\xa1'},
 {'classid': '75246', 'classname': '1\xc3\xa2'},
 {'classid': '75243', 'classname': '2\xc3\xa0'}]

I tried to encode it to utf-8 but there was no result. How can I decode it to get regular letters?

Okay, the problem is that App Engine doesn't support the call unicode('8в', 'latin1'), which I could otherwise use to compare the values I need to compare. What I can get is either the Unicode representation of the characters ('classname': u'1\xe0') or some odd accented characters, for example 1á. I need both to compare the values and to write them to the web page.
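The escaped bytes in the printed output can be inspected to see what they actually contain. The sketch below is an illustration only: it assumes the site really sends cp1251, which the question does not confirm, and the variable names are made up for the example.

```python
import unicodedata

# The 'classname' bytes of the first entry in the printed output.
printed = b'1\xc3\xa0'

# Undo the .encode("utf-8") step to get back the string the parser produced.
text = printed.decode('utf-8')
name_seen = unicodedata.name(text[1])

# Assumption: the server sends cp1251, where byte 0xE0 is a Cyrillic letter.
name_cp1251 = unicodedata.name(b'\xe0'.decode('cp1251'))

print(name_seen)
print(name_cp1251)
```

If the two names differ, the bytes were valid UTF-8 for the wrong character, so the damage happened before the encode step and encoding again cannot repair it.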


edited Mar 16, 2016 at 22:17

asked Mar 15, 2016 at 20:07

Kirill

What happens if you remove the .encode("utf-8") from your code sample? I suspect that urlfetch returns encoded text so that doing another encoding just messes things up.

– minou
Commented Mar 16, 2016 at 12:07

@Kekito Well, yes, you're right. After removing .encode("utf-8") I got {'classid': '75242', 'classname': u'1\xe0'}, {'classid': '75244', 'classname': u'1\xe1'}, so I suppose it's a unicode representation of the string. Every decoding I've tried ends with an error.

– Kirill
Commented Mar 16, 2016 at 20:28

It looks like you want a decode("cp1251") somewhere in your code (perhaps on result.content before it is passed to ET.fromstring?). The escaped code points you're seeing in your comment correspond to the letters at the start of the alphabet (а and б) in that encoding, and the letter you mention in the text, в, is the next one, \xe2.

– Blckknght
Commented Mar 16, 2016 at 22:42
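
One way to apply cp1251 at parse time is to hand the raw bytes to the parser and override the encoding it uses. This is a minimal sketch, not something confirmed in this thread: it assumes the response body is really cp1251 with a wrong encoding declaration, and SAMPLE_XML and parse_class_list are hypothetical names with a made-up payload shaped like the question's code expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: root[0] holds the class entries, as in the question.
# It is encoded as cp1251 but labelled ISO-8859-1, which is only an assumption
# about why Latin letters showed up instead of Cyrillic ones.
SAMPLE_XML = (
    u'<?xml version="1.0" encoding="ISO-8859-1"?>'
    u'<root><classes>'
    u'<class><classid>75242</classid><classname>1\u0430</classname></class>'
    u'<class><classid>75244</classid><classname>1\u0431</classname></class>'
    u'</classes></root>'
).encode('cp1251')


def parse_class_list(xml_bytes, encoding='cp1251'):
    """Parse raw bytes, overriding whatever encoding the XML declares."""
    parser = ET.XMLParser(encoding=encoding)
    root = ET.fromstring(xml_bytes, parser=parser)
    class_list = []
    for child in root[0]:
        class_list.append({
            'classid': child.find('classid').text,
            # Keep the name as unicode; encode only when writing the response.
            'classname': child.find('classname').text,
        })
    return class_list


classes = parse_class_list(SAMPLE_XML)
```

The encoding argument of XMLParser takes precedence over the declaration inside the document, so the bytes never have to be decoded by hand before parsing.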

@Blckknght Well, I did it this way:

    result = urlfetch.fetch(url=url, method=urlfetch.POST)
    content = result.content.decode("cp1251")
    root = ET.fromstring(content)

and got this error:

    'ascii' codec can't decode byte 0xe0 in position 97: ordinal not in range(128)

And yes, I know those symbols are different from the one I wanted to check. I posted them as an example because they were the first ones.

– Kirill
Commented Mar 18, 2016 at 19:50

It sounds to me like the content value is already Unicode then, but incorrectly decoded. I'm not sure how you can tell urlfetch to decode it differently though.

– Blckknght
Commented Mar 18, 2016 at 21:32
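
If the text has indeed been decoded with the wrong codec, it can often be repaired after the fact by reversing the wrong decode and applying the right one. This is a minimal sketch under two unconfirmed assumptions: the wrong codec was Latin-1 and the right one is cp1251. fix_mojibake is a hypothetical helper name.

```python
def fix_mojibake(text, wrong='latin-1', right='cp1251'):
    """Recover the original bytes from a wrong decode, then decode them again."""
    return text.encode(wrong).decode(right)


# u'1\xe0' is the value reported in the comments above.
fixed = fix_mojibake(u'1\xe0')

# Compare values as unicode strings (u'\u0432' is the Cyrillic letter "ve").
matches = fix_mojibake(u'8\xe2') == u'8\u0432'

# Encode to UTF-8 only at the point where the page is written.
page_bytes = fixed.encode('utf-8')
```

Latin-1 maps every byte to the code point with the same number, so the round trip is lossless for the Cyrillic letter range. If the wrong codec was cp1252 instead, bytes 0x80 to 0x9F would need different handling.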
