After fetching an XML document (with Cyrillic symbols) from a site, I parse it into a list of dictionaries using this code:

from google.appengine.api import urlfetch
import xml.etree.ElementTree as ET

def get_class_list():
    url = 'http://samplehtml.com'
    result = urlfetch.fetch(url=url, method=urlfetch.POST)
    root = ET.fromstring(result.content)
    class_list = []
    # Each child of the first element describes one class
    for child in root[0]:
        class_voc = {}
        class_voc['classid'] = child.find('classid').text
        class_voc['classname'] = child.find('classname').text.encode("utf-8")
        class_list.append(class_voc)
    return class_list

I print it to a web page (I use Google App Engine) and get this:

[{'classid': '75242', 'classname': '1\xc3\xa0'},
 {'classid': '75244', 'classname': '1\xc3\xa1'},
 {'classid': '75246', 'classname': '1\xc3\xa2'},
 {'classid': '75243', 'classname': '2\xc3\xa0'}]

I tried to encode it to UTF-8, but that did not help. How can I decode it to get regular letters?
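
For reference, the escapes in that output can be checked in a few lines. This is only a sketch, written in Python 3 syntax rather than the Python 2 that the code above targets, and the variable names are made up for illustration. It shows that '1\xc3\xa0' is simply the UTF-8 byte form of u'1\xe0', so the .encode("utf-8") call does what it is meant to do and the unexpected letter is already present before encoding:

```python
import unicodedata

# One 'classname' value exactly as it appears in the printed list (raw bytes).
printed = b"1\xc3\xa0"

# Undo the .encode("utf-8") call to see the text the XML parser produced.
decoded = printed.decode("utf-8")

# Look up the official Unicode name of the second character.
name = unicodedata.name(decoded[1])

print(repr(decoded), name)
```

The second character turns out to be a Latin letter, not a Cyrillic one, which suggests the mix-up happens while the XML is decoded rather than while it is encoded for output.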

Okay, the problem is that App Engine doesn't support the call unicode('8в','latin1'), which I could use to compare the values I need to compare. At the moment I get either the Unicode representation of the characters, such as 'classname': u'1\xe0', or some Arabic-looking characters. So I need both to compare the values and to write them to a web page.

asked Mar 15, 2016 at 20:07
  • What happens if you remove the .encode("utf-8") from your code sample? I suspect that urlfetch returns encoded text so that doing another encoding just messes things up. Commented Mar 16, 2016 at 12:07
  • @Kekito Well, yes, you're right. After removing .encode("utf-8") I got this: {'classid': '75242', 'classname': u'1\xe0'}, {'classid': '75244', 'classname': u'1\xe1'}, so I suppose it's a Unicode representation of the string. Every decoding I've tried ends with an error. Commented Mar 16, 2016 at 20:28
  • It looks like you want a decode("cp1251") somewhere in your code (perhaps on result.content before it is passed to ET.fromstring?). The escaped codepoints you're seeing in your comment correspond to the letters at the start of the alphabet (а and б) in that encoding, and the letter you mention in the text в is the next one, \xe2. Commented Mar 16, 2016 at 22:42
  • @Blckknght Well, I did it this way: result = urlfetch.fetch(url=url, method = urlfetch.POST); content = result.content.decode("cp1251"); root = ET.fromstring(content), and got this error: 'ascii' codec can't decode byte 0xe0 in position 97: ordinal not in range(128). And yes, I know those symbols are different from the one I wanted to check. I posted them as an example because they were the first ones. Commented Mar 18, 2016 at 19:50
  • It sounds to me like the content value is already Unicode then, but incorrectly decoded. I'm not sure how you can tell urlfetch to decode it differently though. Commented Mar 18, 2016 at 21:32
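
If the text really was decoded with the wrong single-byte codec, as the comments above suggest, one common repair is to reverse the wrong decoding and redo it with the right one. Below is a minimal sketch in Python 3 syntax. It assumes the parser treated cp1251 bytes as Latin-1, which is a guess based on the values shown in the comments and is not confirmed anywhere in the thread. `repair_mojibake` is a hypothetical helper, not part of urlfetch or ElementTree.

```python
def repair_mojibake(text, wrong="latin-1", right="cp1251"):
    """Re-encode text with the codec that was wrongly applied,
    then decode the recovered bytes with the intended codec."""
    return text.encode(wrong).decode(right)

# Values as reported in the comments once .encode("utf-8") was removed.
names = ["1\xe0", "1\xe1", "1\xe2"]

# Cyrillic text, safe to compare against other Unicode strings.
repaired = [repair_mojibake(n) for n in names]

# Encode only at the very end, when writing bytes to the web page.
page_bytes = [n.encode("utf-8") for n in repaired]

print(repaired)
```

With this approach the comparison happens on Unicode strings and UTF-8 encoding is left to the moment the response is written. In Python 2 the same round trip would apply to the unicode objects returned by .text.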
