After fetching an XML document (with Cyrillic symbols) from the site, I parse it into a dictionary using this code:
    from google.appengine.api import urlfetch
    import xml.etree.ElementTree as ET

    def get_class_list():
        url = 'http://samplehtml.com'
        result = urlfetch.fetch(url=url, method=urlfetch.POST)
        root = ET.fromstring(result.content)
        class_list = []
        for child in root[0]:
            class_voc = {}
            class_voc['classid'] = child.find('classid').text
            class_voc['classname'] = child.find('classname').text.encode("utf-8")
            class_list.append(class_voc)
        return class_list
I print it to a web page (I use Google App Engine) and get this:
[{'classid': '75242', 'classname': '1\xc3\xa0'},
{'classid': '75244', 'classname': '1\xc3\xa1'},
{'classid': '75246', 'classname': '1\xc3\xa2'},
{'classid': '75243', 'classname': '2\xc3\xa0'}]
I tried to encode it to UTF-8, but that did not help. How can I decode it to get regular letters?
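For what it's worth, the '\xc3\xa0' in the output above is simply the UTF-8 byte sequence for the single character U+00E0, shown in escaped form because printing a list displays the repr of each string. A small sketch of that relationship (the variable names are only illustrative):

```python
# -*- coding: utf-8 -*-
# The parser returned the text "1" followed by the character U+00E0.
name = u"1\xe0"

# Encoding to UTF-8 turns U+00E0 into the two bytes 0xC3 0xA0,
# which is exactly what appears in the printed list.
encoded = name.encode("utf-8")

# Decoding the bytes again gives back the same Unicode text,
# so no information is lost by the encode step itself.
round_trip = encoded.decode("utf-8")
```

So the encode call is not corrupting anything; the character that reaches it is already U+00E0 rather than a Cyrillic letter.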
Okay, the problem is that App Engine doesn't support the call unicode('8в','latin1'), which I could otherwise use to compare the values I need to compare. As it stands, I get either the Unicode representation of the characters ('classname': u'1\xe0') or strange-looking characters such as 1á. I need both to compare the values and to write them to the web page.
[...] .encode("utf-8") from your code sample? I suspect that urlfetch returns encoded text, so that doing another encoding just messes things up.

[...] .encode("utf-8") I got this:
{'classid': '75242', 'classname': u'1\xe0'}, {'classid': '75244', 'classname': u'1\xe1'}
so I suppose it's a Unicode representation of the string. All the decoding I've tried ends with an error.

[...] .decode("cp1251") somewhere in your code (perhaps on result.content before it is passed to ET.fromstring?). The escaped codepoints you're seeing in your comment correspond to the letters at the start of the alphabet (а and б) in that encoding, and the letter you mention in the text, в, is the next one, \xe2.

[...]
    result = urlfetch.fetch(url=url, method=urlfetch.POST)
    content = result.content.decode("cp1251")
    root = ET.fromstring(content)
and got this error: 'ascii' codec can't decode byte 0xe0 in position 97: ordinal not in range(128). And yes, I know those symbols are different from the one I wanted to check. I posted them as an example because they were the first [...]

[...] content value is already Unicode then, but incorrectly decoded. I'm not sure how you can tell urlfetch to decode it differently though.
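One way to act on the cp1251 suggestion without changing how urlfetch or the parser behaves is to repair the text after parsing. This is only a sketch, and it rests on an assumption: that the document's bytes were Windows-1251 but were decoded as Latin-1, which would explain seeing u'1\xe0' where 1а was expected. Under that assumption, encoding the text back to Latin-1 recovers the original bytes, and decoding those as cp1251 gives the Cyrillic letters. The helper name redecode_cp1251 is made up for illustration, and this has not been tried on App Engine itself.

```python
# -*- coding: utf-8 -*-

def redecode_cp1251(text):
    """Hypothetical helper: undo a Latin-1 mis-decoding of cp1251 bytes.

    Assumes `text` is a Unicode string whose characters are all in the
    Latin-1 range (U+0000 to U+00FF); otherwise encode() raises an error.
    """
    raw_bytes = text.encode("latin-1")   # recover the original byte values
    return raw_bytes.decode("cp1251")    # reinterpret them as Windows-1251

# Values taken from the output quoted in the comments above.
fixed_a = redecode_cp1251(u"1\xe0")
fixed_b = redecode_cp1251(u"1\xe1")

# Unicode-to-Unicode comparison now works against Cyrillic literals
# (written here as escapes: U+0430 is "а", U+0431 is "б").
is_class_1a = fixed_a == u"1\u0430"

# For the web page, encode the repaired text once, at output time.
page_bytes = fixed_a.encode("utf-8")
```

Keeping the values as Unicode until the moment they are written out means the same string can be used both for comparison and for display, which seems to be what the question needs.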