Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python
I am reading an Excel XML document using Python. I end up with a lot of character references such as &#233; that represent various accented letters (and the like). Is there an easy way to convert these to the actual characters, encoded as UTF-8?
-
You'll need to give more details. Usually it is relatively easy to encode and decode in Python, provided you understand what is going on. – Martijn Pieters, Dec 18, 2012 at 8:11
-
In particular, are you using Python 2 or 3, do you have byte strings or Unicode strings, and if byte strings, what character set are they in? (It may also help to know which module you're using to read/parse the document.) – abarnert, Dec 18, 2012 at 8:13
-
Thanks Martijn for the quick response. I think the main problem I am facing is that I don't know what encoding this is. I get the sense that it's not really an "encoding", but rather something specific to XML. In terms of more info, I don't really have any. I have a list of names with "encodings" such as the one above all over the place. The names are from various countries, hence the various accented characters. – Neil Aggarwal, Dec 18, 2012 at 8:13
-
I'm using Python 2. The string comes in as bytes (it is from an Excel XML file), and I convert it to unicode using .decode("utf-8"); the character set is UTF-8. – Neil Aggarwal, Dec 18, 2012 at 8:16
-
OK, so you have properly decoded Unicode strings, except that some of the characters are escaped as XML entity references rather than directly available as characters. Depending on how you're doing the XML parsing, you may be able to do it while parsing; otherwise, this definitely looks like a dup of the other question. – abarnert, Dec 18, 2012 at 8:17
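As a rough sketch of the "do it while parsing" option from the last comment: a conforming XML parser resolves numeric character references itself, so the text it hands back already contains the real characters. The example below assumes Python 3 and the standard library's xml.etree.ElementTree. The sample document and the names sample, root and names are made up for illustration and do not reflect the actual Excel XML layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real Excel XML file has a different, namespaced layout.
sample = "<names><name>Andr&#233;</name><name>Zo&#235;</name></names>"

root = ET.fromstring(sample)

# The parser has already turned &#233; and &#235; into the characters themselves.
names = [el.text for el in root.iter("name")]
print(names)
```

Named HTML entities such as &eacute; are not predefined in XML, so this route only covers numeric references (plus the five built-in XML entities) unless the document declares the others.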
2 Answers
If you just want to convert an HTML entity to its Unicode equivalent:
>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&#233;')
u'\xe9'
>>> print parser.unescape('&#233;')
é
This is for Python 2.x; for 3.x the import is import html.parser.
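On Python 3.4 and later, the standard library offers the same conversion as the function html.unescape, and the HTMLParser.unescape method was removed in Python 3.9. A minimal sketch of that route, assuming the input is already a str; the sample text and variable names are illustrative only:

```python
import html

# Sample input mixing a numeric reference (&#233;) and a named one (&eacute;).
escaped = "Andr&#233; and Ren&eacute;e"

# html.unescape handles both numeric and named character references.
unescaped = html.unescape(escaped)
print(unescaped)

# Encode to UTF-8 bytes only where bytes are actually needed (e.g. writing a file).
utf8_bytes = unescaped.encode("utf-8")
```

The result of unescaping is a Unicode string; "converting to UTF-8" is then a separate encode step.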
2 Comments
HTMLParser, and it doesn't actually work properly until either 2.6/3.0 or 2.7/3.1 (I forget which). So I don't think it's the ideal solution, except for a quick-and-dirty hack. There are better solutions (along with this one) on the question this is a dup of.
Using tips from this Q&A and the other one, I have a solution that seems to work. It takes an entire document and replaces every HTML entity in it with the character it stands for.
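import re
import HTMLParser

# Python 2; page is assumed to be a unicode string holding the whole document.
regexp = r"&#?\w+;"  # matches named (&eacute;) and numeric (&#233;, &#xe9;) entities
h = HTMLParser.HTMLParser()  # create the parser once, outside the loop
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces HTML entity with unescaped value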
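The find-and-replace loop can probably be collapsed into a single call, since the unescape functions accept a whole string and leave text without entities untouched. Below is a Python 3 sketch under that assumption. The helper names unescape_document and unescape_numeric_only and the sample page are hypothetical, not part of any library.

```python
import html
import re


def unescape_document(page):
    """Replace every character reference in page with its character."""
    return html.unescape(page)


def unescape_numeric_only(page):
    """Replace only decimal/hex references; leave &amp;, &lt; etc. alone."""
    def repl(match):
        ref = match.group(1)
        # A leading x/X marks a hexadecimal reference, otherwise it is decimal.
        code = int(ref[1:], 16) if ref[0] in "xX" else int(ref)
        return chr(code)
    return re.sub(r"&#([0-9]+|[xX][0-9a-fA-F]+);", repl, page)


# Made-up sample, not real Excel XML.
page = "<Cell>Jos&#233; &amp; Zo&#xEB;</Cell>"
print(unescape_document(page))
print(unescape_numeric_only(page))
```

The second variant may be the safer choice when the result is going to be parsed as XML again, because unescaping &amp; or &lt; inside markup can produce a document that is no longer well-formed.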