
Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python

I am reading an Excel XML document using Python. I end up with a lot of escaped character sequences such as &#233;

These represent various accented letters (and the like). Is there an easy way to convert them to UTF-8?

asked Dec 18, 2012 at 8:09
  • You'll need to give more details. Usually it is relatively easy to encode and decode in Python, provided you understand what is going on. Commented Dec 18, 2012 at 8:11
  • In particular, are you using Python 2 or 3, do you have byte strings or Unicode strings, and if byte strings, what character set are they in? (It may also help to know which module you're using to read/parse the document.) Commented Dec 18, 2012 at 8:13
  • Thanks Marijn for the quick response. I think the main problem I am facing is that I don't know what encoding this is. I get the sense that it's not really an "encoding", but rather something specific to XML. In terms of more info, I don't really have any. I have a list of names with "encodings" such as the one above all over the place. The names are from various countries, hence the various accented characters. Commented Dec 18, 2012 at 8:13
  • I'm using Python 2. The string comes in as bytes (it is from an Excel XML file), but I convert it to Unicode using .decode("utf-8"), and the character set is UTF-8. Commented Dec 18, 2012 at 8:16
  • OK, so you have properly decoded Unicode strings, except that some of the characters are escaped as XML entity references rather than directly available as characters. Depending on how you're doing the XML parsing, you may be able to do it while parsing; otherwise, this definitely looks like a dup of the other question. Commented Dec 18, 2012 at 8:17
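
To illustrate the "while parsing" option from the last comment: an XML parser resolves numeric character references itself, so text taken from the parsed tree already contains real characters. This is only a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree module; the sample document, the element names, and the &#233; reference are made up for illustration and are not from the asker's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document containing a numeric character reference.
xml_bytes = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b"<names><name>Ren&#233;</name></names>"
)

root = ET.fromstring(xml_bytes)
name = root.find("name").text  # the parser has already resolved &#233;

# Encode to UTF-8 only at the point where bytes are actually needed.
utf8_bytes = name.encode("utf-8")
print(name)
```

If the entities survive parsing (for example, because the text was double-escaped in the file), they have to be unescaped afterwards, as the answers below do.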

2 Answers

1

If you just want to convert an HTML entity to its Unicode equivalent:

>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&#233;')
u'\xe9'
>>> print parser.unescape('&#233;')
é

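This is for Python 2.x; for 3.x the import is import html.parser. On Python 3.4 and later, the documented equivalent is html.unescape(); HTMLParser.unescape() was deprecated in 3.4 and removed in Python 3.9.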
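
For Python 3, the same conversion might look like the sketch below. It assumes Python 3.4 or later, where the standard library provides html.unescape(); the helper name unescape_entities and the sample string are hypothetical, chosen only for illustration.

```python
import html


def unescape_entities(text: str) -> str:
    """Hypothetical helper: replace character references with their characters."""
    return html.unescape(text)


# Sample input mixing decimal and named references (made up for this sketch).
decoded = unescape_entities("Ren&#233;e &amp; Zo&euml;")

# The result is a str; encode it explicitly if UTF-8 bytes are required.
utf8_bytes = decoded.encode("utf-8")
print(decoded)
```

Note that the result is a Unicode string, not "UTF-8" as such; the encoding step is separate and only matters when writing bytes out.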

answered Dec 18, 2012 at 8:21

2 Comments

This is an undocumented function that just happens to be in the CPython implementation of HTMLParser—and it doesn't actually work properly until either 2.6/3.0 or 2.7/3.1 (I forget which). So I don't think it's the ideal solution, except for a quick&dirty hack. There are better solutions (along with this one) on the question this is a dup of.
Using the tips from this QandA and the other, I have the following solution which seems to work:
0

Using tips from this Q&A and the other one, I have a solution that seems to work. It takes an entire document and replaces every HTML entity in it with the corresponding unescaped character.

import re
import HTMLParser  # Python 2 module

# page is assumed to already hold the document as a unicode string
h = HTMLParser.HTMLParser()  # a single parser instance can be reused
regexp = "&.+?;"
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces the HTML entity with its unescaped value
answered Dec 18, 2012 at 18:34

1 Comment

Obviously, one downside of the above code is that if the same HTML entity appears more than once in the page (as it almost always does), the same replace call runs multiple times. It's an easy fix: remove the repeats from list_of_html (for example, by converting it to a set) before running the replace loop.
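
One way to sidestep the repeated replace calls altogether is to substitute every entity in a single pass with re.sub and a callback. This is a sketch rather than the answerer's method: it assumes Python 3.4+ (html.unescape), and the names ENTITY_RE and unescape_page, the tighter pattern, and the sample text are all hypothetical.

```python
import html
import re

# Matches named (&eacute;), decimal (&#233;) and hexadecimal (&#xE9;) references.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_page(page: str) -> str:
    """Hypothetical helper: unescape each entity once, in one left-to-right pass."""
    return ENTITY_RE.sub(lambda m: html.unescape(m.group(0)), page)


# Made-up sample with a repeated entity and a double-escaped one.
sample = "Ren&#233; and Ren&#233; met Zo&#xEB;. Literal: &amp;lt;"
result = unescape_page(sample)
print(result)
```

Because each match is replaced exactly once, a double-escaped sequence such as &amp;lt; comes out as the literal text &lt; instead of being unescaped a second time, which a loop of page.replace calls can do. In practice, calling html.unescape(page) on the whole string would usually suffice; the explicit pattern just mirrors the regex-based approach above.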
