Python, XML, &#233 type encodings [duplicate]

Question 1

Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python

I am reading an excel XML document using Python. I end up with a lot of characters such as é

That represent various accented letters (and the like). Is there an easy way to convert these characters to utf-8?

Question 2

You'll need to give more details. Usually it is relatively easy to encode and decode in python, provided you understand what is going on.

Question 3

In particular, are you using Python 2 or 3, do you have byte strings or Unicode strings, and if byte strings what character set are they in? (It also may help to know which module you're using to read/parse the document.)

Question 4

Thanks Marijn for the quick response. I think the main problem I am facing is that I dont know what encoding this is. I get the sense that its not an "encoding" really, rather something specific to xml. In terms of more info, I dont really have any. I have a list of names with "encodings" such as the one above all over the place. The names are from various countries, thus, the various accented characters.

Question 5

Using Python2, string comes in as bytes (string is from an excel xml file), but I convert it to unicode using .decode("utf-8"), and the set is utf-8.

Question 6

OK, so you have properly-decoded Unicode strings, except that some of the characters are escaped as XML entity references rather than directly available as characters. Depending on how you're doing the XML parsing, you may be able to do it while parsing; otherwise, this definitely looks like a dup of the other question.

Question 7

If you just want to parse the HTML entity to its unicode equivalent:

>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&#233;')
u'\xe9'
>>> print parser.unescape('&#233;')
é

This is for Python 2.x, for 3.x the import is import html.parser

Question 8

This is an undocumented function that just happens to be in the CPython implementation of HTMLParser—and it doesn't actually work properly until either 2.6/3.0 or 2.7/3.1 (I forget which). So I don't think it's the ideal solution, except for a quick&dirty hack. There are better solutions (along with this one) on the question this is a dup of.

Question 9

Using the tips from this QandA and the other, I have the following solution which seems to work:

Question 10

Using tips from this QandA and the other one, I have a solution that seems to work. It takes an entire document and removes all html entities from the document.

import re
import HTMLParser
regexp = "&.+?;" 
list_of_html = re.findall(regexp, page) #finds all html entites in page
for e in list_of_html:
 h = HTMLParser.HTMLParser()
 unescaped = h.unescape(e) #finds the unescaped value of the html entity
 page = page.replace(e, unescaped) #replaces html entity with unescaped value

Question 11

Obviously, one downside of the above code is that if the same html entity appears more than once in the page (as it almost always does), the above code will run the same replace call multiple times. Its an easy fix, just have to remove all repeats from list_of_html set before running the replace loop.

Burhan Khalid 175k20 gold badges255 silver badges292 bronze badges · Accepted Answer · 2012-12-18 08:21:52Z

1

If you just want to parse the HTML entity to its unicode equivalent:

>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&#233;')
u'\xe9'
>>> print parser.unescape('&#233;')
é

This is for Python 2.x, for 3.x the import is import html.parser

Share

Improve this answer

answered Dec 18, 2012 at 8:21

Burhan Khalid's user avatar

Burhan Khalid

175k20 gold badges255 silver badges292 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

abarnert

abarnert Over a year ago

This is an undocumented function that just happens to be in the CPython implementation of HTMLParser—and it doesn't actually work properly until either 2.6/3.0 or 2.7/3.1 (I forget which). So I don't think it's the ideal solution, except for a quick&dirty hack. There are better solutions (along with this one) on the question this is a dup of.

2012年12月18日T08:30:29.01Z+00:00

Neil Aggarwal

Neil Aggarwal Over a year ago

Using the tips from this QandA and the other, I have the following solution which seems to work:

2012年12月18日T18:29:54.233Z+00:00

CollectivesTM on Stack Overflow

Python, XML, &#233 type encodings [duplicate]

2 Answers 2

2 Comments

1 Comment

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Linked

Related