I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?
For example, I get back:
&#x01ce;
which represents an "ǎ" with a tone mark. In hexadecimal, this is the 16-bit value 01ce. I want to convert the HTML entity into the value u'\u01ce'.
related: Decode HTML entities in Python string? (comment by jfs, Feb 2, 2016 at 8:36)
10 Answers
The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
Python 2 (on Python 3.0 to 3.3, import HTMLParser from html.parser instead):
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'
Python 3.4+:
import html
html.unescape('&copy; 2010') # u'\xa9 2010'
html.unescape('&#169; 2010') # u'\xa9 2010'
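As a quick check on Python 3.4+, html.unescape appears to handle named, decimal, and hexadecimal references in a single call. The sample strings below are illustrative only:

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
samples = ["&copy; 2010", "&#169; 2010", "&#xa9; 2010"]
decoded = [html.unescape(s) for s in samples]
print(decoded)

# The character from the question, written as a hexadecimal reference.
a_caron = html.unescape("&#x01ce;")
print(a_caron == "\u01ce")
```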
4 Comments
... & or >.
... html.unescape() function in Python 3.4+
... UnicodeDecodeError with utf-8 strings. You must either decode('utf-8') it first or use xml.sax.saxutils.unescape.
Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.
Python developer Fredrik Lundh (author of ElementTree, among other things) has such a function on his website, which works with decimal, hex and named entities:
import re, htmlentitydefs
##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)
Use the builtin unichr -- BeautifulSoup isn't necessary:
>>> entity = '&#x01ce'
>>> unichr(int(entity[3:],16))
u'\u01ce'
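On Python 3, unichr no longer exists and chr covers the full Unicode range. Below is a minimal sketch of the same idea, assuming the input is a single numeric character reference; decode_numeric_reference is a hypothetical helper name, not a standard-library function:

```python
def decode_numeric_reference(entity):
    """Decode one numeric character reference such as '&#x01ce;' or '&#462;'.

    Hypothetical helper for illustration; raises ValueError on anything else.
    """
    if not entity.startswith("&#"):
        raise ValueError("not a numeric character reference: %r" % entity)
    body = entity[2:].rstrip(";")  # tolerate a missing trailing semicolon
    if body[:1] in ("x", "X"):
        return chr(int(body[1:], 16))  # hexadecimal form
    return chr(int(body, 10))  # decimal form

print(decode_numeric_reference("&#x01ce;") == "\u01ce")
```

For whole documents with mixed named and numeric entities, html.unescape is likely the simpler choice; this sketch only covers the single-reference case shown above.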
2 Comments
... try...catch the resulting exception for when you get it wrong.
unichr was removed in Python 3. Any suggestion for that version?
If you are on Python 3.4 or newer, you can simply use html.unescape:
import html
s = html.unescape(s)
An alternative, if you have lxml:
>>> import lxml.html
>>> lxml.html.fromstring('&#x01ce;').text
u'\u01ce'
2 Comments
... str if there is no special character.
You could find an answer here -- Getting international characters from a web page?
EDIT: It seems like BeautifulSoup doesn't convert entities written in hexadecimal form. It can be fixed:
import copy, re
from BeautifulSoup import BeautifulSoup
hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
# replace hexadecimal character reference by decimal one
hexentityMassage += [(re.compile('&#x([^;]+);'),
                      lambda m: '&#%d;' % int(m.group(1), 16))]
def convert(html):
    return BeautifulSoup(html,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage).contents[0].string
html = '<html>&#x01ce;&#462;</html>'
print repr(convert(html))
# u'\u01ce\u01ce'
EDIT:
The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.
5 Comments
html.unescape() is a better option on the modern Python.
... HTMLParser.HTMLParser().unescape() hack worked for you, using BeautifulSoup might be a better alternative than defining unescape() by hand (vendoring a pure Python lib vs. a copy-paste of the function).
This is a function which should help you to get it right and convert entities back to utf-8 characters.
import re
import htmlentitydefs
def unescape(text):
    """Removes HTML or XML character references
    and entities from a text string.
    @param text The HTML (or XML) source text.
    @return The plain text, as a Unicode string, if necessary.
    from Fredrik Lundh
    2008-01-03: input only unicode characters string.
    http://effbot.org/zone/re-sub.htm#unescape-html
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                print "Value Error"
        else:
            # named entity
            # reescape the reserved characters.
            try:
                if text[1:-1] == "amp":
                    text = "&amp;"
                elif text[1:-1] == "gt":
                    text = "&gt;"
                elif text[1:-1] == "lt":
                    text = "&lt;"
                else:
                    print text[1:-1]
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                print "keyerror"
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)
Not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d*;*'). If you don't, BeautifulSoup can barf because the adjacent character can be interpreted as part of the HTML code (e.g. &#39B for &#39Blackout).
This worked better for me:
import re
from BeautifulSoup import BeautifulSoup
html_string='<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can; on some shelves despite ban</a>'
hexentityMassage = [(re.compile('&#x([^;]+);'),
lambda m: '&#%d;' % int(m.group(1), 16))]
soup = BeautifulSoup(html_string,
convertEntities=BeautifulSoup.HTML_ENTITIES,
markupMassage=hexentityMassage)
- int(m.group(1), 16) converts the captured hexadecimal digits to an integer.
- m.group(0) returns the entire match; m.group(1) returns the first capturing group.
- Basically, using markupMassage is the same as:
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
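The same hex-to-decimal rewrite can be tried without BeautifulSoup. Here is a small Python 3 sketch using only re; the name hex_refs_to_decimal and the sample string are assumptions for illustration, and the pattern is narrowed to hexadecimal digits (unlike the [^;]+ used above) so that int() should not see malformed input:

```python
import re

# Matches hexadecimal character references such as &#x27; and captures the digits.
HEX_REF = re.compile(r"&#x([0-9a-fA-F]+);")

def hex_refs_to_decimal(markup):
    """Rewrite &#xHH; references as their decimal &#DD; equivalents."""
    return HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)

sample = "&#x27;Blackout in a can; on some shelves despite ban"
converted = hex_refs_to_decimal(sample)
print(converted)
```

Keeping the trailing ';' in the replacement is what stops the following character from being read as part of the reference.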
Another solution is the builtin library xml.sax.saxutils (both for html and xml). However, it will convert only &gt;, &amp; and &lt;.
from xml.sax.saxutils import unescape
escaped_text = unescape(text_to_escape)
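To illustrate that limitation, here is a short Python 3 sketch. The sample text is made up; the optional second argument is the entities mapping that xml.sax.saxutils.unescape accepts for extra replacements:

```python
from xml.sax.saxutils import unescape

text = "&lt;b&gt;Tom &amp; Jerry&lt;/b&gt; &copy; 2010"

# Only &lt;, &gt; and &amp; are handled by default; &copy; is left alone.
default_result = unescape(text)

# Extra replacements can be supplied as a mapping of entity string to replacement.
extended_result = unescape(text, {"&copy;": "\u00a9"})

print(default_result)
print(extended_result)
```

This is workable for XML-style escaping, but maintaining a mapping for every HTML entity by hand is probably not worth it when html.unescape is available.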
Here is the Python 3 version of dF's answer:
import re
import html.entities
def unescape(text):
    """
    Removes HTML or XML character references and entities from a text string.
    :param text: The HTML (or XML) source text.
    :return: The plain text, as a Unicode string, if necessary.
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = chr(html.entities.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    # raw string avoids an invalid-escape warning for \w on recent Python 3
    return re.sub(r"&#?\w+;", fixup, text)
The main changes are that htmlentitydefs is now html.entities and unichr is now chr. See this Python 3 porting guide.
2 Comments
... html.unescape(); why have a dog and bark yourself?
html.entities.entitydefs["apos"] does not exist, and html.unescape('can&apos;t') produces "can't", which uses U+0027 (') instead of the proper U+2019 (’) (or U+02BC, depending on which argument you follow). But I guess that's intended according to the character entity reference.
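The &apos; behaviour described above can be checked directly. A short sketch on Python 3.4+, relying on the html.entities tables (name2codepoint holds the HTML 4 names, html5 holds the HTML5 names with their trailing ';'):

```python
import html
import html.entities

# name2codepoint covers the HTML 4 entity names, which do not include "apos".
in_html4_table = "apos" in html.entities.name2codepoint

# The HTML5 table does define it (keys carry the trailing ';').
in_html5_table = "apos;" in html.entities.html5

# html.unescape() maps &apos; to the plain ASCII apostrophe U+0027.
result = html.unescape("can&apos;t")
print(in_html4_table, in_html5_table, result)
```

So a hand-rolled unescape built on name2codepoint would leave &apos; untouched, while html.unescape decodes it.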