I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>£682m</p>")
>>> text = soup.find("p").string
>>> print text
£682m
How can I decode the HTML entities in text to get "682ドルm" instead of "£682m".
-
3related: Convert XML/HTML Entities into Unicode String in Pythonjfs– jfs2012年12月18日 19:01:10 +00:00Commented Dec 18, 2012 at 19:01
7 Answers 7
Python 3.4+
Use html.unescape():
import html
print(html.unescape('£682m'))
FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon.
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
- For Python 2.6-2.7 it's in
HTMLParser - For Python 3 it's in
html.parser
>>> try:
... # Python 2.6-2.7
... from HTMLParser import HTMLParser
... except ImportError:
... # Python 3
... from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
682ドルm
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
682ドルm
8 Comments
h.unescape(s).encode("utf-8"). The docs: """The definition provided here contains all the entities defined by XHTML 1.0 that can be handled using simple textual substitution in the Latin-1 character set (ISO-8859-1)"""Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.
Beautiful Soup 3
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>£682m</p>",
... convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>682ドルm</p>
Beautiful Soup 4
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>£682m</p>")
<html><body><p>682ドルm</p></body></html>
4 Comments
You can use replace_entities from w3lib.html library
In [202]: from w3lib.html import replace_entities
In [203]: replace_entities("£682m")
Out[203]: u'\xa3682m'
In [204]: print replace_entities("£682m")
682ドルm
Comments
Beautiful Soup 4 allows you to set a formatter to your output
If you pass in
formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:
print(soup.prettify(formatter=None))
# <html>
# <body>
# <p>
# Il a dit <<Sacré bleu!>>
# </p>
# </body>
# </html>
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
2 Comments
I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error using the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked...
import unicodedata
The dataframe object can be whatever you like, let's call it table...
table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
table.index+= 1
encode table data so that we can export it to out .html file in templates folder(this can be whatever location you wish :))
#this is where the magic happens
html_data=unicodedata.normalize('NFKD',table.to_html()).encode('ascii','ignore')
export normalized string to html file
file = open("templates/home.html","w")
file.write(html_data)
file.close()
Reference: unicodedata documentation
1 Comment
import html
myHtml = "<body><h1> How to use html.unescape() in Python </h1></body>"
encodedHtml = html.escape(myHtml)
print("Encoded HTML: ", encodedHtml)
decodedHtml = html.unescape(encodedHtml)
print("Decoded HTML: ", decodedHtml)
Output:
Encoded HTML: <body><h1> How to use html.unescape() in Python </h1></body>
Decoded HTML: <body><h1> How to use html.unescape() in Python </h1></body>
Comments
This probably isnt relevant here. But to eliminate these html entites from an entire document, you can do something like this: (Assume document = page and please forgive the sloppy code, but if you have ideas as to how to make it better, Im all ears - Im new to this).
import re
import HTMLParser
regexp = "&.+?;"
list_of_html = re.findall(regexp, page) #finds all html entites in page
for e in list_of_html:
h = HTMLParser.HTMLParser()
unescaped = h.unescape(e) #finds the unescaped value of the html entity
page = page.replace(e, unescaped) #replaces html entity with unescaped value
1 Comment
.unescape() does that for you. I don't understand why you and Rob have posted these overcomplicated solutions that roll their own entity matching when the accepted answer already clearly shows that .unescape() can find entities in the string.