Decode HTML entities in Python string?

Question 1

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Question 2

related: Convert XML/HTML Entities into Unicode String in Python

Question 3

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))
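
As a small sketch of what html.unescape() covers, the snippet below decodes a named entity plus the decimal and hexadecimal numeric references for the same character. Only the first sample string comes from the question; the other two are illustrative.

```python
import html

# Named, decimal and hexadecimal references to the same character (U+00A3).
samples = ['&pound;682m', '&#163;682m', '&#xA3;682m']
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Text without any entities passes through unchanged.
plain = html.unescape('no entities here')
print(plain)
```

All three forms should come out as the same string, so callers normally do not need to care which form the source document used.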

FYI html.parser.HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9, so use html.unescape() instead.

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

For Python 2.6-2.7 it's in HTMLParser
For Python 3 it's in html.parser

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
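
If one code base has to run on both old and new interpreters, the two approaches above can be combined. This is only a sketch: the name unescape_html is made up for illustration, and it assumes html.unescape should be preferred whenever it can be imported.

```python
try:
    # Python 3.4+: the documented, supported function.
    from html import unescape as unescape_html
except ImportError:
    try:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    except ImportError:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    # Fall back to the old, undocumented method on a parser instance.
    unescape_html = HTMLParser().unescape

print(unescape_html('&pound;682m'))
```

Trying html.unescape first means the deprecated method is only touched on interpreters that have nothing better.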

Question 4

this method doesn't seem to unescape characters like "’" on Google App Engine, though it works locally on Python 2.6. It does still decode entities (like &quot;) at least

Question 5

How can an undocumented API be deprecated? Edited the answer.

Question 6

@MarkusUnterwaditzer there's no reason that an undocumented method can't be deprecated. This one throws deprecation warnings - see my edit to the answer.

Question 7

Worth noting for Python 2: Special characters are replaced with their Latin-1 (ISO-8859-1) encoding counterparts. E.g., it may be necessary to h.unescape(s).encode("utf-8"). The docs: """The definition provided here contains all the entities defined by XHTML 1.0 that can be handled using simple textual substitution in the Latin-1 character set (ISO-8859-1)"""
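
For comparison, on Python 3 html.unescape() returns an ordinary str, so an explicit encode is only needed when bytes are required (for example, when writing to a binary file). A minimal sketch, with variable names chosen just for illustration:

```python
import html

text = html.unescape('&pound;682m')     # a str containing U+00A3
utf8_bytes = text.encode('utf-8')       # pound sign becomes two bytes
latin1_bytes = text.encode('latin-1')   # pound sign becomes one byte

print(repr(text), utf8_bytes, latin1_bytes)
```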

Question 8

It does not work for 'Don&‌#039;t forget that &‌pi; = 3.14 &‌amp; doesn&‌#039;t equal 3.' WHY is that?
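
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the string as posted in this comment contains an invisible zero-width non-joiner (U+200C) after each "&", which stops the entities from being recognised. The sketch below, for Python 3.4+, decodes a clean copy of the string and then shows that the same text with U+200C inserted is left alone.

```python
import html

# Clean version of the commenter's string (no hidden characters).
clean = 'Don&#039;t forget that &pi; = 3.14 &amp; doesn&#039;t equal 3.'
decoded = html.unescape(clean)
print(decoded)

# Same text with a zero-width non-joiner after every '&', as in the comment.
broken = clean.replace('&', '&\u200c')
still_broken = html.unescape(broken)
print(still_broken == broken)
```

If the hidden characters are the cause, stripping them first (for instance with broken.replace('\u200c', '')) should make the decode work.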

Question 9

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>",
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

Question 10

+1. No idea how I missed this in the docs: thanks for the info. I'm going to accept luc's answer though, because his uses the standard lib, which I specified in the question (not important to me), and it's probably of more general use to other people.

Question 11

BeautifulSoup4 uses HTMLParser, mostly. See the source

Question 12

How do we get the conversion in Beautiful Soup 4 without all the extraneous HTML that wasn't part of the original string? (i.e. <html> and <body>)

Question 13

@Praxiteles : BeautifulSoup('&pound;682m', "html.parser") stackoverflow.com/a/14822344/4376342

Question 14

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities
In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'
In [204]: print replace_entities("&pound;682m")
£682m

Question 15

Beautiful Soup 4 allows you to set a formatter for your output. From the docs:

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

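from bs4 import BeautifulSoup
# Assumption: soup is built as in the Beautiful Soup docs example this output comes from
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>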
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

Question 16

This doesn't answer the question. (Also, I have no idea what the docs are saying is invalid about the final bit of HTML here.)

Question 17

<<Sacré bleu!>> is the invalid part, as it has unescaped < and > and will break the html around it. I know this is a late post from me, but in case anyone happens to be looking and wondered...

Question 18

I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error using the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked...

 import unicodedata
 import pandas as pd

The dataframe object can be whatever you like, let's call it table...

 table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
 table.index+= 1

Encode the table data so that we can export it to our .html file in the templates folder (this can be whatever location you wish :))

 #this is where the magic happens
 html_data=unicodedata.normalize('NFKD',table.to_html()).encode('ascii','ignore')

Export the normalized data to the HTML file:

 # html_data is bytes after .encode(), so open the file in binary mode
 with open("templates/home.html", "wb") as f:
     f.write(html_data)

Reference: unicodedata documentation

Question 19

This does not answer the question. A Unicode normalization form is not the same as HTML entities.
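
To make the difference concrete, here is a short sketch (Python 3, variable names are illustrative) comparing the two operations on the string from the question. Normalization leaves the entity alone because it is plain ASCII text, and the encode('ascii', 'ignore') step from the answer above drops the pound sign instead of preserving it.

```python
import html
import unicodedata

raw = '&pound;682m'

# Normalization only rewrites Unicode characters; the entity is ordinary
# ASCII text, so it comes back untouched.
normalized = unicodedata.normalize('NFKD', raw)

# Entity decoding is what actually produces the pound sign.
decoded = html.unescape(raw)

# Encoding with errors='ignore' silently discards the non-ASCII character.
ascii_only = unicodedata.normalize('NFKD', decoded).encode('ascii', 'ignore')

print(normalized, decoded, ascii_only)
```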

Question 20

import html
 
myHtml = "<body><h1> How to use html.unescape() in Python </h1></body>"
encodedHtml = html.escape(myHtml)
print("Encoded HTML: ", encodedHtml)
decodedHtml = html.unescape(encodedHtml)
 
print("Decoded HTML: ", decodedHtml)

Output:

Encoded HTML: &lt;body&gt;&lt;h1&gt; How to use html.unescape() in Python &lt;/h1&gt;&lt;/body&gt;
Decoded HTML: <body><h1> How to use html.unescape() in Python </h1></body>


Question 21

This probably isn't relevant here, but to eliminate these HTML entities from an entire document, you can do something like this (assume document = page, and please forgive the sloppy code; if you have ideas for how to make it better, I'm all ears, as I'm new to this):

import re
import HTMLParser  # Python 2 module name
regexp = "&.+?;"
h = HTMLParser.HTMLParser()  # create the parser once, outside the loop
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces the HTML entity with its unescaped value

Question 22

No! You don't need to match HTML entities yourself and loop over them; .unescape() does that for you. I don't understand why you and Rob have posted these overcomplicated solutions that roll their own entity matching when the accepted answer already clearly shows that .unescape() can find entities in the string.
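
As a sketch of the point being made here, a single call handles every entity in a longer string, with no regex or loop. The page content below is invented for illustration, and the example uses html.unescape() from Python 3.4+ rather than the Python 2 method in the answer above.

```python
import html

# Hypothetical document fragment containing several different entities.
page = '<p>Caf&eacute; prices: &pound;3 &amp; up, &lt;today only&gt;</p>'

# One call decodes all of them in place.
unescaped_page = html.unescape(page)
print(unescaped_page)
```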

luc · Accepted Answer · 2010-01-18 16:17:50Z


answered Jan 18, 2010 at 16:17 by luc · edited Jul 2, 2019 at 17:13 by wjandrea

Comments on this answer, in order:
gfxmonk · 2010-07-10T14:40:38.61Z
Markus Unterwaditzer · 2015-06-05T18:15:19.943Z
Mark Amery · 2015-11-25T15:06:41.92Z
anonymous coward · 2018-09-05T15:03:24.12Z
canbax · 2020-05-01T09:15:53.017Z
