Using Python 3.2, I attempted the example straight from the html.parser documentation:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser(strict=False)
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
Instead of the result shown in the documentation, I get:
Encountered some data : <html>
Encountered some data : <head>
Encountered some data : <title>
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered some data : <body>
Encountered some data : <h1>
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
For some reason, start tags are treated as data, but only if strict=False. With strict=True I get the correct result:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
This seems like a bug: at least the documentation should be changed. Consider filing it: bugs.python.org (Thomas K, commented Feb 18, 2012 at 16:27)
1 Answer
This was a bug that has since been fixed (http://bugs.python.org/issue13273). In fact, the log at http://hg.python.org/cpython/log/9ce5d456138b/Lib/html/parser.py contains a whole lot of messages about problems with strict=False; it almost feels like this mode should still be considered beta.
If you take the most recent version of the file (http://hg.python.org/cpython/raw-file/9ce5d456138b/Lib/html/parser.py) and use that, at least the example from the documentation works again. Still, personally I would be a bit wary of trusting strict=False in "critical applications" at the moment.
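To check whether a given installation is affected, something along these lines might help. It is only a rough sketch, not part of the fix itself: RecordingParser, parse_events and start_tags_misreported are made-up names for illustration, and the fallback in __init__ assumes that later Python versions (3.5 and up) no longer accept the strict argument and always parse non-strictly.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Collects (event, value) tuples instead of printing them."""

    def __init__(self):
        try:
            # Python 3.2 to 3.4 accept the strict flag (the case from the question).
            HTMLParser.__init__(self, strict=False)
        except TypeError:
            # Assumption: newer versions dropped the flag and are always non-strict.
            HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = RecordingParser()
    parser.feed(markup)
    parser.close()
    return parser.events


def start_tags_misreported(markup='<html><head><title>Test</title></head>'
                                  '<body><h1>Parse me!</h1></body></html>'):
    """Return True if any start tag came back as data (the symptom above)."""
    return any(kind == "data" and value.startswith("<")
               for kind, value in parse_events(markup))


if __name__ == "__main__":
    print("affected by the bug:", start_tags_misreported())
```

If start_tags_misreported() returns True, the installed html/parser.py still shows the behaviour from the question, and switching to strict=True or to the updated file is probably the safer choice.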