This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: HTMLParser silently stops parsing with malformed attributes
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: eric.araujo, ezio.melotti, python-dev, r.david.murray, teoryn
Priority: normal
Keywords: patch

Created on 2011-07-24 18:35 by teoryn, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
test.py teoryn, 2011-07-24 18:35 Example of the broken behavior
issue12629.diff ezio.melotti, 2011-11-01 13:21 Failing test
Messages (8)
msg141051 - Author: Kevin Stock (teoryn) Date: 2011-07-24 18:35
Given the input '<x><y z=""o"" /></x>', HTMLParser only detects the opening x tag, and then stops parsing. Ideally this should behave like the case '<x><y z="""" /></x>' which raises an error and then can continue parsing the close x tag.
msg141174 - Author: Kevin Stock (teoryn) Date: 2011-07-26 18:07
A workaround is to call close() after feed(), which I suppose I should have done anyway. However, this does not resolve the issue that the two cases behave so differently.
The code that causes the difference is lines 351-355 of parser.py, which also has a misleading comment stating it detects the / in a /> ending (which is actually done at 334).
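
For readers who want to try the workaround, here is a minimal sketch of calling close() after feed(). It assumes Python 3, where the module is html.parser (on 2.7 it is HTMLParser); the TagLogger class and the parse() helper are hypothetical names used only for this illustration, and the attached test.py may look different.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records the tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse(markup):
    """Feed the markup, then close() so buffered input is processed too."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()  # the workaround described above
    return parser.events


# The two inputs from the original report.
for markup in ('<x><y z=""o"" /></x>', '<x><y z="""" /></x>'):
    print(markup, parse(markup))
```

On an interpreter that includes the fix for this issue, both inputs should reach the closing x tag; on an affected version the result may differ, which is the discrepancy reported here.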
msg146774 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-01 13:21
I think <x><y z=""o"" /></x> should be parsed as <x><y z="" /></x>, and the o"" should be ignored.
<x><y z="""" /></x> should be parsed as <x><y z="" /></x>, and the last two "" should be ignored. This is what Firefox seems to do.
Currently the parser doesn't seem to handle extraneous data in the start tag too well, because the locatestarttagend_tolerant regex looks for (more or less) well-formed attributes.
Attached a patch for test_htmlparser with the two examples provided by Kevin.
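
To illustrate why a pattern that expects well-formed attributes gives up on these inputs, here is a simplified sketch. STRICT_START_TAG and is_well_formed() are hypothetical names, and the pattern is NOT the locatestarttagend_tolerant regex from the standard library; it only mimics the general idea of requiring whitespace before each attribute.

```python
import re

# Hypothetical, simplified pattern (not the stdlib one): every attribute
# must be preceded by whitespace, and values are quoted or bare.
STRICT_START_TAG = re.compile(
    r"""<[a-zA-Z][^\s/>]*              # tag name
        (?:\s+[^\s/>=]+                # whitespace, then an attribute name
           (?:\s*=\s*                  # optional equals sign and value
              (?:"[^"]*"|'[^']*'|[^\s>"']+)
           )?
        )*
        \s*/?>                         # optional slash, then the closing bracket
    """,
    re.VERBOSE,
)


def is_well_formed(tag):
    """Return True if the whole start tag fits the simplified pattern."""
    return STRICT_START_TAG.fullmatch(tag) is not None


for tag in ('<y z="" />', '<y z=""o"" />', '<y z="""" />'):
    print(tag, is_well_formed(tag))
```

Only the first tag fits the pattern. In the other two, the text that follows the closing quote of z="" is neither whitespace nor the end of the tag, so a matcher of this kind cannot find where the start tag ends.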
msg146848 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-11-02 16:53
> This is what Firefox seems to do.
I think more confidence would be good. Doesn’t the HTML5 spec define that? Have you found their test suite? Do you have more than one browser known to be compliant (trick: not sure there is even one)?
msg146852 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-02 17:08
I haven't found anything in the HTML5 spec but I haven't looked closely.
I'll do some more research when I start working on an actual patch.
msg147192 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-06 22:51
http://www.w3.org/TR/html5/tokenization.html#before-attribute-name-state 
msg147612 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-14 16:57
New changeset 3c3009f63700 by Ezio Melotti in branch '2.7':
#1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser.
http://hg.python.org/cpython/rev/3c3009f63700
New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2':
#1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser.
http://hg.python.org/cpython/rev/16ed15ff0d7c
New changeset 426f7a2b1826 by Ezio Melotti in branch 'default':
#1745761, #755670, #13357, #12629, #1200313: merge with 3.2.
http://hg.python.org/cpython/rev/426f7a2b1826 
msg147620 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-14 17:12
Fixed, thanks for the report!
Apparently the correct way to parse <y z=""o"" /> is:
starttag y
attribute z with value ""
attribute o"" with no value
So this is what HTMLParser does now.
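
The parse described above can be checked with a short sketch like the following. It assumes Python 3 and an interpreter that contains this fix; AttrCollector and start_tags() are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that keeps each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <y ... /> because the default
        # handle_startendtag() delegates to handle_starttag().
        self.seen.append((tag, attrs))


def start_tags(markup):
    """Return (tag, attrs) pairs for every start tag in the markup."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


print(start_tags('<y z=""o"" />'))
```

An attribute without a value is reported with None as its value, so the o"" attribute shows up as the pair ('o""', None) next to ('z', '').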
History
Date User Action Args
2022-04-11 14:57:20 admin set github: 56838
2011-11-14 17:12:07 ezio.melotti set status: open -> closed; versions: + Python 2.7; messages: + msg147620; resolution: fixed; stage: needs patch -> resolved
2011-11-14 16:57:13 python-dev set nosy: + python-dev; messages: + msg147612
2011-11-14 12:44:28 ezio.melotti set assignee: ezio.melotti
2011-11-06 22:51:14 ezio.melotti set messages: + msg147192
2011-11-02 17:08:43 ezio.melotti set messages: + msg146852
2011-11-02 16:53:00 eric.araujo set messages: + msg146848
2011-11-01 13:21:41 ezio.melotti set files: + issue12629.diff; nosy: + ezio.melotti; messages: + msg146774; keywords: + patch; stage: needs patch
2011-07-29 16:24:57 eric.araujo set nosy: + eric.araujo, r.david.murray
2011-07-26 18:07:46 teoryn set messages: + msg141174
2011-07-24 18:35:07 teoryn create
