This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2003年06月17日 03:09 by smroid, last changed 2022年04月10日 16:09 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| diff.txt | smroid, 2003年06月17日 03:09 | |||
| issue755670.diff | ezio.melotti, 2011年11月01日 16:11 | Failing test | review | |
| Messages (11) | |||
|---|---|---|---|
| msg44032 - (view) | Author: Steven Rosenthal (smroid) | Date: 2003年06月17日 03:09 | |
HTML examples seen in the wild that cause parse errors
in HTMLParser include:
<a width="100%"cellspacing=0>
-- note lack of space between val and next attr name
<a foo=>
-- trailing attribute has no value after =
<a href=javascript:popup('/popup/html.html')>
-- javascript fragment with embedded quotes
My patch contains improvements to the 'attrfind' and
'locatestarttagend' regexps that allow these examples
to parse.
The existing test_htmlparser.py unit test continues to
pass, except for the one test case where it considers
<a foo=> to be an error.
I commented out that case and added new test cases to
cover the examples above.
|
|||
| msg44033 - (view) | Author: Steven Rosenthal (smroid) | Date: 2003年06月17日 03:10 | |
Logged In: YES user_id=159908 Base version for HTMLParser.py is 1.11.2.1; base version for test_htmlparser.py is 1.8.8.1 |
|||
| msg44034 - (view) | Author: Steven Rosenthal (smroid) | Date: 2003年06月18日 03:22 | |
Logged In: YES user_id=159908 This also fixes bugs 683938 and 699079. |
|||
| msg44035 - (view) | Author: Titus Brown (titus) | Date: 2004年12月19日 00:42 | |
Logged In: YES user_id=23486 This patch allows developers to override the behavior of HTMLParser when parsing malformed HTML. Normally HTMLParser calls the function self.error(), which raises an exception. This patch adds appropriate return values for situations where self.error has been redefined in subclasses to *not* raise an exception. It does not change the default behavior of HTMLParser and so presents no backwards compatibility issues. The patch itself consists of an added comment and two added lines of code that call 'return' with appropriate values after a self.error call. Nothing wrong with 'em. I can't verify that the "junk characters" error call will leave the parser in a good state, though, if execution returns from error(). The library documentation could be updated to reflect the ability to override error() behavior; I've written a short patch, available at http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch More problems exist with markupbase.py, upon which HTMLParser is based. markupbase calls error() as well, and has some stickier situations. See comments in bug 917188 as well. Comments in 683938 and 699079 suggest that raising an exception is the correct response to the parse errors. I recommend application of the patch anyway, because it (a) doesn't change any behavior by default and (b) may solve some problems for people. An alternative would be to distinguish between unrecoverable errors and recoverable errors by having two different functions, e.g. error() (for recoverable errors) and _fail() (for unrecoverable errors). By default error() would call _fail() and internal code could be changed to call _fail() where recovery is impossible. This might alter behavior in situations where subclasses override error() but then again that's not legitimate to do anyway, at least not at the moment -- error() isn't in the docs ;). If nothing done, at least close patch 755660 and bug 736428 with a comment saying that this behavior will not be addressed ;). |
|||
| msg44036 - (view) | Author: Titus Brown (titus) | Date: 2004年12月19日 00:45 | |
Logged In: YES user_id=23486 whoops, attached to wrong patch! dangitall. sorry... |
|||
| msg44037 - (view) | Author: Titus Brown (titus) | Date: 2004年12月19日 06:58 | |
Logged In: YES user_id=23486 I don't think HTMLParser should parse clearly invalid HTML without complaining. Perhaps such behavior should be overrideable (see patch 755660 & comments therein), but this patch changes behavior to parse invalid HTML w/o complaint. Patch 699079, to "fix" similar behavior, was closed by adsr and bcannon because such behavior is correct. Patch 683938 is also similar but is being kept open for some reason. Recommend closing patch w/o applying. |
|||
| msg55984 - (view) | Author: T. Middleton (timtoo) | Date: 2007年09月17日 21:20 | |
I for one thank smroid for the patch. I also have hit *all* of these cases in the wild. This patch makes real-life a lot less frustrating. This patch is surely a lot more preferable than HTMLParser's tendency to just throw up its hands and quietly give up. And one might even make a case that in the horrific world of HTML "standards" case #2 and #3 can be considered actually valid (as much as it hurts to say so). |
|||
| msg114245 - (view) | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010年08月18日 16:32 | |
There are messages both for and against the patch which contains a unit test. Can we have a statement from a knowledgeable HTML person as to whether the patch should be accepted or rejected. |
|||
| msg146786 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011年11月01日 16:11 | |
Attached patch includes the tests in diff.txt. On Python 3, with strict=False, the first test (adjacent attributes) passes, but the other two still fail. See also #12629. |
|||
| msg147611 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011年11月14日 16:57 | |
New changeset 3c3009f63700 by Ezio Melotti in branch '2.7': #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. http://hg.python.org/cpython/rev/3c3009f63700 New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2': #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. http://hg.python.org/cpython/rev/16ed15ff0d7c New changeset 426f7a2b1826 by Ezio Melotti in branch 'default': #1745761, #755670, #13357, #12629, #1200313: merge with 3.2. http://hg.python.org/cpython/rev/426f7a2b1826 |
|||
| msg147619 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011年11月14日 17:07 | |
Fixed, thanks for the report! |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月10日 16:09:16 | admin | set | github: 38664 |
| 2011年11月14日 17:07:52 | ezio.melotti | set | status: open -> closed versions: + Python 2.7 messages: + msg147619 dependencies: - allow HTMLParser to continue after a parse error resolution: fixed stage: patch review -> resolved |
| 2011年11月14日 16:57:13 | python-dev | set | nosy:
+ python-dev messages: + msg147611 |
| 2011年11月14日 12:45:57 | ezio.melotti | set | assignee: ezio.melotti |
| 2011年11月01日 16:11:55 | ezio.melotti | set | files:
+ issue755670.diff versions: + Python 3.3 nosy: + ezio.melotti, - BreamoreBoy messages: + msg146786 type: enhancement -> behavior |
| 2010年08月18日 16:32:10 | BreamoreBoy | set | versions:
+ Python 3.2, - Python 2.7 nosy: + BreamoreBoy messages: + msg114245 stage: test needed -> patch review |
| 2009年02月12日 03:01:19 | ajaksu2 | set | dependencies:
+ HTMLParser attribute parsing bug, allow HTMLParser to continue after a parse error type: enhancement stage: test needed versions: + Python 2.7, - Python 2.3 |
| 2007年09月17日 21:20:47 | timtoo | set | nosy:
+ timtoo messages: + msg55984 |
| 2003年06月17日 03:09:17 | smroid | create | |