
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: allow HTMLParser error recovery
Type: enhancement Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution: duplicate
Dependencies: Superseder: allow HTMLParser to continue after a parse error
View: 755660
Assigned To: Nosy List: ajaksu2, georg.brandl, kingswood, smroid
Priority: normal Keywords:

Created on 2003-05-12 11:37 by smroid, last changed 2022-04-10 16:08 by admin. This issue is now closed.

Messages (5)
msg60329 - Author: Steven Rosenthal (smroid) Date: 2003-05-12 11:37
I'm using 2.3a2.
HTMLParser correctly raises a "malformed start tag"
error on:
<meta NAME=DESCRIPTION Content=Lands' End quality...
outerwear and more.> 
Because my application is imprecise by nature (web
scraping), I want to be able to continue after such errors.
I can override the error() method to not raise an
exception. To make this work, I also needed to alter
HTMLParser.py, near line 316, to read as:
            self.updatepos(i, j)
            self.error("malformed start tag")
            return j # ADDED THIS LINE
        raise AssertionError("we should not get here!")
My enhancement request is that, at every place where
self.error() is called, the code should ensure that the
"override error() to not raise an exception" continuation
strategy works as well as can be hoped.
Thanks,
Steve
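The continuation strategy requested above can be sketched as follows. This is only an illustration and not the HTMLParser source: ToyParser, LenientParser, check_start_tag and the "odd number of quote characters" test are hypothetical names and rules chosen for the example. What it shows is that once error() is allowed to return, the calling code needs a recovery path of its own.

```python
class ToyParser:
    """Hypothetical stand-in for a parser with an overridable error() hook."""

    def __init__(self):
        self.errors = []

    def error(self, message):
        # Default behaviour: abort parsing by raising.
        raise ValueError(message)

    def check_start_tag(self, rawdata, i):
        """Return the index just past the start tag that begins at i."""
        j = rawdata.find(">", i)
        if j == -1:
            return -1  # incomplete tag: caller should wait for more input
        tag = rawdata[i:j + 1]
        # Toy rule (an assumption for this sketch): an unbalanced quote
        # character makes the tag malformed.
        if tag.count("'") % 2 or tag.count('"') % 2:
            self.error("malformed start tag")
            # Reached only if error() was overridden not to raise:
            # skip past the tag so that parsing makes progress.
            return j + 1
        return j + 1


class LenientParser(ToyParser):
    def error(self, message):
        # Record the problem and return instead of raising.
        self.errors.append(message)


tag = "<meta NAME=DESCRIPTION Content=Lands' End quality>"
lenient = LenientParser()
end = lenient.check_start_tag(tag + "more text", 0)
```

With the lenient subclass the call returns the position after the bad tag and the message is recorded in lenient.errors, while the base class still raises on the same input.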
msg60330 - Author: Frank Vorstenbosch (kingswood) Date: 2004-03-16 09:53
Fixed by my patch against 2.3.3.
The patch adds recovery to ensure progress and tries to not
miss any data in the input.
The error() method is now commented as being overridable;
define def error(self, message): pass in a subclass to ignore
any parsing errors.
msg60331 - Author: Frank Vorstenbosch (kingswood) Date: 2004-04-03 18:04
This problem is actually more widespread than previously
indicated. Every call to self.error() must cope with that
function returning, and must recover (HTMLParser specifies
that every character in the input will be visited exactly
once). Other modules are affected as well.
In particular, feeding HTML (from spam) containing the tag
<!12345> into HTMLParser causes markupbase._scan_name to emit
an error from which it now needs to recover.
The patch in #917188 may be better than the one suggested
here as it deals with all places where self.error() can return.
More is needed to fix the problem completely.
In markupbase.py, at least the following change is necessary:
--- markupbase.py.orig Sat Apr 03 17:43:48 2004
+++ markupbase.py Sat Apr 03 18:02:48 2004
@@ -377,6 +377,8 @@
         else:
             self.updatepos(declstartpos, i)
             self.error("expected name token")
+            return None,rawdata.find(">",i)
     # To be overridden -- handlers for unknown objects
     def unknown_decl(self, data):
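The recovery added by the diff above can be tried in isolation with a small sketch. The function scan_name, its on_error callback and the name pattern are assumptions made for this example, not the markupbase code. After reporting the error, it resumes at the next ">" in the same way as the added return statement.

```python
import re

# Hypothetical name pattern for this sketch: a letter followed by name
# characters, plus any trailing whitespace.
_name_match = re.compile(r"[a-zA-Z][-_.a-zA-Z0-9]*\s*").match


def scan_name(rawdata, i, on_error):
    """Return (name, end_index); on a bad name return (None, index of next '>')."""
    if i == len(rawdata):
        return None, -1  # ran out of input
    m = _name_match(rawdata, i)
    if m:
        return m.group().strip().lower(), m.end()
    on_error("expected name token")
    # Reached only when on_error returns instead of raising.
    # str.find returns -1 when no ">" follows, so callers must still
    # treat a negative index as "cannot continue yet".
    return None, rawdata.find(">", i)


errors = []
bad = scan_name("<!12345> text", 2, errors.append)
good = scan_name("<!DOCTYPE html>", 2, errors.append)
```

Note that if no ">" follows the bad name, the second element of the result is -1, so the caller has to handle that case separately to keep making progress.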
msg81442 - Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-09 06:16
Superseder: issue 755660.
msg85553 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-04-05 18:45
Setting as superseder.
History
Date | User | Action | Args
2022-04-10 16:08:42 | admin | set | github: 38487
2009-04-05 18:45:17 | georg.brandl | set | status: open -> closed; nosy: + georg.brandl; messages: + msg85553; superseder: allow HTMLParser to continue after a parse error; resolution: duplicate
2009-02-09 06:16:29 | ajaksu2 | set | nosy: + ajaksu2; messages: + msg81442
2003-05-12 11:37:44 | smroid | create |
