Author dyoo
Date 2004-11-01.18:05:39
Content
In Python 2.3.3, HTMLParser matches start tags with a regex
that is too strict, so it does not handle slightly
malformed HTML gracefully.
The current definition of HTMLParser.locatestarttagend:
locatestarttagend = re.compile(r"""
 <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
 (?:\s+ # whitespace before attribute name
 (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
 (?:\s*=\s* # value indicator
 (?:'[^']*' # LITA-enclosed value
 |\"[^\"]*\" # LIT-enclosed value
 |[^'\">\s]+ # bare value
 )
 )?
 )
 )*
 \s* # trailing whitespace
""", re.VERBOSE)
does not match the whole start tag in strings like:
 <IMG SRC = "abc.jpg"WIDTH=5>
where there is no space between the closing quote and
the next attribute name. Many sources of HTML are
slightly malformed in this way (CNN.com in particular),
so some leniency would be useful. We can relax the
constraint by making that whitespace optional:
locatestarttagend = re.compile(r"""
 <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
 (?:\s* # optional whitespace before attribute name
 (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
 (?:\s*=\s* # value indicator
 (?:'[^']*' # LITA-enclosed value
 |\"[^\"]*\" # LIT-enclosed value
 |[^'\">\s]+ # bare value
 )
 )?
 )
 )*
 \s* # trailing whitespace
""", re.VERBOSE)
which allows the parser to process more of the HTML out
there.
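As a rough illustration of the difference, the sketch below builds both patterns from one template and checks where each match stops on the example tag. The names TEMPLATE, strict, relaxed, sample and stopped_at are made up for this example and are not part of HTMLParser; the snippet only exercises the regex, not the parser itself.

```python
import re

# Shared template for both patterns; %s is the whitespace rule placed
# before each attribute name (the only part this report changes).
TEMPLATE = r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:%s                              # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
        )
      )?
    )
  )*
  \s*                                # trailing whitespace
"""

strict = re.compile(TEMPLATE % r"\s+", re.VERBOSE)   # current: whitespace required
relaxed = re.compile(TEMPLATE % r"\s*", re.VERBOSE)  # proposed: whitespace optional

sample = '<IMG SRC = "abc.jpg"WIDTH=5>'


def stopped_at(pattern, text):
    """Return the character right after the matched prefix."""
    end = pattern.match(text).end()
    return text[end:end + 1]


print(repr(stopped_at(strict, sample)))   # 'W': the match stops before WIDTH
print(repr(stopped_at(relaxed, sample)))  # '>': the match reaches the tag end
```

With the current pattern the match ends right after the closing quote, so the parser never sees the ">" it expects next; with the relaxed pattern the match runs up to the ">".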
See:
http://mail.python.org/pipermail/tutor/2004-October/032835.html
and:
http://mail.python.org/pipermail/tutor/2004-October/032869.html
for an explanation of what motivates this change.
Thanks!
History
Date User Action Args
2007-08-23 14:27:13 admin link issue1058305 messages
2007-08-23 14:27:13 admin create
