This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: HTMLParser.locatestartagend regex too stringent
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: duplicate
Dependencies: Superseder: HTMLParser : A auto-tolerant parsing mode
View: 1486713
Assigned To: Nosy List: ajaksu2, dyoo, r.david.murray
Priority: normal Keywords: easy, patch

Created on 2004-11-01 18:05 by dyoo, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
HTMLParser.py.diff dyoo, 2004-11-01 18:05 diff against Lib/HTMLParser.py from Python 2.3.3
Messages (6)
msg22976 - (view) Author: Danny Yoo (dyoo) Date: 2004-11-01 18:05
In Python 2.3.3, HTMLParser uses a certain regex that
is too stringent, and it does not capture slightly
malformed HTML gracefully.
The current definition of HTMLParser.locatestarttagend:
locatestarttagend = re.compile(r"""
 <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
 (?:\s+ # whitespace before attribute name
 (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
 (?:\s*=\s* # value indicator
 (?:'[^']*' # LITA-enclosed value
 |\"[^\"]*\" # LIT-enclosed value
 |[^'\">\s]+ # bare value
 )
 )?
 )
 )*
 \s* # trailing whitespace
""", re.VERBOSE)
does not capture strings like:
 <IMG SRC = "abc.jpg"WIDTH=5>
where there is no space between the closing quote and
the next attribute name. Many sources of HTML are
slightly malformed this way --- in particular, CNN.com
--- so being slightly lenient might be good. We can
slightly relax the constraint:
locatestarttagend = re.compile(r"""
 <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
 (?:\s* # optional whitespace before attribute name
 (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
 (?:\s*=\s* # value indicator
 (?:'[^']*' # LITA-enclosed value
 |\"[^\"]*\" # LIT-enclosed value
 |[^'\">\s]+ # bare value
 )
 )?
 )
 )*
 \s* # trailing whitespace
""", re.VERBOSE)
which allows the parser to process more of the HTML out
there.
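The difference between the two patterns can be checked directly with the re module. The following is a minimal sketch, not the HTMLParser source: the helper names build_locatestarttagend and unmatched_rest are hypothetical and exist only for this illustration, and the pattern body is copied from the definitions above with the whitespace quantifier made a parameter.

```python
import re


def build_locatestarttagend(sep):
    """Build the start-tag regex; `sep` is the quantifier ('+' or '*')
    applied to the whitespace before each attribute name.
    (Hypothetical helper, for illustration only.)"""
    return re.compile(rf"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      (?:\s{sep}                         # whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |"[^"]*"                   # LIT-enclosed value
              |[^'">\s]+                 # bare value
            )
          )?
        )
      )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)


def unmatched_rest(pattern, text):
    """Return the part of `text` that the pattern did not consume."""
    match = pattern.match(text)
    return text[match.end():]


strict = build_locatestarttagend("+")   # current behaviour: \s+ required
relaxed = build_locatestarttagend("*")  # proposed behaviour: \s* allowed

tag = '<IMG SRC = "abc.jpg"WIDTH=5>'

# The strict pattern stops right after the closing quote, so the parser
# sees "WIDTH=5>" where it expects the end of the tag.
print(repr(unmatched_rest(strict, tag)))

# The relaxed pattern consumes the second attribute and leaves only ">".
print(repr(unmatched_rest(relaxed, tag)))
```

For a well-formed tag such as `<IMG SRC="abc.jpg" WIDTH=5>`, both patterns consume everything up to the closing `>`, so the relaxed form only changes the outcome for input with missing whitespace.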
See:
http://mail.python.org/pipermail/tutor/2004-October/032835.html
and:
http://mail.python.org/pipermail/tutor/2004-October/032869.html
for an explanation of what motivates this change.
Thanks!
msg82104 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-14 18:45
The regex is still the same. This is one of many 'HTMLParser regex for
attributes' issues.
msg114390 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-19 18:21
I'll close this in a couple of weeks unless anyone objects.
msg115604 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-04 18:44
No reply to msg114390.
msg115623 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-05 02:45
Closing this issue as out of date was inappropriate. It may be a duplicate, but someone with an interest should go through and evaluate all the related 'tolerant HTML parser' issues.
Issue 1486713 could perhaps serve as a master issue for this set.
msg123168 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-03 03:00
Closing this in favor of 1486713, which has a patch and covers additional issues.
History
Date User Action Args
2022-04-11 14:56:07 admin set github: 41113
2010-12-03 03:00:05 r.david.murray set status: open -> closed; superseder: HTMLParser : A auto-tolerant parsing mode; nosy: - BreamoreBoy; messages: + msg123168; resolution: duplicate; stage: test needed -> resolved
2010-09-05 02:45:05 r.david.murray set status: closed -> open; nosy: + r.david.murray; messages: + msg115623; resolution: out of date -> (no value)
2010-09-04 18:47:21 BreamoreBoy set status: open -> closed; resolution: out of date
2010-09-04 18:44:33 BreamoreBoy set status: pending -> open; messages: + msg115604
2010-08-19 18:21:51 BreamoreBoy set status: open -> pending; versions: + Python 3.2, - Python 2.7; nosy: + BreamoreBoy; messages: + msg114390
2009-02-14 18:45:26 ajaksu2 set versions: + Python 2.7, - Python 2.3; nosy: + ajaksu2; messages: + msg82104; keywords: + patch, easy; type: enhancement; stage: test needed
2004-11-01 18:05:39 dyoo create