Created on 2004-11-01 18:05 by dyoo, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| HTMLParser.py.diff | dyoo, 2004-11-01 18:05 | diff against Lib/HTMLParser.py from Python 2.3.3 |
| Messages (6) | |||
|---|---|---|---|
| msg22976 | Author: Danny Yoo (dyoo) | Date: 2004-11-01 18:05 | |
In Python 2.3.3, HTMLParser uses a regex that is too strict, so it does not handle slightly malformed HTML gracefully. The current definition of HTMLParser.locatestarttagend:

    locatestarttagend = re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      (?:\s+                             # whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
         )
       )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)

does not capture start tags like:

    <IMG SRC = "abc.jpg"WIDTH=5>

where there is no space between the closing quote and the next attribute name. Many sources of HTML are slightly malformed this way (CNN.com in particular), so being slightly lenient might be good. We can relax the constraint by making the whitespace before an attribute name optional:

    locatestarttagend = re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      (?:\s*                             # optional whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
         )
       )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)

which allows the parser to process more of the HTML out there.

See:

    http://mail.python.org/pipermail/tutor/2004-October/032835.html

and:

    http://mail.python.org/pipermail/tutor/2004-October/032869.html

for an explanation of what motivates this change. Thanks! |
|||
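The difference between the two patterns in msg22976 can be checked directly. The following is a minimal sketch, not part of the attached patch: the helper `build_locatestarttagend` and the variable names are made up for illustration, and the pattern text is copied from the message with only the whitespace quantifier varied.

```python
import re


def build_locatestarttagend(ws_quantifier):
    """Compile the start-tag regex from msg22976.

    ws_quantifier is '+' for the original pattern (whitespace required
    before each attribute name) or '*' for the proposed relaxed pattern.
    This helper is hypothetical; HTMLParser defines the regex inline.
    """
    return re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      (?:\s""" + ws_quantifier + r"""    # whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
           )?
         )
       )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)


sample = '<IMG SRC = "abc.jpg"WIDTH=5>'

# Where each pattern stops matching, measured from the start of the tag.
strict_end = build_locatestarttagend("+").match(sample).end()
relaxed_end = build_locatestarttagend("*").match(sample).end()

print(repr(sample[strict_end:]))   # 'WIDTH=5>'  (stops after the quoted value)
print(repr(sample[relaxed_end:]))  # '>'         (consumes WIDTH=5 as well)
```

The original pattern stops right after the closing quote, so whatever follows the match is not the `>` that ends the tag, while the relaxed pattern reaches it. This only exercises the regex on its own; how the surrounding parser code reacts to each match end is not shown here.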
| msg82104 | Author: Daniel Diniz (ajaksu2) * (Python triager) | Date: 2009-02-14 18:45 | |
The regex is still the same. This is one of many 'HTMLParser regex for attributes' issues. |
|||
| msg114390 | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-08-19 18:21 | |
I'll close this in a couple of weeks unless anyone objects. |
|||
| msg115604 | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-09-04 18:44 | |
No reply to msg114390. |
|||
| msg115623 | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2010-09-05 02:45 | |
Closing this issue as out of date was inappropriate. It may be a duplicate, but someone with an interest should go through and evaluate all the related 'tolerant HTML parser' issues. Issue 1486713 could perhaps serve as a master issue for this set. |
|||
| msg123168 | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2010-12-03 03:00 | |
Closing this in favor of issue 1486713, which has a patch and covers additional issues. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:07 | admin | set | github: 41113 |
| 2010-12-03 03:00:05 | r.david.murray | set | status: open -> closed superseder: HTMLParser : A auto-tolerant parsing mode nosy: - BreamoreBoy messages: + msg123168 resolution: duplicate stage: test needed -> resolved |
| 2010-09-05 02:45:05 | r.david.murray | set | status: closed -> open nosy: + r.david.murray messages: + msg115623 resolution: out of date -> (no value) |
| 2010-09-04 18:47:21 | BreamoreBoy | set | status: open -> closed resolution: out of date |
| 2010-09-04 18:44:33 | BreamoreBoy | set | status: pending -> open messages: + msg115604 |
| 2010-08-19 18:21:51 | BreamoreBoy | set | status: open -> pending versions: + Python 3.2, - Python 2.7 nosy: + BreamoreBoy messages: + msg114390 |
| 2009-02-14 18:45:26 | ajaksu2 | set | versions: + Python 2.7, - Python 2.3 nosy: + ajaksu2 messages: + msg82104 keywords: + patch, easy type: enhancement stage: test needed |
| 2004-11-01 18:05:39 | dyoo | create | |