Issue 6191: HTMLParser attribute parsing - 2 test cases when it fails


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50441

classification

Title:	HTMLParser attribute parsing - 2 test cases when it fails
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.2, Python 3.3

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	ezio.melotti, georg.brandl, momat, r.david.murray
Priority:	normal	Keywords:

Created on 2009-06-04 07:46 by momat, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (12)
msg88867	Author: Paweł Widera (momat)	Date: 2009-06-04 07:46
Of course neither is correct HTML, but both are easy to guess, so I believe the parser should not give up too quickly here.
1) extra comma between attributes:
<form action="/xxx.php?a=1&b=2&amp", method="post">
2) missing closing quotation mark for the first attribute:
<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>
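The two inputs reported above can be tried against html.parser directly. This is a minimal sketch, not part of the original report: StartTagCollector and collect_start_tags are hypothetical names, and what gets printed depends on the Python version in use (the parser was strict when this issue was filed and has since become tolerant).

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent
        self.start_tags.append((tag, attrs))


def collect_start_tags(markup):
    """Feed markup to a fresh parser and return the start tags it reported."""
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


# The two malformed inputs from the report
CASE_1 = '<form action="/xxx.php?a=1&b=2&amp", method="post">'
CASE_2 = '<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>'

for markup in (CASE_1, CASE_2):
    print(collect_start_tags(markup))
```

On a current Python 3 both inputs should parse without an exception. For the second input the href value is expected to run up to the next double quote, so it includes " target=".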
msg88899	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2009-06-04 19:13
I do not think HTMLParser should guess. Guessing always opens the door to misinterpretation.
msg88903	Author: Paweł Widera (momat)	Date: 2009-06-04 21:36
It depends whether you want HTMLParser to be a useful tool that can deal with real-world HTML or just a toy without practical meaning. Crashing on every little deviation from the standard, where a more relaxed approach is possible, doesn't sound like a reasonable choice to me. Maybe "guess" is not the right word... If the standard strict approach fails, the parser should fall back to a less strict one in an attempt to actually parse the document. Throwing an exception and giving up is just not good enough. Can we have somebody else comment on this one, please?
msg88906	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2009-06-04 21:42
> Throwing an exception and giving up is just not good enough.
Yes it is, in some cases. There are "forgiving" HTML parsers out there; HTMLParser does not strive to be one. There are so many cases where HTML is a bit malformed that it takes more than just two exceptions to get it right. There is a reason that browsers' parsers are so complex. If you add these corner cases, people will come asking for this exception, and that one, etc.
msg88910	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2009-06-04 21:50
In doing web scraping I started using BeautifulSoup precisely because it was very lenient in what HTML it accepted (I haven't written such an app for a while, so I'm not sure what BeautifulSoup currently does... I thought I heard it was now using HTMLParser...). There are a lot of messed-up web pages out there. I don't have time right now to evaluate your particular cases, but my rule of thumb would be that if the major web browsers do something "reasonable" with these cases, then a Python tool designed to read web pages should do so as well, where possible. ("Be liberal in what you accept, and strict in what you generate.") That said, I'm not sure what HTMLParser's design goals are, so this may not be an appropriate goal for the module.
msg88913	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2009-06-04 22:22
So BeautifulSoup is using HTMLParser? That is interesting, because they claim to support "broken" HTML. In any case, if a "quirky" mode is added, it should have to be turned on explicitly by a flag.
msg89018	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2009-06-06 21:19
BeautifulSoup uses SGMLParser for all versions <3.1. BeautifulSoup 3.1 is supposed to be compatible with Python 3, and since SGMLParser is gone it now uses HTMLParser, but it's not able to handle some things anymore. For more information: http://www.crummy.com/software/BeautifulSoup/3.1-problems.html (FWIW I tried BeautifulSoup 3.1, but it failed where BeautifulSoup 3.0.7 was working, so I went back to 3.0.7)
msg133715	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-04-14 06:07
The first case has been fixed already in 1cbfeffea19f, the second case is not even handled by browsers, so I'm closing this.
msg133731	Author: Paweł Widera (momat)	Date: 2011-04-14 12:30
Great! With one "but"... the second case is handled by browsers. Browsers do not throw an exception on it as HTMLParser does. So improvement is definitely possible here. Whether it is worth the effort is not for me to judge.
msg133732	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-04-14 12:39
So you are suggesting that <a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a> should result in an 'a' element with an href attribute equal to "http://xxx.org/xxx.php?a=1 target=" and then discard _blank" as extra data?
msg134229	Author: Paweł Widera (momat)	Date: 2011-04-21 17:20
No. As the value of the href attribute is not supposed to contain spaces, I'd rather expect the parser to assume that there is a closing " missing before the space.
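The heuristic proposed in this message can be sketched as a pre-processing step. This is only an illustration under stated assumptions: close_missing_quote and its regular expression are hypothetical, not part of HTMLParser or of any patch on this issue, and the rule will misfire on legitimate values that contain whitespace followed by something that looks like name=".

```python
import re

# Assumption: a double-quoted value that runs, without a closing quote,
# into whitespace followed by `name="` was meant to end before the whitespace.
_MISSING_QUOTE = re.compile(r'="([^"\s<>]*)(\s+)([A-Za-z_][\w:.-]*=")')


def close_missing_quote(markup):
    """Hypothetical repair: insert the closing quote before the whitespace."""
    return _MISSING_QUOTE.sub(r'="\1"\2\3', markup)


broken = '<a href="http://xxx.org/xxx.php?a=1 target="_blank">click me</a>'
print(close_missing_quote(broken))
```

Markup whose quotes are already balanced, such as <a href="http://xxx.org/" target="_blank">, is left unchanged by this rule.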
msg135959	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-05-14 06:36
What I described in my previous message is what Firefox does. If you think this should be changed, I suggest you open another issue, possibly attaching a test case with the desired behavior and a patch to change it.

History
Date	User	Action	Args
2022-04-11 14:56:49	admin	set	github: 50441
2011-05-14 06:36:58	ezio.melotti	set	messages: + msg135959
2011-04-21 17:20:41	momat	set	messages: + msg134229
2011-04-14 12:39:55	ezio.melotti	set	messages: + msg133732
2011-04-14 12:30:31	momat	set	messages: + msg133731
2011-04-14 06:07:44	ezio.melotti	set	status: open -> closed resolution: fixed messages: + msg133715 stage: resolved
2011-04-05 18:29:36	ezio.melotti	set	versions: + Python 3.2, Python 3.3, - Python 2.6
2009-06-06 21:20:00	ezio.melotti	set	nosy: + ezio.melotti messages: + msg89018
2009-06-04 22:22:24	georg.brandl	set	resolution: wont fix -> (no value) messages: + msg88913
2009-06-04 21:50:26	r.david.murray	set	status: pending -> open priority: normal nosy: + r.david.murray messages: + msg88910
2009-06-04 21:42:54	georg.brandl	set	status: open -> pending messages: + msg88906
2009-06-04 21:36:32	momat	set	status: closed -> open messages: + msg88903
2009-06-04 19:13:05	georg.brandl	set	status: open -> closed nosy: + georg.brandl messages: + msg88899 resolution: wont fix
2009-06-04 07:46:49	momat	create
