Issue 13711: html.parser.HTMLParser doesn't parse tags in comments in scripts correctly


This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57920

classification

Title:	html.parser.HTMLParser doesn't parse tags in comments in scripts correctly
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.2

process

Status:	closed	Resolution:	duplicate
Dependencies:	Superseder:	HTMLParser.py - more robust SCRIPT tag parsing (issue 670664)
Assigned To:	ezio.melotti	Nosy List:	ezio.melotti, r.david.murray, turion
Priority:	normal	Keywords:

Created on 2012-01-04 13:26 by turion, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
htmlparserbug.py	turion, 2012-01-04 13:26	Script demonstrating the bug

Messages (8)
msg150603	Author: Manuel Bärenz (turion)	Date: 2012-01-04 13:26
I've attached a script which demonstrates the bug. When the parser is fed a <script> element whose content is wrapped in a comment and itself contains tags (e.g. a 'document.write(<td></td>)'), it calls neither handle_comment nor handle_starttag.
msg150604	Author: Manuel Bärenz (turion)	Date: 2012-01-04 13:38
I forgot to say, I'm using Python version 3.2.2.
msg150605	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2012-01-04 13:55
The content of a script tag is CDATA. Why would you expect it to be parsed?
msg150606	Author: Manuel Bärenz (turion)	Date: 2012-01-04 14:25
Oh, I wasn't aware of that. Then the actual bug is that handle_endtag is called for an end tag inside the script.
msg150607	Author: Manuel Bärenz (turion)	Date: 2012-01-04 14:28
To clarify this even further: Consider
    parser_instance.feed("<script><td></td></script>")
It should call:
    parser_instance.handle_starttag("script", [])
    parser_instance.handle_data("<td></td>")
    parser_instance.handle_endtag("script")
Instead, it calls:
    parser_instance.handle_starttag("script", [])
    parser_instance.handle_data("<td>")
    parser_instance.handle_endtag("td")
    parser_instance.handle_endtag("script")
msg150608	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2012-01-04 14:42
I believe this was fixed recently as part of issue 670664. Ezio will know for sure.
msg150611	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012-01-04 15:02
Yep, this was fixed in #670664. With the development version of Python (AFAIK the fix has not been released yet) and the example parser found in the doc[0] I get this:
>>> parser = MyHTMLParser()
>>> parser.feed('<script><td></td></script>')
Encountered a start tag: script
Encountered some data: <td></td>
Encountered an end tag: script

[0]: http://docs.python.org/dev/library/html.parser.html#example-html-parser-application
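For anyone who wants to check their own interpreter, here is a minimal sketch of a parser that records callbacks instead of printing them. The names RecordingParser, record_events and joined_data are hypothetical helpers made up for this illustration; they are not part of html.parser and not the attached htmlparserbug.py. It assumes a Python version that already includes the fix from #670664.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical helper: stores each callback as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        # handle_endtag receives only the tag name, no attribute list.
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


def record_events(markup):
    """Feed the markup to a fresh parser and return the recorded events."""
    parser = RecordingParser()
    parser.feed(markup)
    parser.close()
    return parser.events


def joined_data(events):
    """Concatenate all data chunks, since data may arrive in several pieces."""
    return "".join(value for kind, value in events if kind == "data")
```

On a fixed interpreter, record_events('<script><td></td></script>') should report one start tag and one end tag for script, with <td></td> delivered only as data. A comment inside the script should likewise show up as data rather than through handle_comment, because the script content is treated as CDATA.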
msg150614	Author: Manuel Bärenz (turion)	Date: 2012-01-04 16:19
Great! Thank you!

History
Date	User	Action	Args
2022-04-11 14:57:25	admin	set	github: 57920
2012-01-04 16:19:16	turion	set	messages: + msg150614
2012-01-04 15:02:17	ezio.melotti	set	status: open -> closed superseder: HTMLParser.py - more robust SCRIPT tag parsing messages: + msg150611 assignee: ezio.melotti resolution: duplicate stage: resolved
2012-01-04 14:42:30	r.david.murray	set	messages: + msg150608
2012-01-04 14:28:47	turion	set	messages: + msg150607
2012-01-04 14:25:35	turion	set	messages: + msg150606
2012-01-04 13:55:44	r.david.murray	set	nosy: + ezio.melotti, r.david.murray messages: + msg150605
2012-01-04 13:38:27	turion	set	messages: + msg150604
2012-01-04 13:26:46	turion	create
