Issue 20288: HTMLParse handing of non-numeric charrefs broken


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64487

classification

Title:	HTMLParse handing of non-numeric charrefs broken
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.3, Python 3.4, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	ezio.melotti	Nosy List:	ezio.melotti, iko, python-dev, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2014-01-17 14:06 by iko, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue20288.diff	ezio.melotti, 2014-02-01 19:13

Messages (5)
msg208336	Author: Anders Hammarquist (iko)	Date: 2014-01-17 14:06
Python 2.7 HTMLParser.py lines 185-199 (similar lines still exist in Python 3.4):

                match = charref.match(rawdata, i)
                if match:
                    ...
                else:
                    if ";" in rawdata[i:]: #bail by consuming &#
                        self.handle_data(rawdata[0:2])
                        i = self.updatepos(i, 2)
                    break

If you feed it a broken charref, that is, a non-numeric one, it will pass whatever string happens to be at the start of rawdata to handle_data(). E.g.:

    import sys
    from HTMLParser import HTMLParser

    p = HTMLParser()
    p.handle_data = lambda x: sys.stdout.write(x)
    p.feed('<p>&#foo;</p>')

will print '<p', which is clearly wrong. I think the intention of the code is to pass '&#', which seems saner.
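
The slicing mistake described in the report can be shown without the parser at all. The following is a minimal sketch, not the library code: the variable names `buggy` and `intended` are made up for illustration, and `i` is simply computed with `str.find` to stand in for the position the parser has reached when it meets the malformed reference.

```python
# Standalone illustration of the slicing mistake; no parser is involved.
rawdata = "<p>&#foo;</p>"

# Stand-in for the parser's current position: the index of the "&#".
i = rawdata.find("&#")

# Slice anchored at the start of the buffer, as in the quoted lines.
buggy = rawdata[0:2]

# Slice anchored at the current position, which is what the comment
# "bail by consuming &#" appears to intend (an assumption based on the report).
intended = rawdata[i:i + 2]

print(repr(buggy))
print(repr(intended))
```

The two slices only coincide when `i` is 0, that is, when the malformed reference sits at the very start of the buffer.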
msg208350	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2014-01-17 18:35
Thanks for the report, this is indeed a bug. This behavior was covered by a test (see Lib/test/test_htmlparser.py:164), but _run_check feeds the chars one by one to the parser, and in that case it works correctly. While feeding the parser a whole chunk I was able to reproduce the bug. This should be fixed, and the behavior of _run_check should probably be changed too -- maybe it could test both the char-by-char and the regular feeding.
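
A check that exercises both feeding styles could look roughly like the sketch below. It is only an illustration under stated assumptions: it targets the Python 3 `html.parser` module rather than the 2.7 `HTMLParser` module, and `DataCollector`, `collect_whole` and `collect_char_by_char` are hypothetical names, not part of the test suite or of `_run_check`. Chunks are joined before comparing because the parser may split the same text across a different number of handle_data() calls depending on how it is fed.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records every string passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_whole(source):
    """Feed the whole document in one call and return the joined text data."""
    parser = DataCollector()
    parser.feed(source)
    parser.close()  # flush anything still buffered
    return "".join(parser.chunks)


def collect_char_by_char(source):
    """Feed the document one character at a time and return the joined text data."""
    parser = DataCollector()
    for ch in source:
        parser.feed(ch)
    parser.close()
    return "".join(parser.chunks)


source = "<p>&#foo;</p>"
whole = collect_whole(source)
stepwise = collect_char_by_char(source)
print(repr(whole), repr(stepwise))
```

On an interpreter that includes the fix, both calls should report the same text; a mismatch between them would point to the kind of chunk-dependent behavior described in this issue.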
msg209911	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2014-02-01 19:13
Here's a patch against 2.7.
msg209914	Author: Roundup Robot (python-dev) (Python triager)	Date: 2014-02-01 19:23
New changeset 0d50b5851f38 by Ezio Melotti in branch '2.7':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/0d50b5851f38

New changeset 32097f193892 by Ezio Melotti in branch '3.3':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/32097f193892

New changeset 92b3928bfde1 by Ezio Melotti in branch 'default':
#20288: merge with 3.3.
http://hg.python.org/cpython/rev/92b3928bfde1
msg211202	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2014-02-14 05:31
This is now fixed, thanks for the report!

> This should be fixed, and the behavior of _run_check should probably be
> changed too -- maybe it could test both the char-by-char and the
> regular feeding.

I created #20623 to track this.

History
Date	User	Action	Args
2022-04-11 14:57:57	admin	set	github: 64487
2014-02-14 05:31:06	ezio.melotti	set	status: open -> closed resolution: fixed messages: + msg211202 stage: needs patch -> resolved
2014-02-01 19:23:11	python-dev	set	nosy: + python-dev messages: + msg209914
2014-02-01 19:13:40	ezio.melotti	set	files: + issue20288.diff keywords: + patch messages: + msg209911
2014-01-17 18:35:24	ezio.melotti	set	versions: + Python 2.7, Python 3.3, Python 3.4 nosy: + r.david.murray messages: + msg208350 stage: needs patch
2014-01-17 14:18:40	ezio.melotti	set	assignee: ezio.melotti nosy: + ezio.melotti
2014-01-17 14:06:13	iko	create
