
This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: HTMLParser handling of non-numeric charrefs broken
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: ezio.melotti, iko, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2014-01-17 14:06 by iko, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
issue20288.diff ezio.melotti, 2014-02-01 19:13
Messages (5)
msg208336 - Author: Anders Hammarquist (iko) Date: 2014-01-17 14:06
Python 2.7 HTMLParser.py, lines 185-199 (similar lines still exist in Python 3.4):
    match = charref.match(rawdata, i)
    if match:
        ...
    else:
        if ";" in rawdata[i:]:  # bail by consuming &#
            self.handle_data(rawdata[0:2])
            i = self.updatepos(i, 2)
        break
If you feed a broken (that is, non-numeric) charref, the parser passes whatever string happens to be at the start of rawdata to handle_data(). E.g.:
    import sys
    from HTMLParser import HTMLParser  # Python 2.7 module name
    p = HTMLParser()
    p.handle_data = lambda x: sys.stdout.write(x)
    p.feed('<p>&#foo;</p>')
will print '<p', which is clearly wrong. I think the intention of the code is to pass '&#', which seems saner.
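A minimal sketch of the slicing problem described above, using a plain string in place of the parser's internal buffer. The index i is assumed to be the position the parser has reached when it meets the broken charref; the variable names are illustrative and nothing here is taken from the actual patch.

```python
# Illustration only: a plain string standing in for the parser's buffer.
rawdata = '<p>&#foo;</p>'

# Assumed position of the parser when it reaches the broken charref.
i = rawdata.index('&#')

# Slicing from the start of the buffer ignores the current position.
consumed_by_bug = rawdata[0:2]

# Slicing relative to the current position yields the two characters at i.
consumed_as_intended = rawdata[i:i + 2]

print(repr(consumed_by_bug))       # '<p'
print(repr(consumed_as_intended))  # '&#'
```

The two slices only coincide when i is 0, which may be why feeding very small pieces of input does not show the problem.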
msg208350 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-01-17 18:35
Thanks for the report, this is indeed a bug.
This behavior was covered by a test (see Lib/test/test_htmlparser.py:164), but _run_check feeds the chars one by one to the parser, and in that case it works correctly. When feeding the parser the whole chunk at once, I was able to reproduce the bug. This should be fixed, and the behavior of _run_check should probably be changed too -- maybe it could test both the char-by-char and the regular feeding.
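A rough sketch of a check that exercises both feeding styles, written for Python 3's html.parser. DataCollector and collect_data are hypothetical names, not part of the test suite and not the real _run_check. On an interpreter that includes the fix, the two results would be expected to match.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records every string passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_data(source, char_by_char=False):
    """Feed source whole or one character at a time; return the joined text data."""
    parser = DataCollector()
    if char_by_char:
        for ch in source:
            parser.feed(ch)
    else:
        parser.feed(source)
    parser.close()  # flush anything still buffered
    return ''.join(parser.chunks)


source = '<p>&#foo;</p>'
whole = collect_data(source)
piecewise = collect_data(source, char_by_char=True)
print(repr(whole), repr(piecewise))
```

Comparing the joined text rather than the individual handle_data() calls is a deliberate simplification, since the parser may split the same text into different pieces depending on how it is fed.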
msg209911 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-02-01 19:13
Here's a patch against 2.7.
msg209914 - Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-01 19:23
New changeset 0d50b5851f38 by Ezio Melotti in branch '2.7':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/0d50b5851f38
New changeset 32097f193892 by Ezio Melotti in branch '3.3':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/32097f193892
New changeset 92b3928bfde1 by Ezio Melotti in branch 'default':
#20288: merge with 3.3.
http://hg.python.org/cpython/rev/92b3928bfde1 
msg211202 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-02-14 05:31
This is now fixed, thanks for the report!
> This should be fixed, and the behavior of _run_check should probably be
> changed too -- maybe it could test both the char-by-char and the
> regular feeding.
I created #20623 to track this.
History
Date User Action Args
2022-04-11 14:57:57 admin set github: 64487
2014-02-14 05:31:06 ezio.melotti set status: open -> closed; resolution: fixed; messages: + msg211202; stage: needs patch -> resolved
2014-02-01 19:23:11 python-dev set nosy: + python-dev; messages: + msg209914
2014-02-01 19:13:40 ezio.melotti set files: + issue20288.diff; keywords: + patch; messages: + msg209911
2014-01-17 18:35:24 ezio.melotti set versions: + Python 2.7, Python 3.3, Python 3.4; nosy: + r.david.murray; messages: + msg208350; stage: needs patch
2014-01-17 14:18:40 ezio.melotti set assignee: ezio.melotti; nosy: + ezio.melotti
2014-01-17 14:06:13 iko create
