Message 208336 - Python tracker

Author	iko
Recipients	iko
Date	2014-01-17.14:06:13
Message-id	<1389967573.45.0.115549710544.issue20288@psf.upfronthosting.co.za>

Content

In Python 2.7, HTMLParser.py lines 185-199 read as follows (similar lines still exist in Python 3.4):

    match = charref.match(rawdata, i)
    if match:
        ...
    else:
        if ";" in rawdata[i:]: #bail by consuming &#
            self.handle_data(rawdata[0:2])
            i = self.updatepos(i, 2)
        break

If you feed the parser a broken charref, that is, a non-numeric one, it passes whatever happens to be at the start of rawdata to handle_data(). For example:

    import sys
    from HTMLParser import HTMLParser

    p = HTMLParser()
    p.handle_data = lambda x: sys.stdout.write(x)
    p.feed('<p>&#foo;</p>')

This prints '<p', which is clearly wrong. I think the intention of the code is to pass '&#' (that is, rawdata[i:i+2]), which seems saner.
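
To make the difference concrete, here is a minimal sketch in Python 3 (the module is html.parser there, not HTMLParser). The first part only compares the two slices on the example input and does not touch the parser. The Collector class and the collect() function are hypothetical helpers, not part of the standard library; they record what a given interpreter actually passes to handle_data(), assuming convert_charrefs=False is the mode that exercises the code path quoted above.

```python
from html.parser import HTMLParser

rawdata = '<p>&#foo;</p>'
i = rawdata.index('&#')        # position of the broken charref in the buffer

reported = rawdata[0:2]        # slice used by the quoted code: start of the buffer
proposed = rawdata[i:i + 2]    # slice taken at the current position instead

print(repr(reported))          # '<p'
print(repr(proposed))          # '&#'


class Collector(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data()."""

    def __init__(self):
        # Assumption: with convert_charrefs=False the parser handles "&#"
        # itself instead of unescaping text in bulk.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(markup):
    """Feed markup to a fresh Collector and return the recorded chunks."""
    parser = Collector()
    parser.feed(markup)
    parser.close()             # flush whatever is still buffered
    return parser.chunks


print(collect('<p>&#foo;</p>'))
```

On an interpreter that still has the behaviour described above, the list printed by the last line would be expected to contain '<p'. On one where the slice is taken at the current position it should not, though the exact chunking may differ between versions.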
