Created on 2014-01-17 14:06 by iko, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files
| File name | Uploaded | Description |
|---|---|---|
| issue20288.diff | ezio.melotti, 2014-02-01 19:13 | |
Messages (5)
| msg208336 | Author: Anders Hammarquist (iko) | Date: 2014-01-17 14:06 |
Python 2.7 HTMLParser.py lines 185-199 (similar lines still exist in Python 3.4):
    match = charref.match(rawdata, i)
    if match:
        ...
    else:
        if ";" in rawdata[i:]: #bail by consuming &#
            self.handle_data(rawdata[0:2])
            i = self.updatepos(i, 2)
        break
If you feed the parser a broken charref, that is, a non-numeric one, it passes whatever string happens to be at the start of rawdata to handle_data(). E.g.:
    import sys
    from HTMLParser import HTMLParser  # Python 2 module name
    p = HTMLParser()
    p.handle_data = lambda x: sys.stdout.write(x)
    p.feed('<p>&#foo;</p>')
This prints '<p', which is clearly wrong. I think the intention of the code is to pass '&#', which seems saner.
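To make the slicing issue concrete, here is a minimal sketch that does not use HTMLParser at all. It only contrasts the slice taken by the quoted code with the slice it presumably meant to take. The names `buggy` and `intended` are made up for illustration, and `i` stands in for the parser's current position at the broken charref.

```python
rawdata = '<p>&#foo;</p>'

# Position of the broken charref; in the parser this is the running index i.
i = rawdata.index('&#')

# The quoted code slices from the start of the buffer, ignoring i.
buggy = rawdata[0:2]

# Slicing relative to i yields the two characters the comment talks about.
intended = rawdata[i:i + 2]

print(repr(buggy), repr(intended))
```

The two slices only coincide when the charref sits at the very start of the buffer, which would explain why the problem does not show up in every feeding pattern.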
| msg208350 | Author: Ezio Melotti (ezio.melotti) (Python committer) | Date: 2014-01-17 18:35 |
Thanks for the report, this is indeed a bug. This behavior was covered by a test (see Lib/test/test_htmlparser.py:164), but _run_check feeds the chars one by one to the parser, and in that case it works correctly. When I fed the parser the whole chunk at once, I was able to reproduce the bug. This should be fixed, and the behavior of _run_check should probably be changed too -- maybe it could test both the char-by-char and the regular feeding.
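The idea of checking both feeding modes could look roughly like the sketch below. It is written against Python 3's `html.parser` with default settings, not against the 2.7 test suite, and it is not the actual `_run_check` implementation. `EventCollector` and `collect_events` are hypothetical names introduced here. Adjacent data events are merged so that chunk boundaries do not affect the comparison.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parser callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        # Merge adjacent data events so that only the content is compared,
        # not how the parser happened to split it.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


def collect_events(source, char_by_char=False):
    """Feed source either in one chunk or one character at a time."""
    parser = EventCollector()
    if char_by_char:
        for ch in source:
            parser.feed(ch)
    else:
        parser.feed(source)
    parser.close()
    return parser.events


whole = collect_events("<p>&#foo;</p>")
piecewise = collect_events("<p>&#foo;</p>", char_by_char=True)
print(whole == piecewise)
```

A test helper built this way would flag any input where the two modes disagree, which is the kind of discrepancy that let this bug slip past the existing test.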
| msg209911 | Author: Ezio Melotti (ezio.melotti) (Python committer) | Date: 2014-02-01 19:13 |
Here's a patch against 2.7.
| msg209914 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014-02-01 19:23 |
New changeset 0d50b5851f38 by Ezio Melotti in branch '2.7':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/0d50b5851f38

New changeset 32097f193892 by Ezio Melotti in branch '3.3':
#20288: fix handling of invalid numeric charrefs in HTMLParser.
http://hg.python.org/cpython/rev/32097f193892

New changeset 92b3928bfde1 by Ezio Melotti in branch 'default':
#20288: merge with 3.3.
http://hg.python.org/cpython/rev/92b3928bfde1
| msg211202 | Author: Ezio Melotti (ezio.melotti) (Python committer) | Date: 2014-02-14 05:31 |
This is now fixed, thanks for the report!

> This should be fixed, and the behavior of _run_check should probably be
> changed too -- maybe it could test both the char-by-char and the
> regular feeding.

I created #20623 to track this.
History
| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:57:57 | admin | set | github: 64487 |
| 2014-02-14 05:31:06 | ezio.melotti | set | status: open -> closed; resolution: fixed; messages: + msg211202; stage: needs patch -> resolved |
| 2014-02-01 19:23:11 | python-dev | set | nosy: + python-dev; messages: + msg209914 |
| 2014-02-01 19:13:40 | ezio.melotti | set | files: + issue20288.diff; keywords: + patch; messages: + msg209911 |
| 2014-01-17 18:35:24 | ezio.melotti | set | versions: + Python 2.7, Python 3.3, Python 3.4; nosy: + r.david.murray; messages: + msg208350; stage: needs patch |
| 2014-01-17 14:18:40 | ezio.melotti | set | assignee: ezio.melotti; nosy: + ezio.melotti |
| 2014-01-17 14:06:13 | iko | create | |