Issue 3932: HTMLParser cannot handle '&' and non-ascii characters in attribute names


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48182

classification

Title:	HTMLParser cannot handle '&' and non-ascii characters in attribute names
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	ezio.melotti	Nosy List:	eric.araujo, ezio.melotti, hodgestar, python-dev, r.david.murray, rhettinger, sergiomb2, wiget, yanne, zchyla
Priority:	normal	Keywords:	patch

Created on 2008-09-22 12:32 by yanne, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
test.py	yanne, 2008-10-03 10:10	Fixed minimal script to produce the error
HTMLParser-unescape-fix.diff	zchyla, 2009-07-30 07:27	HTMLParser.unescape: return str value for str input
issue3932-test.diff	ezio.melotti, 2011-11-07 07:31	Failing test

Messages (10)
msg73571 - (view)	Author: (yanne)	Date: 2008-09-22 12:32
It seems that HTMLParser.feed throws an exception whenever an attribute name contains both an ampersand '&' and non-ASCII characters. Running the attached test file with Python 2.5 succeeds, but with Python 2.6 the result is:

C:\Python26>python.exe test.py
Without & in attribute OK
With & in attribute
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    HP().feed(s)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 249, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "C:\Python26\lib\HTMLParser.py", line 386, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "C:\Python26\lib\re.py", line 150, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I am running: Python 2.6rc2 (r26rc2:66507, Sep 18 2008, 14:27:33) [MSC v.1500 32 bit (Intel)] on win32
msg73908 - (view)	Author: Simon Cross (hodgestar)	Date: 2008-09-27 00:05
I can't reproduce this on current trunk (r66633, 27 Sep 2008). I checked sys.getdefaultencoding(), but that returned 'ascii' as expected, and I even tried running Python with "LANG=C ./python", but that didn't fail either. Perhaps this has been fixed? Judging from the traceback, it might originally have been a problem in the re module.
msg74234 - (view)	Author: (yanne)	Date: 2008-10-03 10:10
It seems that I managed to upload the wrong test file the first time. The attached test should fail; I tested it with Python 2.6 final on both Linux and Windows.
msg74239 - (view)	Author: Simon Cross (hodgestar)	Date: 2008-10-03 11:09
I've tracked down the cause to the .unescape(...) method in HTMLParser. The replaceEntities function passed to re.sub() always returns a unicode character, even when the matched string s is a byte string. Changing line 383 to:

    return self.entitydefs[s].encode("utf-8")

makes the test pass. Unfortunately this is obviously not a viable solution in the general case. The problem is that there is no way to know which character set to encode in without knowing both the HTTP headers (which are not available to HTMLParser) and looking at the XML and HTML headers.

Python 3.0 implicitly rejects non-unicode strings right at the start of html.parser.HTMLParser.feed(...) by adding '' to the data passed in. Given Python 3.0's behaviour, the docs should perhaps be updated to say HTMLParser does not support non-unicode strings? If it should support byte strings, we'll have to figure out how to handle encoded entity issues.

It's a bit weird that character and entity references outside tags/attributes result in calls to .entityref(...) and .charref(...) while those inside attributes get unescape called on them automatically. I don't really see what can be done about that though.
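
The two behaviours described above can be observed in Python 3's html.parser. The following is only a sketch, not code from the issue or its patches: the EventLogger class and the sample markup are hypothetical, and convert_charrefs=False is passed so that references in text are reported as separate callbacks (the handle_entityref and handle_charref methods) rather than being converted.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records the callbacks the parser makes."""

    def __init__(self):
        # convert_charrefs=False keeps references in text as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed('<a title="a &amp; b">x &amp; &#233;</a>')
logger.close()

# References inside the attribute value arrive already unescaped, while the
# ones in the text are reported through handle_entityref / handle_charref.
for event in logger.events:
    print(event)

# Feeding bytes fails immediately: feed() concatenates the data onto a str
# buffer, so no encoding is ever guessed.
try:
    EventLogger().feed(b"<a title='x'>")
    bytes_rejected = False
except TypeError:
    bytes_rejected = True
print("bytes rejected:", bytes_rejected)
```

With the default convert_charrefs=True, references in text are converted as well, so the asymmetry only shows up when that option is turned off.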
msg91084 - (view)	Author: Zbigniew Chyla (zchyla)	Date: 2009-07-30 07:27
Since `HTMLParser.unescape` in 2.5 returns `str` for `str` input, 2.6 should remain compatible. Therefore I propose the attached patch (`HTMLParser-unescape-fix.diff`). With this patch applied the result will have the same type as the input.
msg96320 - (view)	Author: Sérgio (sergiomb2)	Date: 2009-12-13 04:43
Does the patch fix parsing of a simple <a> tag whose title contains <br> and accented characters, like this: <a href="8999.html" title="<br>país"> ?
msg147189 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-11-06 22:06
I'm not sure what the best solution is here. unescape uses a regex with replaceEntities as a callback to replace the entities in attribute values. The problem is that replaceEntities currently returns unicode, and if unescape receives a str, an automatic coercion to unicode happens and an error is raised whenever the str is non-ascii.

The possible solutions are:
1) Document the status quo (i.e. replaceEntities always returns unicode, and an error is raised whenever a string that contains non-ascii chars is passed);
2) Change replaceEntities to return str only for ascii chars (as the patch proposed by Zbigniew does). This works as long as the entity resolves to an ascii character, but keeps failing for the other cases.

The first option is cleaner, and means that if you want to parse something you should always use unicode, otherwise it might fail (In case of ambiguity, refuse the temptation to guess). The second option might allow you to parse a few more documents without converting them to unicode, but only if you are lucky (i.e. you don't get any unicode mixed with non-ascii str). If most of the entities in attributes resolve to ascii (e.g. &quot; &amp; &apos; &gt; &lt;), it might be more practical to return str and avoid unnecessary errors, while still adding a note in the documentation that passing unicode is better.
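
The coercion failure described above can be made visible in Python 3, where the decoding has to be written out explicitly. This is an illustrative sketch only, under the assumption that the byte string is UTF-8 encoded. It does not run the Python 2 code path from the traceback; the variable names and sample text are made up.

```python
import re

# A non-ASCII byte string, comparable to a Python 2 str holding UTF-8 data.
value = "país".encode("utf-8")

# Python 2 decoded byte strings with the default 'ascii' codec whenever they
# were combined with unicode. Doing the same decode explicitly raises the
# same kind of error as in the original report.
try:
    value.decode("ascii")
    coercion_error = None
except UnicodeDecodeError as exc:
    coercion_error = exc
print(coercion_error)

# Python 3 refuses to mix the two types at all instead of guessing: a
# callback that returns str cannot be used with a bytes pattern and input.
try:
    re.sub(rb"&amp;", lambda match: "&", b"a &amp; " + value)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print("mixing bytes and str allowed:", mixed_ok)
```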
msg148123 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2011-11-22 15:30
+1 on refusing the temptation to guess and to be half-working for some cases by accident.
msg148544 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-11-29 07:19
I'll change this in a doc issue then. Any suggestions about the wording? Adding "Passing unicode strings is suggested/advised/preferred." in the .feed() section is a bit vague, and mentioning the problem (with str it might break in some corner cases) while keeping a positive tone is somewhat difficult.
msg149817 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2011-12-19 05:17
New changeset 978f45013c34 by Ezio Melotti in branch '2.7': #3932: suggest passing unicode to HTMLParser.feed(). http://hg.python.org/cpython/rev/978f45013c34
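
The advice in that changeset, passing unicode to feed(), amounts to decoding the input before parsing. A minimal sketch in Python 3 follows, assuming the document's encoding is already known to be UTF-8. The AttrCollector class and the sample markup are hypothetical and are not taken from the attached test.py.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that gathers every attribute it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


# Raw bytes as they might arrive from a file or socket.
raw = '<a href="8999.html" title="caf&eacute; &amp; país">'.encode("utf-8")

# Decode first (the encoding is an assumption here), then feed text.
text = raw.decode("utf-8")

collector = AttrCollector()
collector.feed(text)
collector.close()
print(collector.attrs)
```

Decoding up front keeps the choice of encoding with the caller, who can consult the HTTP headers or the document's own declarations, rather than leaving the parser to guess.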

History
Date	User	Action	Args
2022-04-11 14:56:39	admin	set	github: 48182
2012-03-13 00:11:30	ezio.melotti	link	issue14251 superseder
2011-12-19 05:20:04	ezio.melotti	set	status: open -> closed; type: behavior -> enhancement; resolution: fixed; stage: needs patch -> resolved
2011-12-19 05:17:29	python-dev	set	nosy: + python-dev; messages: + msg149817
2011-11-29 07:19:48	ezio.melotti	set	nosy: + rhettinger; messages: + msg148544; components: + Documentation, - Library (Lib)
2011-11-22 15:30:44	eric.araujo	set	messages: + msg148123
2011-11-14 12:43:57	ezio.melotti	set	assignee: ezio.melotti
2011-11-07 07:31:20	ezio.melotti	set	files: + issue3932-test.diff; stage: needs patch
2011-11-06 22:45:30	ezio.melotti	set	nosy: + eric.araujo
2011-11-06 22:06:55	ezio.melotti	set	versions: - Python 2.6; nosy: + r.david.murray, ezio.melotti; messages: + msg147189; type: behavior
2009-12-13 04:43:54	sergiomb2	set	nosy: + sergiomb2; messages: + msg96320
2009-07-30 07:31:15	wiget	set	nosy: + wiget
2009-07-30 07:27:52	zchyla	set	files: + HTMLParser-unescape-fix.diff; nosy: + zchyla; messages: + msg91084; keywords: + patch
2008-10-03 11:09:15	hodgestar	set	messages: + msg74239; versions: + Python 2.7
2008-10-03 10:10:16	yanne	set	files: + test.py; messages: + msg74234
2008-10-03 10:08:19	yanne	set	files: - test.py
2008-09-27 00:05:03	hodgestar	set	nosy: + hodgestar; messages: + msg73908
2008-09-22 12:32:10	yanne	create
