Created on 2009-11-12 16:25 by pluskid, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | ||
|---|---|---|
| File name | Uploaded | Description |
| issue7311.diff | ezio.melotti, 2011-03-26 10:17 | Patch to allow non-ASCII letters in attribute values (2.7) |
| issue7311-2.diff | ezio.melotti, 2011-04-03 17:30 | Patch that follows the HTML5 specification for attr values (2.7) |
| issue7311-3.diff | ezio.melotti, 2011-04-05 18:51 | Patch that follows the HTML5 specification for attr values (3.2) |
| Messages (19) | |||
|---|---|---|---|
| msg95162 | Author: Chiyuan Zhang (pluskid) | Date: 2009-11-12 16:25 | |
Hi all, I'm using BeautifulSoup to parse an HTML page and found that it refused to parse the page. Looking at the backtrace, I found that the problem is in Python's built-in HTMLParser.py. The web page I'm parsing contains some Chinese characters. There is a tag like <img src=/foo/bar.png alt=中文>; note that this is a legacy HTML page where the attributes are not quoted. However, the regexp defined in HTMLParser.py is:
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
Note that Chinese characters (and any other non-English characters) are not matched by it, so the parser raises an error on this tag. I'm not sure whether the HTML standard allows unquoted non-ASCII characters in attributes. If it does, this seems to be a bug, and the regexp would better be [^>\s] IMHO.
BTW: it seems that something like
<script> var st = "<a></"; </script>
cannot be parsed. :-/ |
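The failure described in the report can be reproduced with the quoted pattern alone. The sketch below assumes the regex is exactly as pasted above; the helper name `unquoted_value` is made up for illustration and is not part of HTMLParser.py.

```python
import re

# The attrfind pattern as quoted in the report.
attrfind_old = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')


def unquoted_value(pattern, attr_text):
    """Hypothetical helper: return the value captured for one attribute."""
    m = pattern.match(attr_text)
    return m.group(3) if m else None


# An ASCII-only unquoted value is captured in full.
ascii_value = unquoted_value(attrfind_old, ' src=/foo/bar.png')

# A non-ASCII value is not in the character class, so the value group
# matches the empty string and the Chinese text is left unconsumed.
cjk_value = unquoted_value(attrfind_old, ' alt=中文')

print(repr(ascii_value), repr(cjk_value))
```

The leftover text after the empty match is what the parser then trips over when it expects the end of the tag.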
|||
| msg95527 | Author: Glenn Linderman (v+python) * | Date: 2009-11-20 02:59 | |
Re: the BTW: < and > should be entity-escaped when used inside attribute values (though they are probably seldom found there). The example you showed, however, is not an attribute in a tag but text within a paired tag. Your suggestion for the regexp seems correct to me, if non-ASCII characters are permitted in unquoted attribute values. |
|||
| msg95529 | Author: Chiyuan Zhang (pluskid) | Date: 2009-11-20 04:37 | |
Re: Yes. In fact, the BTW is a separate problem from this bug, and it seems more complicated to fix. |
|||
| msg132223 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-03-26 10:17 | |
The attached patch changes the regex to allow non-ASCII letters in attribute values (using \w with the re.UNICODE flag instead of [a-zA-Z0-9_]). Using [^>\s] (or even [^> ]) might be OK too, since that's what browsers seem to use (e.g. Firefox and Chrome show "テ<ス☃ト -d-fg" as title of '<a href="" title=テ<ス☃ト -d-fg href="">foo</a>', including the non-ASCII spaces in the middle). |
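The two options mentioned here behave differently on the example title. This is a rough sketch only: `word_class` is an assumed approximation of a \w-based class, not the regex from the attached patch, and it runs on Python 3, where \w is Unicode-aware for str patterns without needing re.UNICODE.

```python
import re

# Assumed approximation: the old character class with \w in place of
# a-zA-Z0-9_ (the real patch may differ).
word_class = re.compile(r'[-\w./,:;+*%?!&$()#=~@]*')

# The more permissive alternative suggested in the report.
permissive = re.compile(r'[^>\s]*')

value = 'テ<ス☃ト'

# \w accepts the katakana letter but stops at '<'.
word_match = word_class.match(value).group()

# [^>\s] accepts everything up to the first '>' or whitespace character.
permissive_match = permissive.match(value).group()

print(repr(word_match), repr(permissive_match))
```

The \w variant fixes letters in other scripts but still rejects symbols such as '<' and '☃', which browsers appear to accept in unquoted values.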
|||
| msg132321 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-03-27 13:57 | |
The HTML 4.01 specification says [0]:
"""
In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
"""
The HTML 5 draft says[1]:
"""
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
"""
So maybe [^>\s] is a little too permissive here.
[0]: http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
[1]: http://dev.w3.org/html5/spec/Overview.html#attributes-0
|
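The two quoted rules can be written as character classes for comparison. This is a sketch rather than the stdlib's implementation: the names `html4_unquoted`, `html5_unquoted`, and `allowed` are invented here, and the HTML5 class assumes the draft's "space characters" are space, tab, LF, FF, and CR.

```python
import re

# HTML 4.01: letters, digits, hyphens, periods, underscores and colons only.
html4_unquoted = re.compile(r'[A-Za-z0-9._:-]+')

# HTML5 draft: anything except space characters, ", ', =, <, > and `,
# and the value must not be empty (hence + rather than *).
html5_unquoted = re.compile(r'[^ \t\n\f\r"\'=<>`]+')


def allowed(pattern, value):
    """Hypothetical helper: True if the whole value fits the rule."""
    return pattern.fullmatch(value) is not None


for candidate in ('bar.png', '/foo/bar.png', '中文', 'a=b', ''):
    print(repr(candidate),
          allowed(html4_unquoted, candidate),
          allowed(html5_unquoted, candidate))
```

Under these assumptions HTML 4.01 rejects both the path and the Chinese value from the original report, while the HTML5 rule accepts both and still rejects values containing '=' or '>'.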
|||
| msg132864 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-03 17:30 | |
Here's a patch that matches unquoted attribute values according to the HTML5 specifications. The regex uses \s even if this includes the \v char that, according to the HTML5 specs, shouldn't be included. I left it there for simplicity and backward-compatibility, and also because it's a rather obscure corner case. |
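The \v corner case can be seen by comparing \s with an explicit list of space characters. Both patterns below are assumed approximations written for this illustration, not the regex from the patch.

```python
import re

value = 'a\vb'  # contains U+000B (vertical tab)

# \s treats \v as whitespace, so the match stops before it.
with_backslash_s = re.match(r'[^\s"\'=<>`]+', value).group()

# Listing space, tab, LF, FF and CR explicitly lets \v through.
with_explicit_spaces = re.match(r'[^ \t\n\f\r"\'=<>`]+', value).group()

print(repr(with_backslash_s), repr(with_explicit_spaces))
```

The difference only shows up for values containing a vertical tab, which supports treating it as an obscure case.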
|||
| msg133055 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011-04-05 17:41 | |
New changeset 7d4dea76c476 by Ezio Melotti in branch '2.7': #7311: fix HTMLParser to accept non-ASCII attribute values. http://hg.python.org/cpython/rev/7d4dea76c476 |
|||
| msg133075 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-05 18:51 | |
With 3.2 the situation is more complicated because there is a strict and a non-strict mode. The strict mode uses:
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
and the tolerant mode uses:
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')
This means that the strict mode doesn't allow valid non-ASCII chars, and that the tolerant mode is a little too permissive. The attached patch changes the strict regex to be more permissive and leaves the tolerant regex unchanged. The differences between the two are now so small that the tolerant version could be removed, except that re.search is used instead of re.match when the tolerant regex is used. |
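A side-by-side run of the two patterns shows both points made here: the different value classes, and the effect of re.search versus re.match. The patterns are the ones quoted in the message, assembled from shared pieces; the constant names `NAME_AND_EQUALS` and `QUOTED` are made up for this sketch.

```python
import re

# Shared prefix of both patterns (hypothetical constant names).
NAME_AND_EQUALS = r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
QUOTED = r'(\'[^\']*\'|"[^"]*"|'

attrfind = re.compile(
    NAME_AND_EQUALS + QUOTED + r'[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
attrfind_tolerant = re.compile(
    NAME_AND_EQUALS + QUOTED + r'[^>\s]*))?')

# 1. Value class: strict captures nothing, tolerant captures the value.
strict_value = attrfind.match(' alt=中文').group(3)
tolerant_value = attrfind_tolerant.match(' alt=中文').group(3)

# 2. match() must succeed at the start of the text; search() may skip
#    leading junk before the attribute.
junk = '!! alt=中文'
matched = attrfind_tolerant.match(junk)
searched = attrfind_tolerant.search(junk)

print(repr(strict_value), repr(tolerant_value))
print(matched, searched.group(1), searched.group(3))
```

Even with identical patterns, swapping re.match for re.search changes which inputs are accepted, so removing the tolerant regex would not by itself remove the difference between the two modes.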
|||
| msg133083 | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2011-04-05 19:38 | |
The goal of tolerant mode is to accept anything a typical browser would accept. I suspect that means the tolerant regex should stay, but I don't remember the details. As for strict mode... as far as I know the current module follows 4.01, not 5. I'm not sure what should be done about that. |
|||
| msg133096 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-05 22:37 | |
I don't see many use cases for the strict mode. It is not strict enough to be used for validation, and while parsing HTML I can't think of any other case where I would want an exception raised (as long as what is parsed by the tolerant mode is a superset of what is parsed by the strict mode). If the parser is still able to parse what it was parsing before, I wouldn't worry too much about backward compatibility, because I can't imagine a valid use case where people would want the parser to fail (maybe someone else can?). |
|||
| msg133136 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2011-04-06 14:31 | |
I think the stdlib should comply with HTML 4.01, and in the future HTML 5. (FTR, I don’t think XHTML is useful, and deny that XHTML-compatible HTML exists. See http://bugs.python.org/issue11567#msg131509 :) |
|||
| msg133145 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-06 16:27 | |
I would agree if the HTMLParser was compliant with the HTML 4.01 specs, but since it's more permissive and uses its own heuristic to determine what should be parsed and what shouldn't, I think it's better to use already existing heuristics (either the HTML5 ones or the ones used by the browsers). I.e., I'm not trying to make it HTML5 compliant, just to make it work with what works on the browsers. |
|||
| msg133146 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2011-04-06 16:32 | |
Okay, sounds good. |
|||
| msg133149 | Author: Senthil Kumaran (orsenthil) * (Python committer) | Date: 2011-04-06 17:27 | |
We need not base changes to html/parser.py on the HTML5 spec, but rather on the requirements of the parsers which may rely on this library. For example, the tolerant mode was brought in by issue1486713 for practical reasons, and it was seen as useful for parsers. I don't know how common leaving out quotes for attributes is, but I think it can become really confusing to parsers (custom parsers). If we had not supported unquoted attributes, I think it would still be okay not to support them unless presented with a very concrete bug (such as the HTML 4.01 spec allowing them, which as far as I can see it does not). The patch which added support for non-ASCII characters is fine. |
|||
| msg133174 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-06 22:17 | |
So is the issue7311-3.diff patch fine? It changes the strict regex to match the 2.7 one, and leaves the tolerant one unchanged (even if the two regexes are now really close). |
|||
| msg133185 | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2011-04-07 00:39 | |
Sounds fine to me. |
|||
| msg133188 | Author: Senthil Kumaran (orsenthil) * (Python committer) | Date: 2011-04-07 01:27 | |
> So is the issue7311-3.diff patch fine?
Just that it allows unquoted attrs for Unicode too. My previous suggestion was not to allow unquoted attribute values, but as the change has already been made in 2.7 and the discussion pointed out a portion of the 4.01 spec which allows unquoted attrs for ASCII, it seems fine. html/parser.py will be a bit more permissive than what the spec says.
> It changes the strict regex to match the 2.7 one, and leave the tolerant one unchanged.
That is fine. |
|||
| msg133189 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011-04-07 02:03 | |
On 3.2 the patch changes only the range of chars matched by the regex when the attribute value doesn't have quotes and strict=True. The parser already allowed unquoted attribute values even before the patch (in both strict and tolerant mode), but used an explicit list of allowed chars that was limited to the ASCII range. |
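For reference, the tag from the original report can be fed to html.parser to see how the attributes are reported. This is a minimal sketch assuming a recent Python 3 interpreter; the subclass name `AttrCollector` and its `seen` list are made up for illustration, while `HTMLParser`, `feed`, `close`, and `handle_starttag` are the standard API.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical subclass that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        self.seen.append((tag, attrs))


parser = AttrCollector()
parser.feed('<img src=/foo/bar.png alt=中文>')
parser.close()

print(parser.seen)
```

Both unquoted values, the ASCII path and the Chinese text, should come through as ordinary attribute values.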
|||
| msg133247 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2011-04-07 19:27 | |
New changeset 225400cb6e84 by Ezio Melotti in branch '3.2': #7311: fix html.parser to accept non-ASCII attribute values. http://hg.python.org/cpython/rev/225400cb6e84
New changeset a1dea7cde58f by Ezio Melotti in branch 'default': #7311: merge with 3.2. http://hg.python.org/cpython/rev/a1dea7cde58f |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:54 | admin | set | github: 51560 |
| 2011-04-07 19:29:03 | ezio.melotti | set | status: open -> closed; resolution: fixed; stage: commit review -> resolved |
| 2011-04-07 19:27:23 | python-dev | set | messages: + msg133247 |
| 2011-04-07 02:03:24 | ezio.melotti | set | messages: + msg133189 |
| 2011-04-07 01:27:54 | orsenthil | set | messages: + msg133188 |
| 2011-04-07 00:39:02 | r.david.murray | set | messages: + msg133185 |
| 2011-04-06 22:17:12 | ezio.melotti | set | messages: + msg133174 |
| 2011-04-06 17:27:57 | orsenthil | set | messages: + msg133149 |
| 2011-04-06 16:32:23 | eric.araujo | set | messages: + msg133146 |
| 2011-04-06 16:27:21 | ezio.melotti | set | messages: + msg133145 |
| 2011-04-06 14:31:18 | eric.araujo | set | messages: + msg133136 |
| 2011-04-05 22:37:11 | ezio.melotti | set | messages: + msg133096; stage: test needed -> commit review |
| 2011-04-05 19:38:50 | r.david.murray | set | nosy: + orsenthil; messages: + msg133083 |
| 2011-04-05 18:51:55 | ezio.melotti | set | files: + issue7311-3.diff; nosy: + r.david.murray; messages: + msg133075 |
| 2011-04-05 17:41:03 | python-dev | set | nosy: + python-dev; messages: + msg133055 |
| 2011-04-03 17:30:53 | ezio.melotti | set | files: + issue7311-2.diff; messages: + msg132864; versions: - Python 3.1 |
| 2011-04-03 15:17:44 | ezio.melotti | set | assignee: ezio.melotti |
| 2011-03-27 13:57:21 | ezio.melotti | set | messages: + msg132321 |
| 2011-03-26 10:17:23 | ezio.melotti | set | files: + issue7311.diff; nosy: + belopolsky; messages: + msg132223; keywords: + patch |
| 2011-03-21 00:49:02 | eric.araujo | set | nosy: + eric.araujo; versions: + Python 3.1, Python 3.2, Python 3.3, - Python 2.6 |
| 2009-11-20 04:37:54 | pluskid | set | messages: + msg95529 |
| 2009-11-20 02:59:28 | v+python | set | nosy: + v+python; messages: + msg95527 |
| 2009-11-14 01:23:19 | ezio.melotti | set | priority: normal; nosy: + ezio.melotti; versions: + Python 2.7; stage: test needed |
| 2009-11-13 14:26:02 | fdrake | set | nosy: + fdrake |
| 2009-11-12 16:25:42 | pluskid | create | |