Created on 2008-05-20 05:43 by thomaspinckney3, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files

| File name | Uploaded by | Uploaded on |
|---|---|---|
| unescape.diff | thomaspinckney3 | 2008-05-20 05:43 |
| unescape.diff | thomaspinckney3 | 2010-11-09 00:25 |
| issue2927.diff | ezio.melotti | 2013-11-18 09:23 |
| issue2927-2.diff | ezio.melotti | 2013-11-19 12:31 |
| issue2927-3.diff | ezio.melotti | 2013-11-19 17:41 |
| Messages (14) | |||
|---|---|---|---|
| msg67102 | Author: Tom Pinckney (thomaspinckney3) * | Date: 2008-05-20 05:43 | |
There is currently a private method inside html.parser.HTMLParser to unescape HTML &...; style escapes. It would be useful to expose this for other users who want to unescape a piece of HTML. Additionally, many websites don't use proper Unicode or ISO-8859-1 encodings and accidentally use Microsoft Code Page 1252 extensions. I added code to map these to their appropriate Unicode values. The unescaping logic was slightly simplified too. This is my first Python patch submission, so please let me know if I've done anything wrong. A new test case was also added for this functionality. |
|||
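To illustrate the Code Page 1252 problem described in msg67102: pages authored in cp1252 often emit numeric references such as `&#146;`, whose value falls in the C1 control range (0x80 to 0x9F) instead of on the intended punctuation mark. The following is a minimal sketch of one way such values could be remapped. `remap_cp1252_charref` is a hypothetical helper written for this note, not code taken from the attached patch.

```python
def remap_cp1252_charref(codepoint):
    """Map the value of a numeric character reference to a character.

    Hypothetical helper for illustration only. Values in the C1 control
    range (0x80-0x9F) are reinterpreted as Windows-1252 bytes; every
    other value is treated as an ordinary Unicode code point.
    """
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (for example 0x81) are unassigned in Python's
            # cp1252 codec, so fall back to the literal code point.
            return chr(codepoint)
    return chr(codepoint)


# 0x92 is a right single quotation mark in cp1252.
print(repr(remap_cp1252_charref(0x92)))
```

Relying on the codec rather than a hand-written table keeps the sketch short, at the cost of handling the unassigned bytes with an explicit fallback.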
| msg67103 | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008-05-20 05:47 | |
The plan is to add html.escape(). Adding html.unescape() wouldn't hurt. |
|||
| msg110636 | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-07-18 11:46 | |
Trying to run the test and I get:

    c:\py3k\Lib>..\PCbuild\python_d.exe test\test_htmlparser.py
      File "test\test_htmlparser.py", line 326
        escaped = u"<p>There’s the Côte</p>"
                                           ^
    SyntaxError: invalid syntax
|
|||
| msg110657 | Author: Reid Kleckner (rnk) (Python committer) | Date: 2010-07-18 15:38 | |
It's using the old Python 2 unicode string literal syntax. It also doesn't keep to 80 cols. I'd also rather continue using a lazily initialized dict instead of catching a KeyError for &apos;. I also feel that with the changes to Unicode in py3k, the cp1252 stuff won't work as desired and should be cut.

Is anyone still interested in html.unescape or html.escape anyway? Every web framework seems to have its own support routines already. Otherwise I'd recommend close -> wontfix. |
|||
| msg111461 | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-07-24 11:48 | |
msg110657 recommends close -> wontfix. Does anybody want this kept open or can it be closed? |
|||
| msg111566 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-07-25 22:30 | |
I'm not sure that using a hardcoded CP1252 => Unicode mapping is a good idea. |
|||
| msg120269 | Author: Tom Pinckney (thomaspinckney3) * | Date: 2010-11-02 22:32 | |
I don't think Django includes an HTML unescape. I'm not familiar with other frameworks. So I'd still find this useful to include in the stdlib. |
|||
| msg120826 | Author: Tom Pinckney (thomaspinckney3) * | Date: 2010-11-09 00:25 | |
New patch attached, tested against Python 3.2. This is my first Python patch so apologies if I've done something wrong here. Feedback appreciated!

Changes:

* fit everything to 80 cols
* just made changes to the HTMLParser.unescape function instead of providing a standalone unescape function
* fixed the string literals in the test case so they work in Python 3k
* left the cp1252 hacks in there since it looks like they still work, but if there's a problem with them let me know. In practice I have to do this at work in order to make unescaping actual web pages work. |
|||
| msg203341 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013-11-19 07:39 | |
I added comments on Rietveld. One more thing: for now the html module is very simple and has no dependencies. The patch adds imports of re and html.entities, plus relatively heavy re.compile operations. Because the html module is implicitly imported whenever any of the html submodules is imported, this can affect code that doesn't use unescape(). However, a cure for this problem (lazy import and initialization) may be worse than the problem itself, so perhaps we should live with it. |
|||
| msg203369 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2013-11-19 12:31 | |
Here's an updated patch that addresses comments on Rietveld and adds a few more tests and docs. I should also update the What's New, but I have other upcoming changes in the html package so I'll probably do it at the end.

Regarding your concern:

* if people are only using html.escape, then they will get a couple of extra imports, including all the html5 entities, and a re.compile;
* if people are using html.parser, they already have plenty of re.compiles there, and soon html.parser will use unescape too;
* if people are using html.entities, they only get an extra re.compile.

Overall I don't think it's a big problem.

As a side note, the "if '&' in s:" check in the unescape function could be removed; I'm not sure it brings any real advantage. This could/should be proved by benchmarks. |
|||
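The benchmark suggested at the end of msg203369 could be approached along the following lines. This is only a sketch under simplifying assumptions: the pattern covers decimal references only, and `unescape_guarded`, `unescape_unguarded` and `compare` are hypothetical names, not functions from the patch. Timings depend on the machine and input, so no results are claimed here.

```python
import re
import timeit

# Simplified pattern for this sketch: decimal references only.
_charref = re.compile(r"&#([0-9]+);")


def _replace(match):
    return chr(int(match.group(1)))


def unescape_guarded(s):
    """Variant with the early "if '&' in s" style check."""
    if "&" not in s:
        return s
    return _charref.sub(_replace, s)


def unescape_unguarded(s):
    """Variant that always runs the regular expression."""
    return _charref.sub(_replace, s)


def compare(sample, number=10_000):
    """Return (guarded_seconds, unguarded_seconds) for `number` calls each."""
    guarded = timeit.timeit(lambda: unescape_guarded(sample), number=number)
    unguarded = timeit.timeit(lambda: unescape_unguarded(sample), number=number)
    return guarded, unguarded


if __name__ == "__main__":
    samples = [
        ("no ampersand", "plain text " * 100),
        ("with references", "caf&#233; " * 100),
    ]
    for label, sample in samples:
        guarded, unguarded = compare(sample)
        print(f"{label}: guarded={guarded:.4f}s unguarded={unguarded:.4f}s")
```

Measuring both an input without any ampersand and one with many references matters, because the check can only help in the first case and adds a small cost in the second.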
| msg203403 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2013-11-19 17:41 | |
Here is the last iteration with a few minor tweaks and a couple more tests. |
|||
| msg203410 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013-11-19 18:27 | |
LGTM. |
|||
| msg203413 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2013-11-19 18:29 | |
New changeset 7b9235852b3b by Ezio Melotti in branch 'default': #2927: Added the unescape() function to the html module. http://hg.python.org/cpython/rev/7b9235852b3b |
|||
| msg203415 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2013-11-19 18:51 | |
Fixed, thanks for the reviews! |
|||
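For reference, the function added by this issue is available as `html.unescape()` from Python 3.4 onward. A short usage sketch follows; the sample strings are made up for illustration.

```python
import html

# Named references are replaced in a single pass, so the "&amp;" below
# becomes a literal "&" and is not unescaped a second time.
print(html.unescape("&lt;p&gt;C&ocirc;te &amp; caf&eacute;&lt;/p&gt;"))

# Decimal and hexadecimal numeric references.
print(html.unescape("There&#8217;s"))
print(html.unescape("There&#x2019;s"))

# Following the HTML5 rules for invalid references, a cp1252-style value
# such as 146 (0x92) is mapped to the right single quotation mark.
print(html.unescape("There&#146;s"))
```

The last call shows that the committed function covers the Code Page 1252 case raised in the original report by following the HTML5 rules rather than a separate ad hoc mapping.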
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:56:34 | admin | set | github: 47176 |
| 2013-11-19 18:51:50 | ezio.melotti | set | status: open -> closed; resolution: fixed; messages: + msg203415; stage: commit review -> resolved |
| 2013-11-19 18:29:15 | python-dev | set | nosy: + python-dev; messages: + msg203413 |
| 2013-11-19 18:27:13 | serhiy.storchaka | set | messages: + msg203410 |
| 2013-11-19 17:41:17 | ezio.melotti | set | files: + issue2927-3.diff; messages: + msg203403; stage: patch review -> commit review |
| 2013-11-19 12:31:39 | ezio.melotti | set | files: + issue2927-2.diff; messages: + msg203369 |
| 2013-11-19 07:39:05 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg203341 |
| 2013-11-18 10:40:10 | ezio.melotti | link | issue13633 dependencies |
| 2013-11-18 09:54:25 | ezio.melotti | link | issue513840 superseder |
| 2013-11-18 09:23:21 | ezio.melotti | set | files: + issue2927.diff; assignee: ezio.melotti; versions: + Python 3.4, - Python 3.2 |
| 2012-11-28 03:37:33 | martin.panter | set | nosy: + martin.panter |
| 2010-11-09 00:25:44 | thomaspinckney3 | set | files: + unescape.diff; messages: + msg120826 |
| 2010-11-02 22:32:55 | thomaspinckney3 | set | messages: + msg120269 |
| 2010-07-25 22:30:19 | vstinner | set | messages: + msg111566 |
| 2010-07-24 11:51:42 | eric.araujo | set | status: pending -> open; nosy: + vstinner, ezio.melotti; versions: + Python 3.2, - Python 2.7 |
| 2010-07-24 11:48:33 | BreamoreBoy | set | status: open -> pending; messages: + msg111461 |
| 2010-07-18 15:38:58 | rnk | set | nosy: + rnk; messages: + msg110657 |
| 2010-07-18 11:46:56 | BreamoreBoy | set | nosy: + BreamoreBoy; messages: + msg110636 |
| 2010-01-15 20:32:20 | brian.curtin | set | priority: normal; keywords: + needs review; stage: patch review; versions: + Python 2.7, - Python 2.6 |
| 2008-05-20 05:47:55 | brett.cannon | set | nosy: + brett.cannon; messages: + msg67103 |
| 2008-05-20 05:43:53 | thomaspinckney3 | create | |