Issue 513840: entity unescape for sgml/htmllib


This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/36039

classification

Title:	entity unescape for sgml/htmllib
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	closed	Resolution:	duplicate
Dependencies:	Superseder:	expose html.parser.unescape View: 2927
Assigned To:	ezio.melotti	Nosy List:	BreamoreBoy, ezio.melotti, fdrake, glchapman
Priority:	normal	Keywords:	easy

Created on 2002-02-06 17:55 by glchapman, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Messages (4)
msg61076 - (view)	Author: Greg Chapman (glchapman)	Date: 2002-02-06 17:55
The parsers defined in htmllib and sgmllib do not provide any facilities for unescaping a tag attribute which has an embedded HTML entityref (i.e., they do not provide a way to convert "a&amp;b" to "a&b"). The parser in HTMLParser unescapes all tag attributes automatically. I'm not sure that's the right approach for sgmllib and htmllib (since it might break existing code), but it seems to me that one of the modules ought to provide a function or method which can do the unescaping if needed. (I'm not familiar with either the SGML or the HTML specification, but I assume one of them mandates the escaping of, e.g., '&' in tag attributes. If so, then it seems appropriate for one of the modules to provide a function which undoes the mandated transformation.)
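For illustration only, the transformation requested above can be sketched with `html.unescape`, which exists in the standard library from Python 3.4 onward. It postdates this report and is not part of sgmllib or htmllib; the helper name `unescape_attrs` is hypothetical and assumes attributes arrive as (name, value) pairs.

```python
from html import unescape


def unescape_attrs(attrs):
    """Return (name, value) pairs with entity and character references decoded.

    Hypothetical helper: valueless attributes (value is None) are passed through.
    """
    return [(name, unescape(value) if value is not None else None)
            for name, value in attrs]


# A single escaped ampersand, as in the report.
print(unescape("a&amp;b"))

# Attribute pairs as a parser callback might receive them.
print(unescape_attrs([("href", "?x=1&amp;y=2"), ("checked", None)]))
```

Keeping the decoding in a separate function leaves the parser's existing behavior untouched, which is the backward-compatibility concern raised in the report.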
msg61077 - (view)	Author: Fred Drake (fdrake) (Python committer)	Date: 2006-06-22 03:57
This request is making me reconsider some other changes that have already been made on the trunk (and are now in 2.5b1). Reading this, I thought "Doesn't it already do that?" Turns out that in Python 2.4, it doesn't. Both versions handle this in parsed character data; the difference is confined to attribute values. I'd like to propose adding a Boolean configuration attribute on the parser instance that, when set, causes the parser to decode entity and character references. By default, it would be unset. This would support backward compatibility and make it easier to get attribute value decoding. Another possibility would be to revert the new feature and add a separate method to perform the decoding.
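A minimal sketch of the Boolean-attribute idea, not the sgmllib implementation: the class `TinyAttrParser`, the flag `decode_attr_refs`, and the simplified attribute pattern are all hypothetical, and `html.unescape` (Python 3.4+) stands in for whatever decoding routine the parser would use.

```python
import re
from html import unescape

# Simplified pattern for name="value" pairs; real SGML/HTML attribute syntax is broader.
_ATTR_RE = re.compile(r'([A-Za-z_][\w\-.:]*)\s*=\s*"([^"]*)"')


class TinyAttrParser:
    # Hypothetical flag: unset by default so existing callers keep seeing raw values.
    decode_attr_refs = False

    def parse_attrs(self, text):
        """Return (name, value) pairs, decoding references only if the flag is set."""
        attrs = _ATTR_RE.findall(text)
        if self.decode_attr_refs:
            attrs = [(name, unescape(value)) for name, value in attrs]
        return attrs


parser = TinyAttrParser()
print(parser.parse_attrs('href="?a=1&amp;b=2"'))  # flag unset: value left as written

parser.decode_attr_refs = True
print(parser.parse_attrs('href="?a=1&amp;b=2"'))  # flag set: references decoded
```

Because the flag is a class attribute that defaults to False, code written against the old behavior is unaffected unless it opts in on a given instance.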
msg114175 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-17 21:41
Is anyone aware if this was implemented in 2.5 or later, as hinted at in msg61077? If so, please close this. If not, is there any point in putting this into 3.2?
msg185129 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2013-03-24 11:33
See also #2927.

History
Date	User	Action	Args
2022-04-10 16:04:57	admin	set	github: 36039
2013-11-18 09:54:25	ezio.melotti	set	status: open -> closed assignee: ezio.melotti superseder: expose html.parser.unescape resolution: duplicate stage: test needed -> resolved
2013-03-24 11:33:06	ezio.melotti	set	messages: + msg185129 versions: + Python 3.4, - Python 3.2
2013-03-23 22:22:01	ezio.melotti	set	nosy: + ezio.melotti
2010-08-17 21:41:06	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg114175 versions: + Python 3.2, - Python 2.7
2009-02-12 20:03:12	ajaksu2	set	keywords: + easy stage: test needed versions: + Python 2.7
2002-02-06 17:55:02	glchapman	create
