Issue 11113: html.entities mapping dicts need updating?


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/55322

classification

Title:	html.entities mapping dicts need updating?
Type:	enhancement	Stage:	resolved
Components:	Library (Lib), Unicode, XML	Versions:	Python 3.3

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	ezio.melotti	Nosy List:	Brian.Jones, eric.araujo, eric.smith, ezio.melotti, hp.dekoning, loewis, python-dev
Priority:	normal	Keywords:	patch

Created on 2011-02-04 03:43 by Brian.Jones, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
entities_dict.py	ezio.melotti, 2011-11-29 08:42	dict with the HTML5 entities
entities.py	ezio.melotti, 2012-06-23 16:13	dict ('name;': 'str';) with the 2231 HTML5 entities
issue11113.diff	ezio.melotti, 2012-06-23 16:53
issue11113-2.diff	ezio.melotti, 2012-06-23 18:31

Messages (22)
msg127865 - (view)	Author: Brian Jones (Brian.Jones) *	Date: 2011-02-04 03:43
In Python 3.2b2, html.entities.codepoint2name and name2codepoint only support the 252 HTML entity names defined in the HTML 4 spec from 1997. I'm wondering if there's a reason not to support W3C Recommendation 'XML Entity Definitions for Characters' http://www.w3.org/TR/xml-entity-names/ This standard contains significantly more characters, and it is noted in that spec that the HTML 5 drafts use that spec's entities. You can see the current HTML 5 'Named character references' here: http://www.w3.org/TR/html5/named-character-references.html#named-character-references If this is just a matter of somebody going in to do the grunt work, let me know. If startup costs associated with importing a huge dictionary are a concern, perhaps a more efficient type that enables the same lookup interface can be defined. If other reasons exist to not move in this direction, please do let me know!
msg127873 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2011-02-04 08:33
Supporting the ones in HTML 5 would be fine with me. Supporting those of xml-entity-names would be inappropriate - it's not clear (to me, at least) that all of them are really meant for use in HTML.
msg127911 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2011-02-04 17:40
Agreed with Martin. I wonder whether we should provide a means to use only HTML 4.01 entity references (say with a function parameter html5 defaulting to True) or just update the mapping.
msg128080 - (view)	Author: Eric V. Smith (eric.smith) * (Python committer)	Date: 2011-02-06 20:06
I don't see the need for a parameter to support different sets of entities. Just supporting the ones from HTML 5 seems like the right thing.
msg128081 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2011-02-06 20:07
To make my intent explicit: an updated mapping could generate references invalid for 4.01.
msg128082 - (view)	Author: Eric V. Smith (eric.smith) * (Python committer)	Date: 2011-02-06 20:08
Ah. I hadn't thought of generating them, only parsing them. In that case, then yes, it's an issue for generation.
msg138318 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2011-06-14 14:57
I just closed #12329 as a duplicate of this bug. It requested the addition of the apos named entity reference. TTBOMK, the html module (or htmlentitydefs in 2.x) doesn’t claim to support XHTML; an XML parser should be used for XHTML. In HTML 4.01, apos is not defined, but it is in HTML5.
msg138349 - (view)	Author: Hans Peter de Koning (hp.dekoning)	Date: 2011-06-14 21:02
The reason I raised #12329 was that the v2.7.1 documentation in http://docs.python.org/library/htmllib.html#module-htmlentitydefs says: "... The definition provided here contains all the entities defined by XHTML 1.0 ..." The only diff between the 252 HTML 4.01 and 253 XHTML 1.0 entities is "apos". See http://www.w3.org/TR/html401/sgml/entities.html and http://www.w3.org/TR/xhtml1/dtds.html .
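The 252 versus 253 count can be checked against the mappings themselves. This is a minimal sketch, assuming Python 3.3 or later, where the html5 dict that this issue eventually adds is available; it is an illustration and was not posted in the thread.

```python
from html.entities import entitydefs, html5, name2codepoint

# The HTML 4.01 tables hold 252 names and do not define "apos".
html4_count = len(name2codepoint)
apos_in_html4 = 'apos' in name2codepoint or 'apos' in entitydefs

# The HTML5 table (whose keys keep the trailing ';') does define it.
apos_in_html5 = html5.get('apos;')

print(html4_count, apos_in_html4, repr(apos_in_html5))
```

Code that has to accept XHTML input on an older mapping can add the single missing name itself rather than switch tables.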
msg138351 - (view)	Author: Hans Peter de Koning (hp.dekoning)	Date: 2011-06-14 21:28
BTW, the HTMLParser module (as well as html.parser in 3.x) does claim to parse both HTML and XHTML, see http://docs.python.org/library/htmlparser.html#module-HTMLParser .
msg138366 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2011-06-15 13:38
Ah, this changes the situation. I suppose it’s too late to stop pretending that HTML and XHTML are nearly the same thing (IOW change the doc), so apos needs to be defined for XHTML. IMO, we need a way to have the right entity references for HTML 4.01, XHTML 1.0 and HTML5, not put them all in one mapping.
msg140783 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-07-21 05:22
Having them in different mappings would be good, but I expect that for most real-world applications a single mapping that includes them all is the way to go. If I'm parsing a supposedly HTML page that contains an &apos;, I'd rather have it converted even if it's not an HTML entity. If the set of entities supported by HTML5 is a superset of the HTML4 and XHTML ones, then we might just use that (I haven't checked though).
msg148549 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2011-11-29 08:42
http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 entities (see also the attached file for a dict generated from that table).

Currently html.entities has only 252 entities, organized in 3 dicts:
1) name -> intvalue (e.g. 'amp': 0x0026);
2) intvalue -> name (e.g. 0x0026: 'amp');
3) name -> char (e.g. 'amp': '&').

In HTML 5, some of the entities map to a sequence of 2 characters; for example, &NotEqualTilde; corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING LONG SOLIDUS OVERLAY). This means that:
1) the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist should be used instead;
2) the reverse dict for this would have to use tuples as keys, but I'm not sure how useful that would be (producing entities is not a common case, especially "unusual" ones like these);
3) the name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities.

Since 1) is not backward-compatible, the HTML5 entities should probably go in a separate dict.

Also note that the entities are case-sensitive and some of them include different spellings (e.g. both 'amp' and 'AMP' map to '&'), so the reverse dict won't work too well. Having '&' -> 'amp' seems better than '&' -> 'AMP', but this might not be obvious for all the entities and requires some extra logic in the code to get it right.
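The three existing dicts and the two-character case can be inspected as follows. This is a minimal sketch, assuming Python 3.3 or later so that the html5 dict committed at the end of this thread can stand in for the attached file; the variable names are illustrative.

```python
from html.entities import codepoint2name, entitydefs, html5, name2codepoint

# The three HTML 4.01 dicts described above.
amp_code = name2codepoint['amp']        # 1) name -> intvalue
amp_name = codepoint2name[0x0026]       # 2) intvalue -> name
amp_char = entitydefs['amp']            # 3) name -> char

# A two-character HTML5 entity: a single int cannot represent it,
# but a str value holds both code points.
value = html5['NotEqualTilde;']
code_points = [hex(ord(ch)) for ch in value]

# Case-sensitive names with more than one spelling for the same character.
same_char = html5['amp;'] == html5['AMP;']

print(amp_code, amp_name, amp_char, code_points, same_char)
```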
msg148615 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2011-11-29 21:24
> 1) the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist should be used instead;
> 2) the reverse dict for this would have to use tuples as keys, but I'm not sure how useful would that be (producing entities is not a common case, especially "unusual" ones like these).
> 3) The name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities;
>
> Since 1) is not backward-compatible the HTML5 entities should probably go in a separate dict.

+1 for a separate dict; -1 for a value list. The right value type is 'str'; name2codepoint ought to be deprecated (it's a left-over from when the str type wasn't unicode in 2.x).

As for the reverse mapping: I'd add a dictionary that is reverse to entitydefs (i.e. with str keys). That some keys then have two characters is no real issue: applications that want to use this dictionary can either ignore them, or follow the approach of always checking Unicode combining characters - I'd expect that all "second" characters are indeed combining.

OTOH, it's easy enough to create an inverted dictionary yourself when you need it, and not every three-line function needs to be in the standard library. It might actually be more useful to compile the values into a regular expression which you can then use to find out whether characters can be escaped using entity references.
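Both ideas in this message, the do-it-yourself inverted dictionary and the compiled regular expression, can be sketched as below. This is only an illustration under stated assumptions: it uses the html5 dict that the thread later adds (Python 3.3+), str.isascii needs Python 3.7+, and the helper names invert_entities and escape_non_ascii, the tie-break rule, and the restriction to non-ASCII values are all hypothetical choices made here, not anything the standard library provides.

```python
import re
from html.entities import html5

def invert_entities(mapping):
    """Build a str -> name dict from a name -> str dict."""
    inverted = {}
    for name, chars in mapping.items():
        if not name.endswith(';'):
            continue  # ignore the spellings without a trailing ';'
        best = inverted.get(chars)
        # Hypothetical tie-break: shortest name wins, then lowercase
        # before uppercase, so '&' maps to 'amp;' rather than 'AMP;'.
        if best is None or (len(name), name.swapcase()) < (len(best), best.swapcase()):
            inverted[chars] = name
    return inverted

char2name = invert_entities(html5)

# Regex over the values that contain non-ASCII characters, longest first
# so two-character sequences are tried before one-character ones.
non_ascii = sorted((s for s in char2name if not s.isascii()), key=len, reverse=True)
escapable = re.compile('|'.join(map(re.escape, non_ascii)))

def escape_non_ascii(text):
    """Replace every matched sequence with its named character reference."""
    return escapable.sub(lambda m: '&' + char2name[m.group()], text)

print(escape_non_ascii('caf\xe9'))
```

Restricting the pattern to non-ASCII values keeps ordinary punctuation from being rewritten, since the HTML5 table also names characters such as the comma and the full stop.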
msg163634 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012-06-23 16:13
Attached another file with a dict that contains the 2231 HTML5 entities listed at http://www.w3.org/TR/html5/named-character-references.html

The dict is like:

    html5namedcharref = {
        'Aacute;': '\xc1',
        'Aacute': '\xc1',
        'aacute;': '\xe1',
        'aacute': '\xe1',
        'Abreve;': '\u0102',
        'abreve;': '\u0103',
        ...
    }

A better name could be found for the dict if you have better ideas (maybe html.entities.html5 only?). The dict will be added to html.entities.
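The excerpt shows that some names appear both with and without the trailing ';' while others appear only with it. A minimal sketch of how to see that, assuming Python 3.3 or later, where the dict ended up as html.entities.html5 (the html5namedcharref name above was only the working name):

```python
from html.entities import html5

# Look up the same names with and without the trailing ';'.
for name in ('Aacute;', 'Aacute', 'Abreve;', 'Abreve'):
    print(repr(name), '->', repr(html5.get(name)))

# Names that are also accepted without a trailing ';'.
legacy_names = sorted(name for name in html5 if not name.endswith(';'))
print(len(legacy_names), 'names have a form without the semicolon')
```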
msg163641 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012-06-23 16:39
Here is a proper patch, still using the html5namedcharref name. HTMLParser should also be updated to use this dict.
msg163654 - (view)	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-06-23 18:26
How about calling it just "html5", or "HTML5"? That it is about entities already follows from the module name.
msg163656 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012-06-23 18:31
Here's a new patch that uses the "html5" name for the dict; if there are no other comments, I'll commit it.
msg163701 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-06-24 02:37
New changeset 2b54e25d6ecb by Ezio Melotti in branch 'default': #11113: add a new "html5" dictionary containing the named character references defined by the HTML5 standard and the equivalent Unicode character(s) to the html.entities module. http://hg.python.org/cpython/rev/2b54e25d6ecb
msg163704 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2012-06-24 02:59
The ';' is not part of the entity name but an SGML delimiter, like '&'; the strings in the dict should not include it (the names in the other dicts don’t).
msg163705 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2012-06-24 03:04
BTW in the doc you may point to collections.ChainMap to explain to people how to make one dict with HTML 4 and HTML 5 entities. (Note that I assume there are two dicts, but I only skimmed the diff.)
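The ChainMap suggestion could look like the sketch below, assuming Python 3.3 or later. Because html5 keys end in ';' and entitydefs keys do not, the HTML 4 names are normalised first; that step and the variable names are assumptions of this example. The following reply expects html5 to be a superset of the HTML 4 names, in which case the fallback layer is never consulted and only_in_html4 comes out empty.

```python
from collections import ChainMap
from html.entities import entitydefs, html5

# Give the HTML 4 names the same trailing ';' that html5 keys carry.
html4_with_semicolon = {name + ';': char for name, char in entitydefs.items()}

# Lookups try html5 first and fall back to the HTML 4 table.
combined = ChainMap(html5, html4_with_semicolon)

# Names for which the fallback would actually be needed.
only_in_html4 = sorted(set(html4_with_semicolon) - set(html5))

print(repr(combined['amp;']), only_in_html4)
```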
msg163706 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012-06-24 03:11
The problem is that the standard allows some charrefs to end without a ';', but not all of them. So both "&Eacuteric" and "&Eacute;ric" will be parsed as "Éric", but only "&alpha;centauri" will result in "αcentauri"; "&alphacentauri" will be returned unchanged. I'm now working on #15156 to use this dict in HTMLParser, and detecting the ';'-less entities is not easy. A possible solution is to keep the names that are accepted without ';' in a separate (private) dict and expose a function like HTMLParser.unescape that implements all the necessary logic. Regarding ChainMap, the html5 dict should be a superset of the html4 one.
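The four cases in this message can be tried with html.unescape, a standard-library function that was added later, in Python 3.4, and is not part of this issue's patch. A minimal sketch:

```python
import html

samples = ('&Eacuteric', '&Eacute;ric', '&alpha;centauri', '&alphacentauri')

# "Eacute" may omit the ';' but "alpha" may not, so the last sample
# is expected to come back unchanged.
results = {text: html.unescape(text) for text in samples}

for text, converted in results.items():
    print(repr(text), '->', repr(converted))
```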
msg163707 - (view)	Author: Éric Araujo (eric.araujo) * (Python committer)	Date: 2012-06-24 03:32
The explanations make sense; don’t change anything.

History
Date	User	Action	Args
2022-04-11 14:57:12	admin	set	github: 55322
2012-06-24 03:32:54	eric.araujo	set	messages: + msg163707
2012-06-24 03:11:35	ezio.melotti	set	messages: + msg163706
2012-06-24 03:04:30	eric.araujo	set	messages: + msg163705
2012-06-24 02:59:50	eric.araujo	set	messages: + msg163704
2012-06-24 02:40:13	ezio.melotti	set	status: open -> closed resolution: fixed stage: commit review -> resolved
2012-06-24 02:37:51	python-dev	set	nosy: + python-dev messages: + msg163701
2012-06-23 18:31:59	ezio.melotti	set	files: + issue11113-2.diff messages: + msg163656
2012-06-23 18:26:25	loewis	set	messages: + msg163654
2012-06-23 16:53:06	ezio.melotti	set	files: + issue11113.diff
2012-06-23 16:52:54	ezio.melotti	set	files: - issue11113.diff
2012-06-23 16:39:24	ezio.melotti	set	files: + issue11113.diff keywords: + patch messages: + msg163641 stage: patch review -> commit review
2012-06-23 16:13:58	ezio.melotti	set	files: + entities.py messages: + msg163634 stage: needs patch -> patch review
2012-02-23 02:38:49	ezio.melotti	link	issue13633 dependencies
2011-11-29 21:24:22	loewis	set	messages: + msg148615
2011-11-29 08:43:03	ezio.melotti	set	files: + entities_dict.py messages: + msg148549
2011-11-29 06:10:33	ezio.melotti	set	assignee: ezio.melotti
2011-07-21 05:22:23	ezio.melotti	set	messages: + msg140783
2011-06-15 13:38:38	eric.araujo	set	messages: + msg138366
2011-06-14 22:50:27	ezio.melotti	set	nosy: + ezio.melotti
2011-06-14 21:28:18	hp.dekoning	set	messages: + msg138351
2011-06-14 21:02:56	hp.dekoning	set	nosy: + hp.dekoning messages: + msg138349
2011-06-14 14:57:15	eric.araujo	set	messages: + msg138318
2011-06-14 14:55:36	eric.araujo	link	issue12329 superseder
2011-02-06 20:08:41	eric.smith	set	nosy: loewis, eric.smith, eric.araujo, Brian.Jones messages: + msg128082
2011-02-06 20:07:02	eric.araujo	set	nosy: loewis, eric.smith, eric.araujo, Brian.Jones messages: + msg128081
2011-02-06 20:06:06	eric.smith	set	nosy: + eric.smith messages: + msg128080
2011-02-04 17:40:00	eric.araujo	set	versions: + Python 3.3, - Python 3.2 nosy: + eric.araujo messages: + msg127911 stage: needs patch
2011-02-04 08:33:12	loewis	set	nosy: + loewis messages: + msg127873
2011-02-04 03:43:54	Brian.Jones	create
