

Author ezio.melotti
Recipients eric.araujo, ezio.melotti
Date 2012-02-23.02:38:48
Message-id <1329964729.93.0.382093206807.issue13633@psf.upfronthosting.co.za>
In-reply-to
Content
This behavior is now documented, but the situation could still be improved. Adding a new method that receives the converted entity seems like a good way to handle this. The parser can call both the old methods and the new one, and users can implement whichever they prefer.
One problem with the current methods (handle_charref and handle_entityref) is that they don't do any processing on the entity and let invalid character references like &#x1000000000; or &#iamnotanentity; go through.
There are at least three changes that should be made in order to follow the HTML5 standard [0]:
 1) the parser should consult html.entities while parsing named character references (see also #11113). This will allow the parser to parse &notit; as "¬it;" and &notin; as "∉" (see the note at the very end of [0]);
 2) invalid character references (e.g. &#x1000000000;, &#iamnotanentity;) should not go through;
 3) the table of replacement characters at [0] should be used by the parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC).
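As a rough illustration of the three points above, here is what the conversion looks like with html.unescape. Note that this function was only added later (Python 3.4) and is documented as following the HTML5 rules for both valid and invalid character references; it is used here purely to show the expected results, not as the way the parser must implement them. The sample list is just the set of examples from this message.

```python
import html

# Examples taken from the discussion above.
samples = [
    "&notit;",            # named reference with a valid prefix ("not")
    "&notin;",            # complete named reference
    "&#x80;",             # numeric reference remapped by the HTML5 table
    "&#x1000000000;",     # numeric reference outside the Unicode range
    "&#iamnotanentity;",  # not a character reference at all
]

# Map each source string to the text html.unescape produces for it.
results = {s: html.unescape(s) for s in samples}

for source, converted in results.items():
    print(f"{source!r} -> {converted!r}")
```

Here &notit; becomes "¬it;", &notin; becomes "∉", &#x80; becomes U+20AC, the out-of-range reference becomes U+FFFD, and &#iamnotanentity; is left as plain text.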
Now, 1) can be done for both the old and the new methods, but for 2) and 3) the situation is a bit more complicated. The best option is probably to keep sending invalid references unchanged to the old methods, and to implement the correct behavior for the new method only.
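A minimal sketch of that split, written as a subclass rather than a patch to the parser: the old handlers still see the reference unchanged, and a new hook receives the converted text. The names ConvertingParser and handle_converted_ref are hypothetical and not part of HTMLParser, and the conversion is delegated to html.unescape (Python 3.4+) as an assumption about what "correct behavior" would mean here.

```python
import html
from html.parser import HTMLParser


class ConvertingParser(HTMLParser):
    """Hypothetical subclass: records references as the old handlers see
    them and also passes the converted text to a new hook."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)
        self.raw = []        # references as received by the old methods
        self.converted = []  # text received by the hypothetical new method

    def handle_charref(self, name):
        ref = "&#" + name + ";"
        self.raw.append(ref)
        self.handle_converted_ref(html.unescape(ref))

    def handle_entityref(self, name):
        ref = "&" + name + ";"
        self.raw.append(ref)
        self.handle_converted_ref(html.unescape(ref))

    def handle_converted_ref(self, text):
        # Hypothetical new hook, not an existing HTMLParser method.
        self.converted.append(text)


parser = ConvertingParser()
parser.feed("<p>&gt; &#x80; &notin;</p>")
parser.close()
print(parser.raw)
print(parser.converted)
```

Users who only care about the final text would override the new hook, while existing code that overrides handle_charref and handle_entityref keeps working as before.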
[0]: http://www.w3.org/TR/html5/tokenization.html#tokenizing-character-references 
History
Date User Action Args
2012-02-23 02:38:49  ezio.melotti  set  recipients: + ezio.melotti, eric.araujo
2012-02-23 02:38:49  ezio.melotti  set  messageid: <1329964729.93.0.382093206807.issue13633@psf.upfronthosting.co.za>
2012-02-23 02:38:49  ezio.melotti  link  issue13633 messages
2012-02-23 02:38:48  ezio.melotti  create
