Message 257074
Author: lemburg
Recipients: doerwalter, lemburg, serhiy.storchaka, terry.reedy, vstinner, 王杰
Date: 2015-12-27 12:33:05
Message-id: <567FDA7E.5020405@egenix.com>
In-reply-to: <1451178315.58.0.0417010168097.issue25937@psf.upfronthosting.co.za>

On 27.12.2015 02:05, Serhiy Storchaka wrote:
>
>> I wonder why this does not trigger the exception.
>
> Because in the case of utf-8 and iso-8859-1 the decoding and encoding steps are omitted.
>
> In the general case the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings the parser gets the raw data.
Right, but since the tokenizer doesn't know about "utf8" it
should reach out to the codec registry to get a properly encoded
version of the source code (even though this is an unnecessary
round-trip).
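A minimal sketch of the lookup I have in mind, assuming the tokenizer
could simply ask the codec registry for the canonical name before doing
any special-casing (the helper function is hypothetical, not CPython's
actual tokenizer code):

import codecs

# Hypothetical helper: resolve the declared source encoding via the
# codec registry so that aliases such as "utf8" normalize to the
# canonical name "utf-8".
def normalized_source_encoding(declared):
    try:
        return codecs.lookup(declared).name
    except LookupError:
        raise SyntaxError("unknown encoding: %s" % declared)

print(normalized_source_encoding("utf8"))    # -> utf-8
print(normalized_source_encoding("latin-1")) # -> iso8859-1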
There are a few other aliases for UTF-8 which would likely trigger
the same problem:
# utf_8 codec
'u8' : 'utf_8',
'utf' : 'utf_8',
'utf8' : 'utf_8',
'utf8_ucs2' : 'utf_8',
'utf8_ucs4' : 'utf_8',
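Each of these resolves to the same codec through the registry; a quick
check (just for illustration, assuming these aliases are registered in
the running interpreter):

import codecs

# Every alias from Lib/encodings/aliases.py above maps to the utf_8
# codec, whose canonical name is 'utf-8'.
for alias in ('u8', 'utf', 'utf8', 'utf8_ucs2', 'utf8_ucs4'):
    print(alias, '->', codecs.lookup(alias).name)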