PEP 393 vs UTF-8 Everywhere

Steve D'Aprano steve+python at pearwood.info
Sun Jan 22 09:01:32 EST 2017


On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+python at pearwood.info>:
>
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>>
>> Not quite. This is where it gets a bit messy and confusing. The bottom
>> line is: surrogates *are* code points, but they aren't *characters*.
>
> All animals are equal, but some animals are more equal than others.

Huh?

>> Strings which contain surrogates are strictly speaking illegal,
>> although some programming languages (including Python) allow them.
>
> Python shouldn't allow them.

That's one opinion.

>> The Unicode standard defines surrogates as follows:
>> [...]
>>
>> - Surrogate Code Point. A Unicode code point in the range
>> U+D800..U+DFFF. Reserved for use by UTF-16,
>
> The writer of the standard is playing word games, maybe to offer a fig
> leaf to Windows, Java et al.

Seriously?

>> By the letter of the Unicode standard, [Python] should not do this,
>> but nevertheless it does and it appears to do no real harm and have
>> some benefit.
>
> I'm afraid Python's choice may lead to exploitable security holes in
> Python programs.

Feel free to back that up with an actual demonstration of an exploit, rather
than just FUD.

>>>> py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> If Python was strictly conforming, that is correct, but it turns out
>> there are some useful things you can do with strings if you allow
>> surrogates.
>
> Conceptual confusion is a high price to pay for such tricks.

There's a lot to comprehend about Unicode. I don't see that Python's
non-strict implementation is harder to understand than the strict version.
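
To make the "code points, but not characters" distinction concrete, here is
a minimal sketch of what the non-strict behaviour looks like in Python 3.
The choice of the surrogateescape error handler (PEP 383) as the example of
a "useful thing" is an assumption made for illustration, and the sample
bytes and variable names are made up:

```python
import unicodedata

# A lone low surrogate is accepted in a str literal.
low = '\uDC37'
assert len(low) == 1
# The Unicode database classifies it as "Cs" (Other, surrogate),
# i.e. a code point that is not a character.
assert unicodedata.category(low) == 'Cs'

# The strict UTF-8 codec still refuses to encode it.
try:
    low.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Illustrative use: surrogateescape maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN, so arbitrary bytes survive a round trip via str.
raw = b'caf\xe9'                       # Latin-1 bytes, not valid UTF-8
text = raw.decode('utf-8', 'surrogateescape')
roundtrip = text.encode('utf-8', 'surrogateescape')
```

The surrogate only exists inside the str; the strict codec stops it from
leaking out into encoded data unless you explicitly ask for that handler.
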
-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

