PEP 393 vs UTF-8 Everywhere

Marko Rauhamaa marko at pacujo.net
Sat Jan 21 14:52:42 EST 2017


Pete Forman <petef4+usenet at gmail.com>:
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32.

Also, they aren't Unicode scalar values: no character can ever be
assigned to a surrogate code point. Python shouldn't allow surrogate
characters in strings.
 Thus the range of code points that are available for use as
 characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
 points).
 <URL: https://en.wikipedia.org/wiki/Unicode>
 The Unicode Character Database is basically a table of characters
 indexed using integers called 'code points'. Valid code points are in
 the ranges 0 to #xD7FF inclusive or #xE000 to #x10FFFF inclusive,
 which is about 1.1 million code points.
 <URL: https://www.gnu.org/software/guile/docs/master/guile.html/Characters.html>
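The two ranges quoted above are easy to write down as a predicate. A minimal sketch in Python; the name is_scalar_value is made up for illustration and is not part of any library:

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp lies in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


# All 0x110000 code points minus the 2048 surrogates (U+D800..U+DFFF).
usable = 0x110000 - (0xDFFF - 0xD800 + 1)
print(usable)  # 1112064

for cp in (0xD7FF, 0xD812, 0xE000):
    print(hex(cp), is_scalar_value(cp))
```

The count agrees with the 1,112,064 figure in the Wikipedia quote.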
Guile does the right thing:
 scheme@(guile-user)> #\xd7ff
 $1 = #\153777
 scheme@(guile-user)> #\xe000
 $2 = #\160000
 scheme@(guile-user)> #\xd812
 While reading expression:
 ERROR: In procedure scm_lreadr: #<unknown port>:5:8: out-of-range hex character escape: xd812
> py> low = '\uDC37'

That should raise a SyntaxError exception.
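For comparison, here is a rough sketch of what CPython 3 does with that literal today, as far as I can tell: the string is accepted, and the problem only surfaces when it is encoded with the strict UTF-8 codec. The variable names strict_ok and raw are mine, for illustration only.

```python
low = '\uDC37'   # CPython 3 accepts the literal; no SyntaxError is raised
print(len(low), hex(ord(low)))

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    low.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print('strict UTF-8 encoding succeeded:', strict_ok)

# The 'surrogatepass' error handler opts in to encoding it anyway.
raw = low.encode('utf-8', 'surrogatepass')
print(raw)
```

So the surrogate is caught at the encoding boundary rather than when the source is parsed, unlike the Guile reader above.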
Marko

