PEP 393 vs UTF-8 Everywhere

Sat Jan 21 15:49:26 EST 2017

On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman <petef4+usenet at gmail.com> wrote:
> Marko Rauhamaa <marko at pacujo.net> writes:
>>>> py> low = '\uDC37'
>>>> That should raise a SyntaxError exception.
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode. On a wide build or Python 3.3+ then all is rosy. (At this point
> I'm tempted to put in a winky emoji but that might push the internal
> representation into UCS-4.)

CPython allows surrogate codes for use with the "surrogateescape" and
"surrogatepass" error handlers, which are used for POSIX and Windows
file-system encoding, respectively. Maybe MicroPython goes about the
file-system round-trip problem differently, or maybe it just requires
using bytes for file-system and environment-variable names on POSIX
and doesn't care about Windows.

"surrogateescape" allows 'decoding' arbitrary bytes:
 >>> b'\x81'.decode('ascii', 'surrogateescape')
 '\udc81'
 >>> '\udc81'.encode('ascii', 'surrogateescape')
 b'\x81'
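To make the round trip concrete, here is a minimal sketch that pushes every possible byte value through the ASCII codec with "surrogateescape". The helper name escape_roundtrip is made up for illustration; only the codec and error-handler names come from CPython.

```python
def escape_roundtrip(raw: bytes, encoding: str = "ascii") -> str:
    """Decode raw bytes with surrogateescape and check they encode back unchanged."""
    text = raw.decode(encoding, "surrogateescape")
    # The handler guarantees the original bytes can be recovered.
    assert text.encode(encoding, "surrogateescape") == raw
    return text

# All 256 byte values: 0x00-0x7F decode as plain ASCII, while each of
# 0x80-0xFF is smuggled through as a lone low surrogate U+DC80-U+DCFF.
decoded = escape_roundtrip(bytes(range(256)))
print(ascii(decoded[0x41]), ascii(decoded[0x81]))
```

Each undecodable byte b becomes the code point 0xDC00 + b, which is why b'\x81' shows up as '\udc81' above.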
This error handler is required by CPython on POSIX to handle arbitrary
bytes in file-system paths. For example, when running with LANG=C:
 >>> sys.getfilesystemencoding()
 'ascii'
 >>> os.listdir(b'.')
 [b'\x81']
 >>> os.listdir('.')
 ['\udc81']
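The same conversion is available directly through os.fsdecode() and os.fsencode(), which use whatever file-system encoding and error handler the interpreter was started with. A rough sketch follows; fs_roundtrip is a hypothetical helper, and the round trip of an arbitrary byte is only expected to hold where the error handler is "surrogateescape" (i.e. on POSIX).

```python
import os
import sys

def fs_roundtrip(raw: bytes) -> bytes:
    """Decode raw bytes as a file name, then encode the str form back to bytes."""
    name = os.fsdecode(raw)    # str, possibly containing lone surrogates
    return os.fsencode(name)   # bytes again, using the same encoding and handler

# Report what this interpreter uses for file names.
print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())

if sys.getfilesystemencodeerrors() == "surrogateescape":
    # On POSIX an undecodable byte survives the trip through str.
    print(fs_roundtrip(b"\x81"))
```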
"surrogatepass" allows encoding surrogates:
 >>> '\udc81'.encode('utf-8', 'surrogatepass')
 b'\xed\xb2\x81'
 >>> b'\xed\xb2\x81'.decode('utf-8', 'surrogatepass')
 '\udc81'
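For contrast, a small sketch of what happens without the handler: the strict UTF-8 codec refuses a lone surrogate in both directions, and "surrogatepass" is what lets it through. The helper encodes_strictly is an illustrative name, not a library function.

```python
def encodes_strictly(text: str, encoding: str = "utf-8") -> bool:
    """Return True if text encodes with the default 'strict' error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

lone = "\udc81"
strict_encode_ok = encodes_strictly(lone)        # strict UTF-8 rejects surrogates

wobbly = lone.encode("utf-8", "surrogatepass")   # three-byte generalized UTF-8 form

try:
    wobbly.decode("utf-8")                       # strict decoding rejects it too
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

restored = wobbly.decode("utf-8", "surrogatepass")
print(strict_encode_ok, strict_decode_ok, wobbly, ascii(restored))
```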
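This error handler is used by CPython 3.6+ to encode Windows UCS-2
file-system paths as WTF-8 (Wobbly Transformation Format, 8-bit), so
that a lone surrogate in a file name survives the conversion to bytes.
For example: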
 >>> os.listdir('.')
 ['\udc81']
 >>> os.listdir(b'.')
 [b'\xed\xb2\x81']
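The conversion can be imitated on any platform by treating a file name as raw 16-bit code units. This is only a sketch of the idea, not CPython's actual Windows code path; both function names are invented here, and little-endian code units are assumed.

```python
def utf16_units_to_wtf8(units: bytes) -> bytes:
    """Convert little-endian 16-bit code units to WTF-8 style bytes."""
    # Decoding combines valid surrogate pairs and keeps lone surrogates as-is.
    text = units.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")

def wtf8_to_utf16_units(data: bytes) -> bytes:
    """Reverse the conversion, back to little-endian 16-bit code units."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")

lone_low = b"\x81\xdc"        # the single code unit 0xDC81
pair = b"\x3d\xd8\x09\xde"    # 0xD83D 0xDE09, a valid pair for U+1F609

print(utf16_units_to_wtf8(lone_low))
print(utf16_units_to_wtf8(pair))
```

A well-formed pair comes out as ordinary four-byte UTF-8, so only ill-formed names produce bytes that a strict UTF-8 decoder would reject.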