[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 23:01:44 CEST 2009

Glenn Linderman wrote:
> On approximately 4/28/2009 11:55 AM, came the following characters from 
> the keyboard of MRAB:
>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>> decodable.
>>
> UTF-8 is only mentioned in the sense of having special handling for
> re-encoding; all the other locales/encodings are implicit. But I also 
> went down that path to some extent.
>
>> But if you're talking about using it with other encodings, eg
>> shift-jisx0213, then I'd suggest the following:
>>
>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>> half surrogates U+DC00 to U+DCFF.
>>
> This makes 256 different escape codes.
>
Speaking personally, I won't call them 'escape codes'. I'd use the term
'escape code' to mean a character that changes the interpretation of the
next character(s).
>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>> are treated as though they are undecodable bytes.
>>
> This provides escaping for the 256 different escape codes, which is
> lacking from the PEP.
>
>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>> are encoded to bytes 0x00 to 0xFF.
>>
> This reverses the escaping.
>
>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>> be produced by decoding raise an exception.
>>
> This is confusing. Did you mean "excluding" instead of "including"?
>
Perhaps I should've said "Any codepoint which can't be produced by
decoding should raise an exception".
For example, decoding with UTF-8b will never produce U+DC00, therefore
attempting to encode U+DC00 should raise an exception and not produce
0x00.
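
A minimal sketch of that behaviour, assuming the error handler under the
name "surrogateescape" with which it later shipped in Python 3.1 (this
thread still calls it "python-escape"). The sample bytes are made up for
illustration:

```python
# Assumes Python 3.1+, where the PEP 383 handler is called "surrogateescape".
raw = b"abc\xff"                      # 0xFF is never valid in UTF-8

# The undecodable byte 0xFF becomes the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")
assert text == "abc\udcff"

# Encoding with the same handler restores the original byte.
assert text.encode("utf-8", "surrogateescape") == raw

# U+DC00 can never be produced by decoding, so encoding it is refused
# rather than silently turned into byte 0x00.
try:
    "\udc00".encode("utf-8", "surrogateescape")
except UnicodeEncodeError:
    rejected = True
else:
    rejected = False
assert rejected
```
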
>> I think I've covered all the possibilities. :-)
>>
> You might have. Seems like there could be a simpler scheme, though...
>
> 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817
> or pretty much any defined Unicode codepoint outside the range U+0100 to 
> U+01FF (see rule 3 for why). Only one escape codepoint is needed, this 
> is easier for humans to comprehend.
>
> 2. When the escape codepoint is decoded from the byte stream for a bytes
> interface or found in a str on the str interface, double it.
>
> 3. When an undecodable byte 0xPQ is found, decode to the escape
> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>
> 4. When encoding, a sequence of two escape codepoints would be encoded
> as one escape codepoint, and a sequence of the escape codepoint followed 
> by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints 
> not followed by the escape codepoint, or by a codepoint in the range 
> U+0100 to U+01FF would raise an exception.
>
> 5. Provide functions that will perform the same decoding and encoding as
> would be done by the system calls, for both bytes and str interfaces.
>
> This differs from my previous proposal in three ways:
>
> A. Doesn't put a marker at the beginning of the string (which I said
> wasn't necessary even then).
>
> B. Allows for a choice of escape codepoint, the previous proposal
> suggested a specific one. But the final solution will only have a 
> single one, not a user choice, but an implementation choice.
>
> C. Uses the range U+0100 to U+01FF for the escape codes, rather than
> U+0000 to U+00FF. This avoids introducing the NULL character and escape 
> characters into the decoded str representation, yet still uses 
> characters for which glyphs are commonly available, are non-combining, 
> and are easily distinguishable one from another.
>
> Rationale:
>
> The use of codepoints with visible glyphs makes the escaped string
> friendlier to display systems, and to people. I still recommend using 
> U+003F as the escape codepoint, but certainly one with a typically
> visible glyph available. This avoids what I consider to be an annoyance 
> with the PEP, that the codepoints used are not ones that are easily 
> displayed, so undecodable names could easily result in long strings of
> indistinguishable substitution characters.
>
Perhaps the escape character should be U+005C. ;-)

> It, like MRAB's proposal, also avoids data puns, which is a major 
> problem with the PEP. I consider this proposal to be easier to 
> understand than MRAB's proposal, or the PEP, because of the single 
> escape codepoint and the use of visible characters.
>
> This proposal, like my initial one, also decodes and encodes (just the
> escape codes) values on the str interfaces. This is necessary to avoid 
> data puns on systems that provide both types of interfaces.
>
> This proposal could be used for programs that use str values, and easily
> migrates to a solution that provides an object that provides an 
> abstraction for system interfaces that have two forms.
>
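
For comparison, rules 1 to 4 of the single-escape-codepoint scheme quoted
above could be sketched roughly as follows. This is only an illustration
under stated assumptions: the function names escape_decode and
escape_encode are hypothetical, U+003F is picked as the escape codepoint
because it is the one recommended above, and the encoder assumes a
stateless encoding such as UTF-8.

```python
ESC = "?"  # escape codepoint; U+003F as recommended above (an illustrative choice)


def escape_decode(data: bytes, encoding: str = "utf-8", esc: str = ESC) -> str:
    """Hypothetical helper: decode, escaping undecodable bytes (rules 2 and 3)."""
    out = []
    pos = 0
    while pos < len(data):
        chunk = data[pos:]
        try:
            # Rule 2: a genuinely decoded escape codepoint is doubled.
            out.append(chunk.decode(encoding).replace(esc, esc * 2))
            break
        except UnicodeDecodeError as err:
            good = chunk[:err.start].decode(encoding)
            out.append(good.replace(esc, esc * 2))
            # Rule 3: undecodable byte 0xPQ -> escape codepoint + U+01PQ.
            for byte in chunk[err.start:err.end]:
                out.append(esc + chr(0x100 + byte))
            pos += err.end
    return "".join(out)


def escape_encode(text: str, encoding: str = "utf-8", esc: str = ESC) -> bytes:
    """Hypothetical helper: reverse escape_decode (rule 4)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != esc:
            # Per-character encoding; fine for UTF-8, not for stateful codecs.
            out += ch.encode(encoding)
            i += 1
            continue
        nxt = text[i + 1:i + 2]
        if nxt == esc:
            out += esc.encode(encoding)       # doubled escape -> one escape
        elif nxt and 0x100 <= ord(nxt) <= 0x1FF:
            out.append(ord(nxt) - 0x100)      # escape + U+01PQ -> byte 0xPQ
        else:
            raise ValueError("dangling escape codepoint at index %d" % i)
        i += 2
    return bytes(out)


raw = b"caf\xe9?"                 # Latin-1 byte 0xE9 is invalid here as UTF-8
escaped = escape_decode(raw)      # 'caf?' + U+01E9 + '??'
restored = escape_encode(escaped)
```

Because the real "?" is doubled and the bad byte becomes "?" followed by
U+01E9, the str form is unambiguous and encodes back to the original
bytes; a "?" followed by anything else raises ValueError.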