[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Wed Apr 29 08:04:52 CEST 2009

>> The Python UTF-8 codec will happily encode half-surrogates; people argue
>> that it is a bug that it does so, however, it would help in this
>> specific case.
>
> Can we use this encoding scheme for writing into files as well? We've
> turned the filename with undecodable bytes into a string with half
> surrogates. Putting that string into a file has to turn them into bytes
> at some level. Can we use the python-escape error handler to achieve
> that somehow?

Sure: if you are aware that what you write to the stream is actually
a file name, you should encode it with the file system encoding, and
the python-escape handler. However, it's questionable that the same
approach is right for the rest of the data that goes into the file.
If you use a different encoding on the stream, yet still use the
python-escape handler, you may end up with completely nonsensical
bytes. In practice, it probably won't be that bad - python-escape
has likely escaped all non-ASCII bytes, so that on re-encoding with
a different encoding, only the ASCII characters get encoded, which
likely will work fine.
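
A minimal sketch of the three cases above. It assumes the handler under
the name it later shipped with in Python 3.1, "surrogateescape" (this
thread still calls it python-escape). The file name bytes and the codec
choices are made up for illustration; real code would normally take the
codec from sys.getfilesystemencoding().

```python
# Hypothetical file name bytes: UTF-8 for "café-" plus a stray 0xff byte
# that is not valid UTF-8.
raw = b"caf\xc3\xa9-\xff.txt"

# Assumption: the handler discussed here as python-escape, under its
# final name.
HANDLER = "surrogateescape"

# Case 1: same codec and handler on both sides. The undecodable byte
# becomes a lone surrogate (U+DCFF) and is restored on encoding.
name = raw.decode("utf-8", HANDLER)
same_codec = name.encode("utf-8", HANDLER)

# Case 2: a different codec on the stream. The escaped byte survives,
# but the properly decoded "é" is re-encoded as Latin-1, so the result
# no longer matches the original file name bytes.
other_codec = name.encode("latin-1", HANDLER)

# Case 3: the file system encoding was ASCII, so every non-ASCII byte
# was escaped. Re-encoding with another ASCII-compatible codec then
# reproduces the original bytes.
ascii_name = raw.decode("ascii", HANDLER)
ascii_then_latin1 = ascii_name.encode("latin-1", HANDLER)

print(same_codec == raw)         # same codec: lossless
print(other_codec == raw)        # different codec: bytes changed
print(ascii_then_latin1 == raw)  # all non-ASCII escaped: lossless again
```

Only bytes are printed here, because a string containing lone surrogates
cannot be written to a stream that uses the default strict handler.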

Regards,
Martin