Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5


On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
> 
> > then the name is horribly misleading, and it is best handled like this:
> > 
> > content = '\n'.join([
> > 'header',
> > 'part 2 %.3f' % number,
> > binary_image_data.decode('latin-1'),
> > utf16_string, # Misleading name, actually Unicode string
> > 'trailer'])
> 
> This loses bigtime, as any encoding that can handle non-latin1 in
> utf16_string will corrupt binary_image_data. OTOH, latin1 will raise
> on non-latin1 characters. utf16_string must be encoded appropriately
> then decoded by latin1 to be reencoded by latin1 on output.
Of course you're right, but I have understood the above as being a 
sketch and not real code. (E.g. does "header" really mean the literal 
string "header", or does it stand in for something which is a header?) 
In real code, one would need to have some way of telling where the 
binary image data ends and the Unicode string begins.
If I have misunderstood the situation, then my apologies for compounding 
the error.
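
For concreteness, here is a minimal sketch of the encode-then-smuggle 
approach described in the quoted text, as I read it. The sample values 
are hypothetical stand-ins, and a real format would still need some way 
to mark where the image bytes end and the text begins.

```python
# Hypothetical sample data; neither value comes from the original thread.
binary_image_data = bytes([0x89, 0x50, 0xFF, 0x00, 0x80])
utf16_string = 'snowman \u2603'  # misleading name, actually a Unicode string

# Latin-1 cannot encode characters above U+00FF, so this raises.
try:
    utf16_string.encode('latin-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# Encode the text to UTF-8 first, then smuggle those bytes through Latin-1.
text_part = utf16_string.encode('utf-8').decode('latin-1')
image_part = binary_image_data.decode('latin-1')
content = image_part + text_part

# Re-encoding with Latin-1 on output restores every byte unchanged.
wire = content.encode('latin-1')
```

Under these assumptions the output is the raw image bytes followed by the 
UTF-8 encoding of the text.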
[...]
> > Both examples assume that you intend to do further processing of content 
> > before sending it, and will encode just before sending:
> > 
> > content.encode('utf-8')
> > 
> > (Don't use Latin-1, since it cannot handle the full range of text 
> > characters.)
> 
> This corrupts binary_image_data. Each byte > 127 will be replaced by
> two bytes.
And reading it back using decode('utf-8') will replace those two bytes 
with a single character whose code point equals the original byte, 
round-tripping exactly.
Of course if you encode to UTF-8 and then try to read the binary data as 
raw bytes, you'll get corrupted data. But do people expect to do this? 
That's a genuine question -- again, I assumed (apparently wrongly) that 
the idea was to write the content out as *text* containing smuggled 
bytes, and read it back the same way.
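
A minimal sketch of that round trip, assuming the content is always 
written and read back as text. The sample data and the fixed-offset 
slicing are assumptions for illustration only.

```python
# Hypothetical sample data: every byte value stands in for image data.
binary_image_data = bytes(range(256))

# Latin-1 maps byte N to code point N, so any byte string decodes cleanly.
smuggled = binary_image_data.decode('latin-1')
content = '\n'.join(['header', smuggled, 'caf\u00e9 \u20ac', 'trailer'])

# Encode just before sending.
wire = content.encode('utf-8')

# Reading back as text restores the original string exactly.
text_back = wire.decode('utf-8')

# The smuggled bytes may contain '\n', so slice by known offset and length
# instead of splitting on newlines.
start = len('header\n')
recovered = text_back[start:start + len(binary_image_data)].encode('latin-1')

# Read as raw bytes, the image data is not there: each byte above 127
# was written as two bytes.
raw_intact = binary_image_data in wire
```

Here recovered matches the original bytes, while raw_intact is False, 
which is the corruption described in the quoted text.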
> In the second case, you can use latin1 to encode, if it
> gives you what you want.
> 
> This kind of subtlety is precisely why MAL warned about use of latin1
> to smuggle bytes.
How would you smuggle a chunk of arbitrary bytes into a text string? 
Short of doing something like uuencoding it into ASCII, or equivalent.
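
A sketch of the ASCII-armouring option, using base64 from the standard 
library as the "or equivalent" of uuencoding. The sample data and field 
layout are hypothetical.

```python
import base64

# Hypothetical sample data standing in for the image bytes.
binary_image_data = bytes(range(256))

# b64encode returns ASCII-only bytes with no embedded newlines.
armored = base64.b64encode(binary_image_data).decode('ascii')
content = '\n'.join(['header', armored, 'trailer'])

# Any text encoding that covers ASCII now carries the payload safely.
wire = content.encode('utf-8')

# The armoured field contains no newline, so splitting on '\n' is safe here.
fields = wire.decode('utf-8').split('\n')
recovered = base64.b64decode(fields[1])
```

The cost is size: base64 output is about a third larger than the input.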
-- 
Steven