Message 184490 - Python tracker


In-reply-to
Author	gward
Recipients	barry, durin42, gward, ncoghlan, r.david.murray, terry.reedy
Date	2013-03-18 18:50:26
Message-id	<1363632627.1.0.872503303943.issue17445@psf.upfronthosting.co.za>

Content

Replying to Terry Reedy:
> So a dual string/bytes function would not be completely trivial.
Correct. I have one working, but it makes my eyes bleed. I feel ashamed to have written it.
> Greg, can you convert bytes to strings, or strings to bytes
Nope. Here is the hypothetical use case: I have a text file written in Polish, encoded in ISO-8859-2 and committed to a Mercurial repository. (Or saved in a filesystem somewhere: it doesn't really matter, except that Mercurial repositories are immutable, long-term, and *must* *not* *lose* *data*.) Then I decide I should play nicely with the rest of the world and transcode to UTF-8, so I commit a new rev in UTF-8.
Years later, I need to look at the diff between those two old revisions. Rev 1 is a pile of ISO-8859-2 bytes, and rev 2 is a pile of UTF-8 bytes. The output of diff looks like:
 - blah blah [iso-8859-2 bytes] blah
 + blah blah [utf-8 bytes] blah
Note this: the output of diff has some lines that are iso-8859-2 bytes and some that are utf-8 bytes. *There is no single encoding* that applies.
Note also that diff output must contain the exact original bytes, so that it can be consumed by patch. Diffs are read both by humans and by machines.
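To make the encoding problem concrete, here is a minimal sketch. The sample sentence and the names `rev1`, `rev2` and `decodes_as` are made up for illustration and do not come from any real repository. The same Polish line gives two different byte strings, and the ISO-8859-2 one is not valid UTF-8.

```python
# Hypothetical sample line; any Polish text with diacritics would do.
text = "zażółć gęślą jaźń\n"

rev1 = text.encode("iso-8859-2")  # what the old revision stores
rev2 = text.encode("utf-8")       # what the transcoded revision stores


def decodes_as(data, encoding):
    """Return True if data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Same text, different bytes: a diff has to report this line as changed.
print(rev1 != rev2)
# The ISO-8859-2 bytes cannot be read as UTF-8, so a diff containing
# both revisions cannot be decoded with one strict codec.
print(decodes_as(rev1, "utf-8"))
```

Decoding each revision with its own encoding recovers the same text, but a diff of the two revisions carries no per-line encoding information, which is why the output has to stay as bytes.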
> Otherwise, I think it might be better to write a new function 
> 'unified_diff_bytes' that did exactly what you want than to try to 
> make unified_diff accept sequences of bytes.
Good idea. That might be much less revolting than what I have now. I'll give it a shot.
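One way such a function could look, as a rough sketch only: the name `unified_diff_bytes` is taken from the suggestion quoted above, but the signature and the approach are assumptions, not the code that was actually written for this issue. It decodes every line as ASCII with the `surrogateescape` error handler, so non-ASCII bytes survive the round trip untouched, then runs the ordinary `difflib.unified_diff` and encodes the result back to bytes.

```python
import difflib


def unified_diff_bytes(a, b, fromfile=b"", tofile=b"", n=3):
    """Sketch: unified diff over lists of bytes lines, yielding bytes.

    Hypothetical helper. Bytes that are not ASCII are smuggled through
    str as lone surrogates and restored exactly on the way out.
    """
    def to_str(data):
        return data.decode("ascii", "surrogateescape")

    def to_bytes(line):
        return line.encode("ascii", "surrogateescape")

    lines = difflib.unified_diff(
        [to_str(x) for x in a],
        [to_str(x) for x in b],
        fromfile=to_str(fromfile),
        tofile=to_str(tofile),
        n=n,
    )
    for line in lines:
        yield to_bytes(line)


# Hypothetical revisions of the same Polish line in two encodings.
old = ["zażółć gęślą jaźń\n".encode("iso-8859-2")]
new = ["zażółć gęślą jaźń\n".encode("utf-8")]
diff = list(unified_diff_bytes(old, new, b"rev1", b"rev2"))
for line in diff:
    print(line)
```

Each `-` line carries the exact ISO-8859-2 bytes and each `+` line the exact UTF-8 bytes, so the output could still be fed to patch. Python 3.5 and later also provide `difflib.diff_bytes`, which could be used in place of a hand-written wrapper like this one.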

History
Date	User	Action	Args
2013-03-18 18:50:27	gward	set	recipients: + gward, barry, terry.reedy, ncoghlan, durin42, r.david.murray
2013-03-18 18:50:27	gward	set	messageid: <1363632627.1.0.872503303943.issue17445@psf.upfronthosting.co.za>
2013-03-18 18:50:27	gward	link	issue17445 messages
2013-03-18 18:50:26	gward	create
