Message184490
| Author |
gward |
| Recipients |
barry, durin42, gward, ncoghlan, r.david.murray, terry.reedy |
| Date |
2013-03-18.18:50:26 |
| Message-id |
<1363632627.1.0.872503303943.issue17445@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
Replying to Terry Reedy:
> So a dual string/bytes function would not be completely trivial.
Correct. I have one working, but it makes my eyes bleed. I feel ashamed to have written it.
> Greg, can you convert bytes to strings, or strings to bytes
Nope. Here is the hypothetical use case: I have a text file written in Polish, encoded in ISO-8859-2, committed to a Mercurial repository. (Or saved in a filesystem somewhere: it doesn't really matter, except that Mercurial repositories are immutable, long-term, and *must* *not* *lose* *data*.) Then I decide I should play nicely with the rest of the world and transcode to UTF-8, so I commit a new rev in UTF-8.
Years later, I need to look at the diff between those two old revisions. Rev 1 is a pile of ISO-8859-2 bytes, and rev 2 is a pile of UTF-8 bytes. The output of diff looks like
- blah blah [iso-8859-2 bytes] blah
+ blah blah [utf-8 bytes] blah
Note this: the output of diff has some lines that are iso-8859-2 bytes and some that are utf-8 bytes. *There is no single encoding* that applies.
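To make the mixed-encoding point concrete, here is a small sketch. The sample line and the helper name `decodes_as` are made up for illustration; only the two encodings come from the use case above.

```python
# Hypothetical sample line; the Polish word is only an illustration.
text = "blah blah żółć blah\n"

rev1 = text.encode("iso-8859-2")  # bytes as stored in the old revision
rev2 = text.encode("utf-8")       # bytes as stored after transcoding


def decodes_as(data, encoding):
    """Return True if data decodes without error under encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The ISO-8859-2 line is not valid UTF-8 at all.
rev1_is_valid_utf8 = decodes_as(rev1, "utf-8")

# The UTF-8 line does decode as ISO-8859-2, but into the wrong characters.
rev2_misread = rev2.decode("iso-8859-2")
```

Whichever codec is picked, one of the two lines either fails to decode or decodes to garbage, so the diff as a whole cannot be treated as text in one encoding.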
Note also that diff output must contain the exact original bytes, so that it can be consumed by patch. Diffs are read both by humans and by machines.
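One possible way to keep the exact bytes, sketched here under assumptions and not the code I actually have: map each bytes line to a str with the `surrogateescape` error handler, run the existing `difflib.unified_diff`, and map the result back. The function name `unified_diff_bytes_sketch` and the sample data are hypothetical.

```python
import difflib


def unified_diff_bytes_sketch(a, b, fromfile="a", tofile="b"):
    """Diff two lists of bytes lines; yield the diff as bytes lines.

    Non-ASCII bytes are smuggled through str as lone surrogates, so
    every byte comes back out unchanged, whatever its encoding was.
    """
    def to_str(line):
        return line.decode("ascii", "surrogateescape")

    def to_bytes(line):
        return line.encode("ascii", "surrogateescape")

    a_str = [to_str(line) for line in a]
    b_str = [to_str(line) for line in b]
    for line in difflib.unified_diff(a_str, b_str, fromfile, tofile):
        yield to_bytes(line)


# Hypothetical revisions: same text, two different encodings.
old = ["blah blah żółć blah\n".encode("iso-8859-2")]
new = ["blah blah żółć blah\n".encode("utf-8")]

patch = list(unified_diff_bytes_sketch(old, new))
```

The removed and added lines in `patch` carry the original ISO-8859-2 and UTF-8 bytes respectively, which is what patch needs. The file names in the header are plain ASCII str here; handling bytes file names would need the same round trip.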
> Otherwise, I think it might be better to write a new function
> 'unified_diff_bytes' that did exactly what you want than to try to
> make unified_diff accept sequences of bytes.
Good idea. That might be much less revolting than what I have now. I'll give it a shot. |
|