Timeline for Bytes in a unicode Python string
Current License: CC BY-SA 3.0
20 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Nov 21, 2020 at 9:01 | answer | added | Tahirhan | timeline score: 0 | |
| Nov 27, 2016 at 23:45 | comment | added | ShreevatsaR | Just to be clear: the presence of `\xNN` escape sequences does not mean they're UTF-8 bytes. Python represents Unicode code points in the range 0 to FF by `\x` escape sequences (other than printable ASCII characters and `\n`, `\t`, etc.). See code here. Try this code: `for n in range(300): print hex(n), repr(unichr(n))`. For example, the character Ð (U+00D0) will be represented by `\xd0` rather than `\u00d0`, even in a Unicode string. | |
| Mar 24, 2012 at 1:03 | answer | added | Mark Tolonen | timeline score: 5 | |
| Mar 23, 2012 at 22:11 | answer | added | georg | timeline score: 13 | |
| Mar 23, 2012 at 21:39 | comment | added | Etienne Perot | @thg435 Nah, that's just because I took a substring of a word to keep this example string short enough (the full string was Стандартный Захват Контрольных Точек). | |
| Mar 23, 2012 at 21:39 | comment | added | Karl Knechtel | @tchrist indeed, if the data is indeed on disk nominally UTF-8 encoded, OP may be looking at a hopefully rare case of double-UTF. ;) | |
| Mar 23, 2012 at 21:35 | comment | added | georg | BTW, "Русский ек" doesn't seem to be valid either, it probably should read "Русский язык" (=Russian language), so I guess there's more than that broken. | |
| Mar 23, 2012 at 21:24 | comment | added | Winston Ewert | I think your best bet is to figure out how such a crazy string was generated in the first place. Only then can you figure out the best way to fix it. You may be able to avoid modifying the code responsible, but you probably can't avoid understanding it. | |
| Mar 23, 2012 at 21:22 | vote | accept | Etienne Perot | ||
| Mar 23, 2012 at 21:20 | comment | added | tchrist | What’s happening is that the second part of your string has been double-encoded, which causes it to appear to have a bunch of code points < 255, which interpreted as UTF-8 give the right value. | |
| Mar 23, 2012 at 20:58 | comment | added | Etienne Perot | Well then, I'm not really sure what that string is anymore... It is an object whose representation starts with `u` (like unicode strings do) and which contains both `\uXXXX` escapes (like unicode strings do) and `\xXX` escapes (like byte strings do). All sequences of `\xXX` escapes in the representation of the object would also be valid UTF-8 if they were byte strings (which they're not, because they're contained inside the unicode string). Not sure if that makes more sense, but I hope it does. | |
| Mar 23, 2012 at 20:53 | comment | added | Mark Ransom | @EtiennePerot, if you're starting with a UTF-8 byte sequence then please add it to the question. What you've shown us is a Unicode string which is NOT THE SAME! | |
| Mar 23, 2012 at 20:53 | answer | added | Karl Knechtel | timeline score: 23 | |
| Mar 23, 2012 at 20:52 | comment | added | Mark Ransom | @NiklasB. is right - the UTF-8 encoded bytes are also valid Unicode codepoints so there's no way to tell what's what reliably. | |
| Mar 23, 2012 at 20:51 | comment | added | Etienne Perot | All the bytes in the input data are UTF-8-encoded characters, so I think it is safe to assume that every sequence of bytes in the initial string can be decoded from UTF-8. | |
| Mar 23, 2012 at 20:50 | vote | accept | Etienne Perot | ||
| Mar 23, 2012 at 20:58 | |||||
| Mar 23, 2012 at 20:48 | comment | added | Niklas B. | There's no reliable way to solve this because the input data doesn't contain enough information in the first place. | |
| Mar 23, 2012 at 20:43 | answer | added | kev | timeline score: 5 | |
| Mar 23, 2012 at 20:42 | answer | added | beerbajay | timeline score: 12 | |
| Mar 23, 2012 at 20:05 | history | asked | Etienne Perot | CC BY-SA 3.0 |
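
The double-encoding that tchrist describes in the comments can be illustrated with a short sketch. It is written for Python 3 (the thread itself is about Python 2), and the helper name `repair_double_encoded` is made up for this example; it is not taken from any of the answers. It assumes that every run of code points in the range U+0080 to U+00FF is really a run of UTF-8 bytes that was decoded as Latin-1. As Niklas B. and Mark Ransom point out, that assumption cannot be checked from the string alone, so treat this as a heuristic rather than a reliable fix.

```python
import re


def repair_double_encoded(text: str) -> str:
    """Re-decode runs of U+0080..U+00FF code points as UTF-8 bytes.

    Heuristic only: a run that is not valid UTF-8 is left untouched,
    but genuine Latin-1 text that happens to form valid UTF-8 would be
    converted as well.
    """
    def fix(match):
        run = match.group(0)
        try:
            # Latin-1 maps code points 0..255 to the same byte values,
            # so this recovers the original bytes before decoding them.
            return run.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            return run  # not valid UTF-8, assume it was real text

    return re.sub(r"[\x80-\xff]+", fix, text)


# Build a string that is half correct and half double-encoded.
good = "Русский "
broken = "язык".encode("utf-8").decode("latin-1")
mixed = good + broken

print(repr(mixed))
print(repair_double_encoded(mixed))
```

A run that is not valid UTF-8, such as the single `é` in `café`, is returned unchanged, which is the only safeguard this sketch offers.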