Timeline for Bytes in a unicode Python string

Current License: CC BY-SA 3.0

20 events
when what by license comment
Nov 21, 2020 at 9:01 answer added Tahirhan timeline score: 0
Nov 27, 2016 at 23:45 comment added ShreevatsaR Just to be clear: the presence of \xNN escape sequences does not mean they're UTF-8 bytes. Python represents Unicode code points in the range 0 to FF by \x escape sequences (other than printable ASCII characters and \n, \t, etc.). See code here. Try this code: for n in range(300): print hex(n), repr(unichr(n)). For example, the character Ð (U+00D0) will be represented by \xd0 rather than \u00d0 even in a Unicode string.
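The snippet in the comment above is Python 2. As a rough Python 3 sketch of the same point, the built-in ascii() can stand in for Python 2's repr() of a unicode object (that substitution is an assumption made here, and escaped_repr is a hypothetical helper name, not something from the thread). It shows that \x escapes appear for code points up to FF and \u escapes only from 100 upward:

```python
def escaped_repr(code_point):
    """Return the ASCII-only repr of the one-character string for code_point.

    Hypothetical helper: ascii() is used here as a Python 3 stand-in for
    Python 2's repr(unichr(n)).
    """
    return ascii(chr(code_point))


# A few code points on either side of the 0xFF boundary.
for n in (0x41, 0x0A, 0x7F, 0xD0, 0xFF, 0x100, 0x435):
    print(hex(n), escaped_repr(n))
```

Here U+00D0 comes out as '\xd0' even though it is a code point in a text string and not a byte.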
Mar 24, 2012 at 1:03 answer added Mark Tolonen timeline score: 5
Mar 23, 2012 at 22:11 answer added georg timeline score: 13
Mar 23, 2012 at 21:39 comment added Etienne Perot @thg435 Nah, that's just because I took a substring of a word to keep this example string short enough (Full string was Стандартный Захват Контрольных Точек)
Mar 23, 2012 at 21:39 comment added Karl Knechtel @tchrist indeed, if the data is indeed on disk nominally UTF-8 encoded, OP may be looking at a hopefully rare case of double-UTF. ;)
Mar 23, 2012 at 21:35 comment added georg BTW, "Русский ек" doesn't seem to be valid either, it probably should read "Русский язык" (=Russian language), so I guess there's more than that broken.
Mar 23, 2012 at 21:24 comment added Winston Ewert I think your best bet is to figure out how such a crazy string was generated in the first place. Only then can you figure out the best way to fix it. You may be able to avoid modifying the code responsible, but you probably can't avoid understanding it.
Mar 23, 2012 at 21:22 vote accept Etienne Perot
Mar 23, 2012 at 21:20 comment added tchrist What’s happening is that the second part of your string has been double-encoded, which causes it to appear to have a bunch of code points < 256, which interpreted as UTF-8 give the right value.
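A minimal Python 3 sketch of the double-encoding effect described in this comment, assuming the intermediate mis-decoding was Latin-1 (the thread does not say which codec was actually involved, and the sample word is only illustrative):

```python
# Sample text; the word is taken from elsewhere in this thread for illustration.
original = "язык"

# Step 1: the text is encoded to UTF-8 bytes, as it should be.
utf8_bytes = original.encode("utf-8")

# Step 2 (the bug): the bytes are treated as if each byte were one character.
# Latin-1 maps byte value N to code point N, so it is assumed here.
double = utf8_bytes.decode("latin-1")

# Every character of the damaged string is a code point below 256.
assert all(ord(ch) < 256 for ch in double)

# Reversing step 2 and decoding as UTF-8 recovers the original text.
repaired = double.encode("latin-1").decode("utf-8")
print(ascii(double), "->", repaired)
```

The round trip only works because every code point in the damaged string fits in one byte; it says nothing about how to tell damaged text from text that legitimately uses those code points.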
Mar 23, 2012 at 20:58 comment added Etienne Perot Well then, I'm not really sure what that string is anymore... It is an object whose representation starts with u (like unicode strings do) and which contains both \uXXXX's (like unicode strings do) and \xXX's (like byte strings do). All sequences of \xXX's in the representation of the object also happen to be valid UTF-8 byte strings if they were byte strings (which they're not, because they're contained inside the unicode string). Not sure if that makes more sense, but I hope it does
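For a string shaped like the one described here, one heuristic (a sketch only, not presented as any answerer's method) is to find runs of characters in the range 80 to FF and re-decode each run as UTF-8. The function name and the Latin-1 assumption are introduced here for illustration, and as other comments in this timeline note, the approach cannot be fully reliable because such characters may also be legitimate text:

```python
import re

# Runs of characters whose code points look like non-ASCII byte values.
_BYTE_LIKE_RUN = re.compile("[\x80-\xff]+")


def redecode_byte_like_runs(text):
    """Heuristically repair runs of U+0080..U+00FF that are really UTF-8 bytes.

    Hypothetical helper. Runs that are not valid UTF-8 are left unchanged.
    """
    def fix(match):
        run = match.group()
        try:
            # Latin-1 turns each code point back into the byte of equal value.
            return run.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            return run

    return _BYTE_LIKE_RUN.sub(fix, text)


# Mixed string: proper Cyrillic code points followed by byte-like characters.
mixed = "\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba"
print(redecode_byte_like_runs(mixed))
```

A run such as "é" on its own is not valid UTF-8, so it passes through untouched, but a legitimate sequence that happens to form valid UTF-8 would be silently rewritten.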
Mar 23, 2012 at 20:53 comment added Mark Ransom @EtiennePerot, if you're starting with a UTF-8 byte sequence then please add it to the question. What you've shown us is a Unicode string which is NOT THE SAME!
Mar 23, 2012 at 20:53 answer added Karl Knechtel timeline score: 23
Mar 23, 2012 at 20:52 comment added Mark Ransom @NiklasB. is right - the UTF-8 encoded bytes are also valid Unicode codepoints so there's no way to tell what's what reliably.
Mar 23, 2012 at 20:51 comment added Etienne Perot All the bytes in the input data are UTF-8-encoded characters, so I think it is safe to assume that every sequence of bytes in the initial string can be safely decoded from UTF-8
Mar 23, 2012 at 20:50 vote accept Etienne Perot
Mar 23, 2012 at 20:58
Mar 23, 2012 at 20:48 comment added Niklas B. There's no reliable way to solve this because the input data doesn't contain enough information in the first place.
Mar 23, 2012 at 20:43 answer added kev timeline score: 5
Mar 23, 2012 at 20:42 answer added beerbajay timeline score: 12
Mar 23, 2012 at 20:05 history asked Etienne Perot CC BY-SA 3.0