Timeline for Bytes in a unicode Python string

Current License: CC BY-SA 3.0

20 events

when	what	by	license	comment
Nov 21, 2020 at 9:01	answer	added	Tahirhan		timeline score: 0
Nov 27, 2016 at 23:45	comment	added	ShreevatsaR		Just to be clear: the presence of `\xNN` escape sequences does not mean they're UTF-8 bytes. Python represents Unicode code points in the range 0 to FF by `\x` escape sequences (other than printable ASCII characters and `\n`, `\t`, etc.). See code here. Try this code: `for n in range(300): print hex(n), repr(unichr(n))`. For example, the character Ð (U+00D0) will be represented by `\xd0` rather than `\u00d0` even in a Unicode string.
Mar 24, 2012 at 1:03	answer	added	Mark Tolonen		timeline score: 5
Mar 23, 2012 at 22:11	answer	added	georg		timeline score: 13
Mar 23, 2012 at 21:39	comment	added	Etienne Perot		@thg435 Nah, that's just because I took a substring of a word to keep this example string short enough (Full string was `Стандартный Захват Контрольных Точек`)
Mar 23, 2012 at 21:39	comment	added	Karl Knechtel		@tchrist indeed, if the data is indeed on disk nominally UTF-8 encoded, OP may be looking at a hopefully rare case of double-UTF. ;)
Mar 23, 2012 at 21:35	comment	added	georg		BTW, "Русский ек" doesn't seem to be valid either, it probably should read "Русский язык" (=Russian language), so I guess there's more than that broken.
Mar 23, 2012 at 21:24	comment	added	Winston Ewert		I think your best bet is to figure out how such a crazy string was generated in the first place. Only then can you figure out the best way to fix it. You may be able to avoid modifying the code responsible, but you probably can't avoid understanding it.
Mar 23, 2012 at 21:22	vote	accept	Etienne Perot
Mar 23, 2012 at 21:20	comment	added	tchrist		What’s happening is that the second part of your string has been double-encoded, which causes it to appear to have a bunch of code points < 256, which, interpreted as UTF-8 bytes, give the right value.
Mar 23, 2012 at 20:58	comment	added	Etienne Perot		Well then, I'm not really sure what that string is anymore... It is an object whose representation starts with `u` (like unicode strings do) and which contains both `\uXXXX`'s (like unicode strings do) and `\xXX`'s (like byte strings do). All sequences of `\xXX`'s in the representation of the object also happen to be valid UTF-8 byte strings if they were byte strings (which they're not, because they're contained inside the unicode string). Not sure if that makes more sense, but I hope it does.
Mar 23, 2012 at 20:53	comment	added	Mark Ransom		@EtiennePerot, if you're starting with a UTF-8 byte sequence then please add it to the question. What you've shown us is a Unicode string which is NOT THE SAME!
Mar 23, 2012 at 20:53	answer	added	Karl Knechtel		timeline score: 23
Mar 23, 2012 at 20:52	comment	added	Mark Ransom		@NiklasB. is right - the UTF-8 encoded bytes are also valid Unicode codepoints so there's no way to tell what's what reliably.
Mar 23, 2012 at 20:51	comment	added	Etienne Perot		All the bytes in the input data are UTF-8-encoded characters, so I think it is safe to assume that every sequence of bytes in the initial string can be safely decoded from UTF-8
Mar 23, 2012 at 20:50	vote	accept	Etienne Perot
Mar 23, 2012 at 20:58
Mar 23, 2012 at 20:48	comment	added	Niklas B.		There's no reliable way to solve this because the input data doesn't contain enough information in the first place.
Mar 23, 2012 at 20:43	answer	added	kev		timeline score: 5
Mar 23, 2012 at 20:42	answer	added	beerbajay		timeline score: 12
Mar 23, 2012 at 20:05	history	asked	Etienne Perot	CC BY-SA 3.0
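
The comments above describe two things that are easy to confuse: `\xNN` in the representation of a Unicode string only means "code point below 0x100", and a double-encoded string is one where UTF-8 bytes were decoded as if each byte were its own code point. The following is a minimal sketch of both points, not a reconstruction of any answer in the thread. It is written for Python 3 (the thread's own snippets are Python 2), `repair_double_encoded` is a hypothetical helper name, and it assumes that every run of characters in the range U+0080 to U+00FF that happens to be valid UTF-8 really was double-encoded. As noted in the comments, the data itself cannot confirm that assumption.

```python
import re

# In an ASCII-only representation, code points below 0x100 use \x escapes
# and larger ones use \u escapes. Neither form says anything about bytes.
print(ascii("\u00d0"))  # '\xd0'
print(ascii("\u0420"))  # '\u0420'


def repair_double_encoded(text):
    """Re-decode runs of U+0080..U+00FF as UTF-8 (hypothetical helper).

    Assumption: such runs are UTF-8 bytes that were decoded one byte per
    code point (as Latin-1 does). Runs that are not valid UTF-8 are left
    unchanged.
    """
    def fix(match):
        run = match.group(0)
        try:
            # Latin-1 maps code points 0..255 back to the same byte values.
            return run.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            return run  # probably genuine Latin-1-range text

    return re.sub("[\u0080-\u00ff]+", fix, text)


# Build a mixed string: a correctly decoded part plus a double-encoded part.
good = "Русский "
broken = "язык".encode("utf-8").decode("latin-1")
mixed = good + broken

print(repair_double_encoded(mixed))
```

The sketch is a heuristic: a string that legitimately contains a sequence such as `Ã©` would also be rewritten, which is the ambiguity raised in the comments. Fixing whatever produced the mixed string is the more reliable route.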
