Grapheme clusters, a.k.a. real characters

Terry Reedy tjreedy at udel.edu
Fri Jul 14 17:12:10 EDT 2017


On 7/14/2017 10:30 AM, Michael Torrie wrote:
> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>>
>> As it stands, we have
>>
>>    è --[encode>-- Unicode --[reencode>-- UTF-8
>>
>> Why is one encoding format better than the other?

All digital data are ultimately bits, usually collected together in
groups of 8, called bytes. The point of Python 3 is that text should
normally be instances of a text class, separate from the raw bytes
class, with a defined internal encoding. The actual internal encoding
is secondary. And it changed in 3.3.

Python ints are encoded as bytes, as are floats and everything else.
When one prints a float, one certainly does not see a representation of
the raw bytes in the float object. Instead, one sees a representation of
the value it represents. There is a proposal to change the internal
encoding of int, at least on 64-bit machines, which are now standard.
However, because print(87987282738472387429748) prints
87987282738472387429748 and not the internal bytes, the change in the
internal bytes will not affect the user view of ints.

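To make the distinction concrete, here is a small sketch in Python 3 of
value representation versus raw bytes. The byte layouts shown are only
illustrations: struct.pack gives the IEEE 754 encoding of the float, and
int.to_bytes gives one possible big-endian encoding of the int. Neither
is claimed to be CPython's actual in-memory object layout.

```python
import struct

x = 0.1
# What print() shows: a representation of the value.
shown = repr(x)
# The same value as an 8-byte IEEE 754 double (big-endian for readability).
raw = struct.pack(">d", x)

n = 87987282738472387429748
int_shown = str(n)
# One possible byte encoding of n; an assumption for illustration only,
# not the interpreter's internal digit array.
int_raw = n.to_bytes((n.bit_length() + 7) // 8, "big")

print(shown)          # the value, as the user sees it
print(raw.hex())      # the bytes behind the float
print(int_shown)      # the value, as the user sees it
print(int_raw.hex())  # some bytes that encode the same int
```

Changing how int_raw is laid out would not change what print(n) shows,
which is the point made above.
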
> This is precisely the logic behind Google using UTF-8 for strings in Go,
> rather than having some O(1) abstract type like Python has. And many
> other languages do the same. The argument is that because of the very
> issues that you mention, having O(1) lookup in a string isn't that
> important, since looking up a particular index in a unicode string is
> rarely the right thing to do, so UTF-8 is just fine as a native,
> in-memory type.

Does Go use bytes for text, like most people did in Python 2, or a
separate text string class that hides the internal encoding format and
implementation? In other words, if you do the equivalent of print(s)
where s is a text string with a mixture of Greek, Cyrillic, Hindi,
Chinese, Japanese, and Korean chars, do you see the characters, or some
representation of the internal bytes?
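
For comparison, this is roughly what the question looks like in
Python 3. The sample string is made up for illustration, and UTF-8 is
just one encoding chosen to produce a bytes object; it says nothing
about how str is stored internally.

```python
# A text string mixing several scripts (sample text chosen arbitrarily).
s = "αβγ где हिन्दी 中文 日本語 한국어"

# The same text encoded to raw bytes with one particular encoding.
encoded = s.encode("utf-8")

print(s)        # shows the characters
print(encoded)  # shows a representation of the bytes: b'\xce\xb1...'

# The text class counts code points; the bytes class counts bytes.
print(len(s), len(encoded))
```

Printing the str shows characters regardless of the internal encoding;
printing the bytes object shows escaped byte values.
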
-- 
Terry Jan Reedy

