[Python-Dev] UCS2/UCS4 default

Thu Jul 3 18:45:39 CEST 2008

On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 15:58], Guido van Rossum (guido at python.org) wrote:
>> You seem to be suggesting that len(u"\U00012345") should return 1 on
>> a system that internally uses UTF-16 and hence represents this string
>> as a surrogate pair.
>
> From a Unicode and UTF-16 point of view that makes the most sense. So yes, I
> am suggesting that.

I think this is misguided.

IMO, basically every programming language gets string handling wrong.
(maybe with the exception of the unreleased perl6? it had some 
interesting moves in this area, but I haven't really been paying 
attention.) Everyone treats strings as arrays, but they are used quite 
differently. For a string, there is hardly ever a time when a 
programmer needs to index it with an arbitrary offset in number of 
codepoints, and the length-in-codepoints is pretty non-useful as well. 
Constant-time access to arbitrary codepoints in a string is pretty 
much unimportant. What *is* of utmost importance is constant-time
access to previously-returned points in the string.

I'd like to have 3 levels of access available:
1) "byte"-level. In a new implementation I'd probably choose to make 
all my strings stored in UTF-8, but UTF-16 is fine too.
2) codepoint-level.
3) grapheme-level.

You should be able to iterate over the string at any of the levels,
ask for the nearest codepoint/grapheme boundary to the left or right 
of an index at a different level, etc.
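
A rough sketch of the three levels, assuming Python 3.4 or later (where str is indexed by codepoint, so the UTF-16 level has to be simulated by encoding). The helper names utf16_units and simple_graphemes are hypothetical, invented for this illustration, and the grapheme splitter only attaches combining marks to the preceding base character, which is far cruder than real grapheme segmentation as defined in UAX #29:

```python
import unicodedata


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def simple_graphemes(s):
    """Yield approximate graphemes: a base character plus trailing combining marks.

    Assumption: this is a simplification, not full UAX #29 segmentation.
    """
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            # A non-combining character starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


s = "A\N{COMBINING TILDE}\U00012345"

print(len(s.encode("utf-8")))          # 7 UTF-8 bytes
print(len(utf16_units(s)))             # 4 UTF-16 code units
print(len(s))                          # 3 codepoints
print(len(list(simple_graphemes(s))))  # 2 (approximate) graphemes
```

The same string has four different "lengths" depending on the level you ask at, which is why a single len() cannot please everyone.
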
Python could probably still be made to work kinda like this. I think a 
language designed as such in the first place could be nicer, with 
opaque index objects into the string rather than integers, and such, 
but...whatever.

Let's assume Python is changed to always store strings in UTF-16.
All it would take is adding a few more functions to the str object to 
operate on the higher levels. Wherever I say "pos" I mean an integer 
index into the string, at the UTF-16 level. That may sometimes be 
unaligned with the boundary of the representation you're asking about, 
and behavior in that case needs to be specified as well.

.nextcodepoint(curpos, how_many=1) -> returns an index into the string
how_many codepoints to the right (or left if negative) of the index
curpos.

.nextgrapheme(curpos, how_many=1) -> returns an index into the string
how_many graphemes to the right (or left if negative) of the index
curpos.

.codepoints(from_pos=0, to_pos=None) -> returns an iterator of
codepoints from 'from_pos' to 'to_pos'. I think codepoints could be
represented as strings themselves (so usually one-character, sometimes
two-character strings).

.graphemes(from_pos=0, to_pos=None) -> returns an iterator of graphemes
from 'from_pos' to 'to_pos'. Also could be represented by strings. The
returned graphemes should probably be normalized.
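
As an illustration only, the two codepoint-level operations could look roughly like this when written as free functions over a list of UTF-16 code units. This assumes Python 3.4 or later; utf16_units, next_codepoint and codepoints are hypothetical names, not existing str methods. The unaligned case is resolved here by one possible rule, chosen purely as an assumption: a position in the middle of a surrogate pair moves to the nearest boundary in the direction of travel, and that counts as one step.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def _is_pair(units, pos):
    """True if units[pos] is a high surrogate followed by a low surrogate."""
    return (pos + 1 < len(units)
            and 0xD800 <= units[pos] <= 0xDBFF
            and 0xDC00 <= units[pos + 1] <= 0xDFFF)


def next_codepoint(units, curpos, how_many=1):
    """Move how_many codepoints right (or left if negative); clamp at the ends."""
    pos = curpos
    for _ in range(abs(how_many)):
        if how_many > 0:
            if pos >= len(units):
                break
            pos += 2 if _is_pair(units, pos) else 1
        else:
            if pos <= 0:
                break
            pos -= 2 if pos >= 2 and _is_pair(units, pos - 2) else 1
    return pos


def codepoints(units, from_pos=0, to_pos=None):
    """Yield each codepoint between from_pos and to_pos as a Python str."""
    if to_pos is None:
        to_pos = len(units)
    pos = from_pos
    while pos < to_pos:
        end = min(next_codepoint(units, pos), to_pos)
        raw = b"".join(u.to_bytes(2, "little") for u in units[pos:end])
        yield raw.decode("utf-16-le", "surrogatepass")
        pos = end


units = utf16_units("a\U00012345b")   # [0x61, 0xD808, 0xDF45, 0x62]
print(next_codepoint(units, 1))       # 3: skips the whole surrogate pair
print(next_codepoint(units, 4, -2))   # 1: two codepoints to the left
print(next_codepoint(units, 2))       # 3: unaligned start, snaps forward
print(list(codepoints(units)))        # ['a', '\U00012345', 'b']
```

Each step only inspects the neighbouring code unit, so stepping from a previously returned position is constant-time. Note that on Python 3 every yielded string has length 1; under the UTF-16 storage assumed above it would be one or two code units long.
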
There are a few more desirable operations, to manipulate strings at 
the grapheme level (because unlike for UTF-8/UTF-16 codepoints, 
graphemes don't have the nice property of not containing prefixes 
which are themselves valid graphemes). So, you want a find (and 
everything else that implicitly does a find operation, like split, 
replace, strip, etc) which requires that both endpoints of its match 
are on a grapheme-boundary. [[Probably the easiest way to implement 
this would be in the regexp engine.]]

A concrete example of that:
u'A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}'.find(u'A\N{COMBINING TILDE}')
returns 0. But you want a way to ask for only an *actual* "A with tilde",
not an "A with tilde and macron".
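
A minimal sketch of such a boundary-aware find, assuming Python 3. The name find_grapheme is hypothetical, and the boundary test is a deliberate simplification: a position counts as a grapheme boundary if it is at either end of the string or is not directly followed by a combining mark. Real grapheme segmentation (UAX #29) has more rules than this.

```python
import unicodedata


def find_grapheme(haystack, needle, start=0):
    """Like str.find, but reject matches that split an (approximate) grapheme."""

    def is_boundary(pos):
        # Assumption: only combining marks extend a grapheme.
        return (pos == 0 or pos == len(haystack)
                or unicodedata.combining(haystack[pos]) == 0)

    pos = haystack.find(needle, start)
    while pos != -1:
        if is_boundary(pos) and is_boundary(pos + len(needle)):
            return pos
        # This match cut through a grapheme; look for a later one.
        pos = haystack.find(needle, pos + 1)
    return -1


a_tilde = "A\N{COMBINING TILDE}"
s = "A\N{COMBINING TILDE}\N{COMBINING MACRON BELOW}"

print(s.find(a_tilde))                          # 0: plain find matches a prefix
print(find_grapheme(s, a_tilde))                # -1: no standalone "A with tilde"
print(find_grapheme("x" + a_tilde + " y", a_tilde))  # 1
```
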
Anyhow, I'm not going to tackle this issue or try to push it further,
but if someone does tackle it, Python could grow to have the best
Unicode support available. :)

James