HyperHacker wrote:
[...]
> I do think a simple UTF-8 library would be quite a good thing to have
> - basically just have all of Lua's string methods, but operating on
> characters instead of bytes.

What do you mean by a 'character'? A Unicode code point? A grapheme cluster?

If you split the string on code points you'll end up breaking grapheme clusters in the middle, which will break any combining characters. If you split the string on grapheme clusters you'll preserve the ability to do random access into the string, but your string manipulation library now becomes hideously heavyweight: grapheme clusters can be *any length* (although there seems to be a promise that normalised Unicode won't have any grapheme clusters longer than 32 code points).

The standard intuition that strings are made up of an array of characters is, unfortunately, not really true in Unicode. It's basically not possible to do random access into a Unicode string without jumping through painful hoops.

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup
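P.S. To make the code point versus grapheme cluster distinction concrete, here is a minimal sketch. It is in Python rather than Lua, only because Python strings already index by code point and the standard library exposes combining classes. The helper `naive_clusters` is a hypothetical name, and it is a deliberate simplification: it only glues combining marks onto the preceding base character, whereas the real segmentation rules (UAX #29) also cover Hangul, emoji sequences and more.

```python
import unicodedata


def naive_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption/simplification: this is NOT full grapheme cluster
    segmentation; it only handles code points with a nonzero
    canonical combining class.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the previous cluster.
            clusters[-1] += ch
        else:
            # Base character: start a new cluster.
            clusters.append(ch)
    return clusters


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT, which renders as "café".
word = "cafe\u0301"

utf8_bytes = word.encode("utf-8")  # what a byte-oriented string library sees
code_points = list(word)           # what a code point library sees
clusters = naive_clusters(word)    # closer to what a reader sees

# Reversing by code point detaches the accent from its "e".
reversed_by_code_point = "".join(reversed(code_points))

# Reversing by cluster keeps "e" + U+0301 together.
reversed_by_cluster = "".join(reversed(clusters))

print(len(utf8_bytes), len(code_points), len(clusters))
```

The same string has three different "lengths" depending on the unit chosen, and only the cluster-based reversal leaves the accent on the "e". Note that finding the clusters required a linear scan from the start of the string, which is the random access problem described above.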