On 09/02/12 18:37, Roberto Ierusalimschy wrote:
[...]
> utf8.codepoint(s, i, j) -> code points in s from *byte* offset i to j
> (default i=1, j=i); i adjusts backward and j adjusts forward until a
> proper frontier. (It might be useful another function to return a table
> with those code points; {utf8.codepoint(s, 1, -1)} may be too heavy.)
The primitive I use most when dealing with Unicode is 'given a byte
offset i, get me the next code point and advance i accordingly'. This
nearly does that, but not quite: it returns the code points themselves,
without also telling me the byte offset where the next one starts.
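For what it's worth, that primitive is a one-liner on top of the proposed
functions, assuming a utf8.offset(s, n, i) that returns the byte position
of the n-th code point counting from offset i (so this is a sketch
against the proposal, not something in stock Lua today):

```lua
-- Sketch: decode the code point at byte offset i and return it
-- together with the byte offset of the following code point.
-- Assumes the proposed utf8.codepoint and a utf8.offset helper.
local function next_codepoint(s, i)
  local cp = utf8.codepoint(s, i)      -- code point starting at byte i
  local j = utf8.offset(s, 2, i)       -- byte offset of the next code point
  return cp, j
end
```

A loop over a string then becomes: start at i=1, call next_codepoint,
and stop when the returned offset runs past #s.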
In hindsight, what I *should* have been using was 'given a byte offset
i, get me the next *grapheme cluster* (as a string) and advance i
accordingly'. Unfortunately, while I do know there are rules for
automatically determining grapheme cluster boundaries, I suspect they're
too heavy for this kind of low-level stuff.
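The gap between the two primitives is easy to demonstrate: a base
character plus a combining mark is one grapheme cluster but two code
points, so a per-code-point iterator splits what the user sees as a
single character. A small sketch, again assuming the proposed
utf8.codepoint is available (the bytes \204\129 are UTF-8 for U+0301,
COMBINING ACUTE ACCENT):

```lua
-- "é" written as 'e' followed by U+0301: one grapheme cluster,
-- two code points. Iterating by code point yields both separately.
local s = "e\204\129"
local first = utf8.codepoint(s, 1)   -- 0x65, LATIN SMALL LETTER E
local second = utf8.codepoint(s, 2)  -- 0x301, COMBINING ACUTE ACCENT
```

The full boundary rules (base + marks, Hangul jamo sequences, and so on)
are specified in Unicode's UAX #29, which is where the "too heavy"
suspicion comes from.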
Incidentally, just for fun, I recently found this grapheme cluster,
which appears to be the longest one usable in real life:
U+0f67 U+0f90 U+0fb5 U+0fa8 U+0fb3 U+0fba U+0fbc U+0fbb U+0f82
It's the Tibetan symbol HAKṢHMALAWARAYAṀ, and looks like this (if you're
lucky):
ཧྐྵྨླྺྼྻྂ
--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│
│ "Never attribute to malice what can be adequately explained by
│ stupidity." --- Nick Diamos (Hanlon's Razor)