On 01/04/11 17:37, Marc Balmer wrote:
[...]
> On the C level, quite a lot. strlen() and friends can no longer be
> used, printf format strings like "%20s" don't work anymore etc. Not to
> speak about string comparison, collation etc. Since I am not familiar
> with LPeg's implementation, that is about all I can say.

Determining the length of a Unicode string is a pretty fuzzy concept
anyway --- AFAIK the only way to do it is to break it up into grapheme
clusters and determine the size of each grapheme cluster individually
(which may vary according to font).

I tend to use a cheap and nasty mechanism for console applications that
assumes that each code point is a grapheme cluster, and then uses a set
of rules to decide whether each one is of width 1 or 2. This works most
of the time but not all of the time. See:

http://wordgrinder.hg.sourceforge.net/hgweb/wordgrinder/wordgrinder/file/f658d1e8f1f3/src/c/emu/wcwidth.c

What I'd like from LPeg is a set of primitives for matching a single
code point and a single grapheme cluster (treating them as Lua strings,
i.e. sequences of bytes). This would allow easier parsing of UTF-8
strings.

The collation stuff might be useful, but not only is it hideously
complicated and dependent on massive tables, I've never actually found
a need for it, so I'd be willing to live without it.

-- 
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup
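
For illustration, here is a minimal sketch in C of the cheap console-width approach described in the message: decode one UTF-8 code point at a time, treat each code point as its own grapheme cluster, and apply a small rule to pick width 1 or 2. The function names (utf8_decode, codepoint_width, display_width) are hypothetical, and the wide ranges are a rough hand-picked subset chosen for this example. They are not the table from the linked wcwidth.c. The decoder does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 code point from s (n bytes available).
 * Stores it in *cp and returns the number of bytes consumed,
 * or 0 if the input is truncated or malformed.
 * Sketch only: overlong encodings and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }           /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                     /* stray continuation or bad lead */

    if (n < len) return 0;             /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Assumed width rule: 2 for a few East Asian wide ranges, otherwise 1.
 * Combining marks are (wrongly) counted as width 1. */
int codepoint_width(uint32_t cp)
{
    if ((cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo leading consonants */
        (cp >= 0x2E80 && cp <= 0xA4CF) ||   /* CJK radicals through Yi (rough) */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||   /* CJK compatibility ideographs */
        (cp >= 0xFF00 && cp <= 0xFF60) ||   /* fullwidth forms */
        (cp >= 0xFFE0 && cp <= 0xFFE6))     /* fullwidth signs */
        return 2;
    return 1;
}

/* Sum the widths of all code points in the n-byte string s.
 * Returns -1 if the bytes are not decodable as UTF-8. */
long display_width(const char *s, size_t n)
{
    long width = 0;
    size_t i = 0;

    while (i < n) {
        uint32_t cp;
        size_t used = utf8_decode((const unsigned char *)s + i, n - i, &cp);
        if (used == 0) return -1;
        width += codepoint_width(cp);
        i += used;
    }
    return width;
}
```

Under these assumptions, display_width("e\xCC\x81", 3) counts "e" plus a combining acute accent as two columns even though a terminal would normally draw it in one, which is an instance of the "not all of the time" caveat.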