Hi Patrick!
I did quite a bit of experimenting with tiny UTF-8 handling here:
http://eonblast.com/trucount/
http://www.eonblast.com/trucount/lua-count-patch-0.1.tgz
I was mainly concerned with getting the length.
For what it's worth, I ended up with this to count string length:
/* UTF-8 count */
case LUA_TSTRING: {
  unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
  unsigned char *q = p + tsvalue(rb)->len;
  size_t count = 0;
  /* count all lead bytes, i.e. every byte that is not 10xxxxxx */
  while (p < q) if ((*p++ & 0xC0) ^ 0x80) count++;
  setnvalue(ra, cast_num(count)); break;
}
The rationale is spread out across this mailing list. Basically, corrupt
UTF-8 should be allowed to have undefined results.
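In case anyone wants to try the counting loop outside the VM, here is a minimal standalone sketch of the same idea. The function name utf8_count and its signature are made up for illustration and are not part of the patch. Like the patch, it assumes well-formed UTF-8 and makes no promise about the result for corrupt input.

```c
#include <stddef.h>

/* Hypothetical standalone helper (not part of the patch).
 * Counts the UTF-8 characters in the first len bytes of s by counting
 * every byte that is not a continuation byte (bit pattern 10xxxxxx).
 * ASCII bytes and multi-byte lead bytes each count once.
 * The input is assumed to be well-formed UTF-8. */
static size_t utf8_count(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *q = p + len;
    size_t count = 0;

    while (p < q) {
        /* (byte & 0xC0) == 0x80 marks a continuation byte; skip those */
        if ((*p++ & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For example, utf8_count("\xC3\xA9", 2) sees one lead byte and one continuation byte and returns 1, while plain ASCII input returns its byte length.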
Best,
Henning
Patrick Rapin wrote:
Essentially as an exercise, I tried to write the smallest possible
UTF-8 encoder in Lua [1].
Compared to a naive implementation like in [2], it is around 2.6 times
shorter.
Still, I am wondering if the code could be shortened further (not
counting space removal).
[1] https://gist.github.com/b0ae016da7b8f0b221ff
[2] http://lwn.net/Articles/493167/ (and that implementation doesn't
handle 4-byte codes)