Re: Small UTF-8 encoder

Subject: Re: Small UTF-8 encoder
From: "H. Diedrich" <hd2010@...>
Date: 19 Jun 2012 18:46:39 +0200

Hi Patrick!

I did quite a bit of experimenting with tiny UTF-8 handling here:

http://eonblast.com/trucount/
http://www.eonblast.com/trucount/lua-count-patch-0.1.tgz

I was mainly concerned with getting the length.

For what it's worth, I ended up with this to count the string length:

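/* UTF-8 count: string length in code points rather than bytes */
case LUA_TSTRING: {
  unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
  unsigned char *q = p + tsvalue(rb)->len;  /* one past the last byte */
  size_t count = 0;
  /* count every byte that is not a continuation byte (10xxxxxx),
     i.e. ASCII bytes and multi-byte lead bytes */
  while (p < q) if ((*p++ & 0xC0) ^ 0x80) count++;
  setnvalue(ra, cast_num(count));
  break;
}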

The rationale is spread out across this mailing list. Basically, corrupt UTF-8 should be allowed to have undefined results, so the loop does no validation at all.
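Outside the Lua VM, the same counting idea can be sketched as a standalone C function. This is only an illustration, not part of the patch: the name utf8_count and its signature are made up for this example, and it deliberately does no validation, so malformed input simply yields whatever the byte pattern happens to produce.

```c
#include <stddef.h>

/* Hypothetical helper (not from the patch): count UTF-8 code points in
   the first len bytes of s by counting every byte that is NOT a
   continuation byte (bit pattern 10xxxxxx). No validation is done. */
static size_t utf8_count(const char *s, size_t len) {
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *q = p + len;   /* one past the last byte */
    size_t count = 0;
    while (p < q) {
        if ((*p++ & 0xC0) != 0x80)      /* ASCII byte or lead byte */
            count++;
    }
    return count;
}
```

With well-formed input this gives the number of code points, e.g. 5 for the 6-byte string "h\xC3\xA9llo". With corrupt input the result is merely "some number": a lone continuation byte is not counted at all, while a truncated lead byte is counted as a full character.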

Best,
Henning

Patrick Rapin wrote:

Essentially as an exercise, I tried to write the smallest possible
UTF-8 encoder in Lua [1].
Compared to a naive implementation like the one in [2], it is around 2.6 times shorter.
Still, I am wondering if the code could be shortened further (not
counting space removal).

[1] https://gist.github.com/b0ae016da7b8f0b221ff
[2] http://lwn.net/Articles/493167/ (and that implementation doesn't
handle 4-byte codes)
