
UTF-8 testing



I am using this text [1] to test UTF-8 character counting.

Does anybody know how to get an authoritative count of how many characters that text actually contains, not counting any invalid sequences, should there be any in it?

I am using this primitive counting mechanism, inspired by [2]. Suggestions for improvement are welcome.

Does size_t make sense?

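/* UTF-8 estimate: count every byte that starts a character */
unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
unsigned char *q = p + tsvalue(rb)->len;
size_t count = 0;
while (p < q) {
  /* ASCII (0..127) or a multi-byte lead byte (194..244);
     continuation bytes (128..191) are skipped */
  if (*p <= 127 || (*p >= 194 && *p <= 244)) /* this can be reversed */
    count++;
  p++;
}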

The above is off by 2 characters for the sample text. I am looking for the cause of the discrepancy.
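
A lead-byte test alone counts any byte in 194..244 as a character, even when the continuation bytes that should follow are missing or out of range, so malformed sequences in a test text could shift the result. To check whether that is what happens here, a validating counter can report well-formed sequences and leftover bytes separately. The following is only a minimal sketch under that assumption: utf8_count is a hypothetical standalone helper (not part of Lua), and the byte ranges follow the well-formed UTF-8 table of the Unicode Standard. size_t is used for both results because neither can exceed the byte length.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Lua.
   Returns the number of well-formed UTF-8 sequences in s[0..len) and
   stores in *invalid the number of bytes that belong to no such sequence. */
static size_t utf8_count(const unsigned char *s, size_t len, size_t *invalid)
{
  size_t count = 0, bad = 0, i = 0;
  while (i < len) {
    unsigned char c = s[i];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 1st continuation byte */
    size_t need, k;                     /* need = continuation bytes expected */
    int ok;
    if (c <= 0x7F) { count++; i++; continue; }      /* ASCII */
    else if (c >= 0xC2 && c <= 0xDF) need = 1;
    else if (c >= 0xE0 && c <= 0xEF) {
      need = 2;
      if (c == 0xE0) lo = 0xA0;        /* rejects overlong forms */
      else if (c == 0xED) hi = 0x9F;   /* rejects surrogates U+D800..U+DFFF */
    }
    else if (c >= 0xF0 && c <= 0xF4) {
      need = 3;
      if (c == 0xF0) lo = 0x90;        /* rejects overlong forms */
      else if (c == 0xF4) hi = 0x8F;   /* rejects values above U+10FFFF */
    }
    else { bad++; i++; continue; }     /* stray continuation, 0xC0, 0xC1, 0xF5..0xFF */

    ok = (len - i > need);             /* sequence must not be truncated */
    for (k = 1; ok && k <= need; k++) {
      unsigned char cb = s[i + k];
      if (k == 1) ok = (cb >= lo && cb <= hi);
      else        ok = (cb >= 0x80 && cb <= 0xBF);
    }
    if (ok) { count++; i += need + 1; }
    else    { bad++; i++; }            /* drop the lead byte, resync at the next byte */
  }
  *invalid = bad;
  return count;
}
```

With the variables from the snippet above, this could be called as utf8_count(p, tsvalue(rb)->len, &bad) before the loop advances p. If bad comes back non-zero, the difference between the two counters would point at malformed sequences in the text rather than at the counting loop.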

Thanks,
Henning

[1] https://gist.github.com/768309
[2] http://lua-users.org/wiki/LuaUnicode
