I am using this text [1] to test UTF-8
character counting.
Does anybody know how to get an authoritative count of how many
characters that text should actually contain? Minus possibly invalid
ones, should there be any in that text.
I am using this primitive counting mechanism, inspired by [2].
Proposals for improvement are welcome.
Does size_t make sense for the count?
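/* UTF-8 estimate: count every byte that can start a character */
unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
unsigned char *q = p + tsvalue(rb)->len;
size_t count = 0;
while (p < q) {
  /* ASCII (0..127) or a multi-byte lead byte (194..244) */
  if (*p <= 127 || (*p >= 194 && *p <= 244)) /* this can be reversed */
    count++;
  p++;
}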
The above is off by 2 characters on the sample text. I am looking for
the cause of the discrepancy.
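
One way to narrow down a discrepancy like this might be to run the same
bytes through a strict decoder and see how many bytes do not belong to any
well-formed sequence. The following is only a sketch under assumptions:
utf8_count and its invalid out-parameter are hypothetical names, not part
of Lua, and the byte ranges follow the well-formed sequence rules of
RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Lua.
 * Returns the number of well-formed UTF-8 sequences in s[0..len).
 * If invalid is non-NULL, *invalid receives the number of bytes that
 * did not belong to any well-formed sequence. */
size_t utf8_count(const unsigned char *s, size_t len, size_t *invalid)
{
    size_t count = 0, bad = 0, i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range of 1st continuation byte */
        size_t need, k;                     /* need = continuation bytes expected */
        int ok;

        if (c <= 0x7F) {
            need = 0;                       /* ASCII */
        } else if (c >= 0xC2 && c <= 0xDF) {
            need = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (c == 0xED) hi = 0x9F;       /* reject surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (c == 0xF4) hi = 0x8F;       /* reject values above U+10FFFF */
        } else {
            bad++;                          /* stray continuation byte, */
            i++;                            /* 0xC0, 0xC1 or 0xF5..0xFF */
            continue;
        }

        ok = (need < len - i);              /* sequence must not be truncated */
        for (k = 1; ok && k <= need; k++) {
            unsigned char t = s[i + k];
            if (t < lo || t > hi)
                ok = 0;
            lo = 0x80;                      /* only the first continuation */
            hi = 0xBF;                      /* byte has a restricted range */
        }

        if (ok) {
            count++;
            i += need + 1;
        } else {
            bad++;                          /* count the lead byte as invalid */
            i++;                            /* and resynchronise on the next byte */
        }
    }

    if (invalid != NULL)
        *invalid = bad;
    return count;
}
```

If the invalid count comes back non-zero for the sample text, malformed
sequences would be a plausible source of the difference; if it is zero,
the estimate and the strict count should agree and the reference figure
itself may be the thing to question.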
Thanks,
Henning
[1] https://gist.github.com/768309
[2] http://lua-users.org/wiki/LuaUnicode