UTF-8 testing


Subject: UTF-8 testing
From: Henning Diedrich <hd2010@...>
Date: 2011-01-06 19:40:08 +0100

I am using this text [1] to test UTF-8 character counting.

Does somebody know how to get an authoritative count of how many characters that should actually be? Minus possible invalid ones, should there be any in that text?
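One possible way to get a stricter reference count is to decode the text and count only well-formed sequences, tallying everything else as invalid bytes. The following is a minimal sketch under assumptions, not an authoritative implementation: the name utf8_count_strict is hypothetical, and the policy of counting each byte of a malformed sequence as one invalid byte is only one of several reasonable choices.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of well-formed UTF-8 sequences
   in s[0..len). If invalid is not NULL, it receives the number of bytes
   that are not part of any well-formed sequence. */
static size_t utf8_count_strict(const unsigned char *s, size_t len,
                                size_t *invalid)
{
    size_t count = 0, bad = 0, i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range of 1st continuation byte */
        size_t need, k;                     /* continuation bytes expected */
        int ok = 1;

        if (c <= 0x7F) {                    /* ASCII */
            count++; i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            need = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;       /* rejects overlong forms */
            if (c == 0xED) hi = 0x9F;       /* rejects surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;       /* rejects overlong forms */
            if (c == 0xF4) hi = 0x8F;       /* rejects values above U+10FFFF */
        } else {                            /* 0x80..0xC1 or 0xF5..0xFF */
            bad++; i++;
            continue;
        }

        for (k = 1; k <= need; k++) {
            unsigned char t, l, h;
            if (i + k >= len) { ok = 0; break; }   /* truncated sequence */
            t = s[i + k];
            l = (k == 1) ? lo : 0x80;
            h = (k == 1) ? hi : 0xBF;
            if (t < l || t > h) { ok = 0; break; }
        }

        if (ok) { count++; i += need + 1; }
        else    { bad++; i++; }             /* resynchronise at the next byte */
    }

    if (invalid) *invalid = bad;
    return count;
}
```

With this policy, an overlong or surrogate sequence such as ED A0 80 adds three invalid bytes and no characters, so the result can be compared against a simpler estimate to see where the two disagree.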

I am using this primitive counting mechanism, inspired by [2]. Proposals for improvement are welcome.

Does size_t make sense for the count?

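/* UTF-8 estimate: count ASCII and lead bytes, skip continuation bytes */
unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
unsigned char *q = p + tsvalue(rb)->len;
size_t count = 0;
while (p < q) {
    /* 0..127 is ASCII; 194..244 starts a multi-byte sequence */
    if (*p <= 127 || (*p >= 194 && *p <= 244)) /* this can be reversed */
        count++;
    p++;
}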

The above comes within 2 characters of the count for the sample text. I am looking for the cause of the discrepancy.
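To narrow down where a difference like this comes from, it may help to tally the bytes by class. The estimate above counts ASCII bytes plus bytes in 194..244, whereas the other common shortcut counts every byte that is not a continuation byte; the two differ exactly by the bytes that can never appear in valid UTF-8 (0xC0, 0xC1, 0xF5..0xFF). Whether that explains the 2 characters here is not known. The sketch below is only a diagnostic under that assumption, and the names utf8_tally and utf8_classify are hypothetical.

```c
#include <stddef.h>

/* Hypothetical diagnostic record: how many bytes fall in each class. */
struct utf8_tally {
    size_t ascii;         /* 0x00..0x7F */
    size_t lead;          /* 0xC2..0xF4, bytes that may start a sequence */
    size_t continuation;  /* 0x80..0xBF */
    size_t never_valid;   /* 0xC0, 0xC1, 0xF5..0xFF */
    size_t expected_cont; /* continuation bytes implied by the lead bytes */
};

/* Classifies every byte of s[0..len) without validating sequences. */
static struct utf8_tally utf8_classify(const unsigned char *s, size_t len)
{
    struct utf8_tally t = {0, 0, 0, 0, 0};
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = s[i];
        if (c <= 0x7F) {
            t.ascii++;
        } else if (c <= 0xBF) {
            t.continuation++;
        } else if (c >= 0xC2 && c <= 0xF4) {
            t.lead++;
            /* 0xC2..0xDF: 1, 0xE0..0xEF: 2, 0xF0..0xF4: 3 */
            t.expected_cont += (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
        } else {
            t.never_valid++;
        }
    }
    return t;
}
```

Here ascii + lead matches the estimate above, and ascii + lead + never_valid matches the "not a continuation byte" shortcut. If continuation differs from expected_cont, the text contains stray or missing continuation bytes, which neither shortcut detects.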

Thanks,
Henning

[1] https://gist.github.com/768309
[2] http://lua-users.org/wiki/LuaUnicode

Follow-Ups:
- Re: UTF-8 testing, Eero Pajarre
- Re: UTF-8 testing, Paul Hudson
- Re: UTF-8 testing, Sean Conner
