lua-users home
lua-l archive

Re: UTF-8 testing

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Sean Conner <sean@conman.org> writes:
> Here's a way of determining the length of a UTF-8 string; it assumes a
> valid UTF-8 string to begin with:
>
> static const char m_trailingbytes[256] =
> {
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
> };
>
> size_t utflen(const char *st,size_t size)
> {
...
> l = m_trailingbytes[*s] + 1; 
Hmm, one could store the table indexed by "byte >> 2" as well, to
save some space:
 static const char m_trailingbytes[64] =
 {
 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 1,1,1,1, 1,1,1,1, 2,2,2,2, 3,3,4,5
 };
 size_t utflen(const char *st,size_t size)
 {
...
 l = m_trailingbytes[(unsigned)*s >> 2] + 1; 
[right?]
-miles
-- 
Religion, n. A daughter of Hope and Fear, explaining to Ignorance the nature
of the Unknowable.

AltStyle によって変換されたページ (->オリジナル) /