Re: Of Unicode in the next Lua version


Subject: Re: Of Unicode in the next Lua version
From: Pierre-Yves Gérardy <pygy79@...>
Date: Sat, 15 Jun 2013 20:13:31 +0200

On Sat, Jun 15, 2013 at 3:52 PM, Roberto Ierusalimschy
<roberto@inf.puc-rio.br> wrote:
>
> You can already easily implement this 'getchar' in standard Lua (except
> that it assumes a well-formed string):
>
> S = "∂ƒ"
> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1)) --> '∂', 4
> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4)) --> 'ƒ', 6
> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6)) --> nil

Thanks for this pattern trick, it helped me improve my
`getcodepoint()` routine (although I eventually found a faster
method). A validation routine won't be of much help
if you deal with a document that sports multiple encodings.

If you want to validate the characters on the go and always get a position as
the second return value, you need something like the function below.
If the character is not valid, it returns `false, position`. At the
end of the stream, it returns `nil, position + 1`.

 local s_byte, s_match, s_sub = string.byte, string.match, string.sub

 function getchar(S, first)
   if #S < first then
     return nil, first
   end
   -- One non-continuation byte followed by any continuation bytes.
   local match, next = S:match("^([^\128-\191][\128-\191]*)()", first)
   if not match then
     return false, first
   end
   -- `lead` is the first byte of the match; it must not shadow `first`,
   -- which is still needed as the error position below.
   local lead, n = s_byte(match), #match
   local success
     =  lead < 128 and n == 1
     or lead >= 192 and lead < 224 and n == 2
     or lead >= 224 and lead < 240 and n == 3
     or lead >= 240 and lead < 248 and n == 4
     or lead >= 248 and lead < 252 and n == 5
     or lead >= 252 and lead < 254 and n == 6
   -- UTF-16 surrogate code point checking left out for clarity.
   if success then
     return match, next
   else
     return false, first
   end
 end

or this (same speed in Lua 5.1/5.2, but twice as fast in LuaJIT, where
`gmatch()` is not compiled):

 function utf8_get_char_jit_valid2(subject, i)
   if i > #subject then
     return nil, i
   end
   local byte = s_byte(subject, i)
   if byte < 128 then
     return s_sub(subject, i, i), i + 1
   elseif byte < 192 then
     -- Stray continuation byte.
     return false, i
   elseif byte < 224 and s_match(subject, "^[\128-\191]", i + 1) then
     return s_sub(subject, i, i + 1), i + 2
   elseif byte < 240 and s_match(subject,
       "^[\128-\191][\128-\191]", i + 1) then
     return s_sub(subject, i, i + 2), i + 3
   elseif byte < 248 and s_match(subject,
       "^[\128-\191][\128-\191][\128-\191]", i + 1) then
     return s_sub(subject, i, i + 3), i + 4
   elseif byte < 252 and s_match(subject,
       "^[\128-\191][\128-\191][\128-\191][\128-\191]", i + 1) then
     return s_sub(subject, i, i + 4), i + 5
   elseif byte < 254 and s_match(subject,
       "^[\128-\191][\128-\191][\128-\191][\128-\191][\128-\191]",
       i + 1) then
     return s_sub(subject, i, i + 5), i + 6
   else
     return false, i
   end
 end

This is not that complex, but still rather slow in Lua, and the same
goes for getting the code point to perform a range query (useful to
test if a code point is part of some alphabet).
To that end, you could provide a `utf8.range(char, lower, upper)`, though.
This assumes you don't deprecate patterns in the next Lua version (or
the one after, to ease the transition?).
But I understand the need to balance features and light weight.
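
For what it's worth, the range query mentioned above can be sketched in plain Lua. This is only an illustration: the names `codepoint` and `utf8_range` are made up, `lower` and `upper` are assumed to be numeric code points, and the decoder assumes a well-formed sequence of at most four bytes (no validation at all).

```lua
-- Decode the code point of the UTF-8 character starting at byte i.
-- Assumption: the sequence is well formed and at most 4 bytes long.
local function codepoint(S, i)
  i = i or 1
  local b1 = string.byte(S, i)
  if b1 < 128 then
    return b1
  elseif b1 < 224 then
    local b2 = string.byte(S, i + 1)
    return (b1 - 192) * 64 + (b2 - 128)
  elseif b1 < 240 then
    local b2, b3 = string.byte(S, i + 1, i + 2)
    return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)
  else
    local b2, b3, b4 = string.byte(S, i + 1, i + 3)
    return (b1 - 240) * 262144 + (b2 - 128) * 4096
         + (b3 - 128) * 64 + (b4 - 128)
  end
end

-- Hypothetical stand-in for the proposed utf8.range(char, lower, upper):
-- true when the code point of `char` lies in [lower, upper].
local function utf8_range(char, lower, upper)
  local cp = codepoint(char)
  return lower <= cp and cp <= upper
end

-- Is U+03BB (lambda, bytes 206 187) in the Greek block U+0370..U+03FF?
print(utf8_range("\206\187", 0x370, 0x3FF))
print(utf8_range("a", 0x370, 0x3FF))
```

Only arithmetic is used, so this should run unchanged on Lua 5.1, 5.2 and LuaJIT; it does nothing about the speed concern, which is the actual argument for a built-in.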
`getchar()` and `getcodepoint()` are damn useful to write parsers, but
if LPeg is part of the next version, the point is probably moot.
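
To illustrate the parser use case, here is a rough sketch of a scanning loop built on the `char, next` / `false, position` / `nil, position` return convention described above. `getchar_stub` and `collect_chars` are hypothetical names; the stub only checks the lead/continuation byte shape (it is Roberto's pattern, not a full validator), and resuming one byte after an invalid position is just one possible recovery policy.

```lua
-- Stand-in with the same return convention as the functions above.
-- Assumption: it only checks byte shape, not the sequence length.
local function getchar_stub(S, i)
  if i > #S then return nil, i end
  local char, next_i = string.match(S, "^([^\128-\191][\128-\191]*)()", i)
  if not char then return false, i end
  return char, next_i
end

-- Hypothetical helper: collect the characters of S, replacing each
-- invalid position with `placeholder` and resuming one byte later.
local function collect_chars(S, getchar, placeholder)
  local out, i = {}, 1
  while true do
    local char, pos = getchar(S, i)
    if char == nil then break end   -- end of stream
    if char then
      out[#out + 1] = char
      i = pos
    else
      out[#out + 1] = placeholder
      i = pos + 1                   -- skip the offending byte
    end
  end
  return out
end

-- A stray continuation byte, then "a", then U+2202 (bytes 226 136 130).
local chars = collect_chars("\128a\226\136\130", getchar_stub, "?")
print(#chars, table.concat(chars, " "))
```

Any function with the same return convention can be passed in place of the stub.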
-- Pierre-Yves
