
Re: Plea for the support of unicode escape sequences


Petite Abeille wrote:
> What's wrong with hex sequences?
> 
> print( '\xE2\x86\x92' )

I guess the problem is that most tables only tell you the codepoint but
not the UTF-8 encoding. The UTF-8 encoding currently has the advantage
that it is clear to the user why the usual pattern matching fails.

However, Mike does have a point. If someone builds UTF-8 libraries, it
would be quite convenient if codepoint escape sequences were available.
Still, I guess the sequences shouldn't be advertised in the Lua manual,
and the limitations have to be clearly stated. Without the correct
library functions to support them, the codepoint escapes are likely to
cause confusion.
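
To make the point about codepoint tables concrete, here is a minimal sketch of the conversion a user currently has to do by hand before writing a byte escape. The helper name `utf8char` is made up for this illustration and is not a standard Lua function; the sketch assumes a valid code point up to U+10FFFF and does no error checking.

```lua
-- Hypothetical helper (not part of the Lua standard library): encode one
-- Unicode code point as a UTF-8 byte string, using plain arithmetic only.
local function utf8char (cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char (cp)
  elseif cp < 0x800 then
    return char (0xC0 + floor (cp / 0x40),
                 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return char (0xE0 + floor (cp / 0x1000),
                 0x80 + floor (cp / 0x40) % 0x40,
                 0x80 + cp % 0x40)
  else
    return char (0xF0 + floor (cp / 0x40000),
                 0x80 + floor (cp / 0x1000) % 0x40,
                 0x80 + floor (cp / 0x40) % 0x40,
                 0x80 + cp % 0x40)
  end
end

-- U+2192 RIGHTWARDS ARROW, the character from the quoted hex example.
local arrow = utf8char (0x2192)

-- The string library works on bytes: the arrow has length 3, and the
-- pattern "." matches only its first byte.
local length = #arrow
local first = arrow:match (".")
```

With this helper, `utf8char (0x2192)` yields the same three bytes as the quoted hex sequence, which also shows why byte-oriented pattern matching does not see the arrow as a single character.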

I wonder how compactly you could store the character classes for the 65k
codepoints in the BMP and the lowercase/uppercase pairs (for
string.lower, string.upper). Maybe that can be compressed far enough to
be included in official Lua (5.3?). That would be great.
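
As a rough illustration of one way such data might be compressed (an assumption for the sake of example, not a proposal from this thread): many lowercase letters sit in runs that share the same distance to their uppercase form, so a table of runs with an offset is much smaller than one entry per codepoint. The names `upper_ranges` and `upper_codepoint` are hypothetical, and the table is deliberately incomplete.

```lua
-- Sorted, non-overlapping runs: { first, last, offset to uppercase }.
-- Only a handful of runs are listed here; a full table would be
-- generated from the Unicode character database.
local upper_ranges = {
  { 0x0061, 0x007A, -32 },  -- Latin small a..z
  { 0x00E0, 0x00F6, -32 },  -- Latin-1 small letters before the division sign
  { 0x00F8, 0x00FE, -32 },  -- Latin-1 small letters after the division sign
  { 0x03B1, 0x03C1, -32 },  -- Greek small alpha..rho
  { 0x0430, 0x044F, -32 },  -- Cyrillic small a..ya
  { 0x0450, 0x045F, -80 },  -- Cyrillic small ie with grave..dzhe
}

-- Binary search for the run containing cp. Code points outside every
-- run are returned unchanged.
local function upper_codepoint (cp)
  local lo, hi = 1, #upper_ranges
  while lo <= hi do
    local mid = math.floor ((lo + hi) / 2)
    local r = upper_ranges[mid]
    if cp < r[1] then
      hi = mid - 1
    elseif cp > r[2] then
      lo = mid + 1
    else
      return cp + r[3]
    end
  end
  return cp
end
```

Scripts whose upper and lower case letters alternate (such as Latin Extended-A) would need a different run type, so this only hints at the size that might be achievable.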
-- David

PS: An illustration of the usefulness of UTF-8 libraries:
In my JSON implementation I wanted to include the JavaScript regexp
> /[\\\"\x00-\x1f\x7f-\x9f\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/g
In Lua that turned into:
> local function quotestring (value)
>   -- based on the regexp "escapable" in https://github.com/douglascrockford/JSON-js
>   value = fsub (value, "[%z\1-\31\"\\\127]", escapeutf8)
>   if strfind (value, "[\194\216\220\225\226\239]") then
>     value = fsub (value, "\194[\128-\159\173]", escapeutf8)           -- U+0080-U+009F, U+00AD
>     value = fsub (value, "\216[\128-\132]", escapeutf8)               -- U+0600-U+0604
>     value = fsub (value, "\220\143", escapeutf8)                      -- U+070F
>     value = fsub (value, "\225\158[\180\181]", escapeutf8)            -- U+17B4, U+17B5
>     value = fsub (value, "\226\128[\140-\143\168-\175]", escapeutf8)  -- U+200C-U+200F, U+2028-U+202F
>     value = fsub (value, "\226\129[\160-\175]", escapeutf8)           -- U+2060-U+206F
>     value = fsub (value, "\239\187\191", escapeutf8)                  -- U+FEFF
>     value = fsub (value, "\239\191[\176-\191]", escapeutf8)           -- U+FFF0-U+FFFF
>   end
>   return "\"" .. value .. "\""
> end
(fsub is just an optimization for gsub).
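
The post does not show `escapeutf8`, the callback passed to fsub. The following is only a guess at its shape: it takes one matched character (a UTF-8 sequence of one to three bytes, which is all the patterns for this regexp can match) and returns a JSON escape. The `short` table of two-character escapes is likewise an assumption.

```lua
-- Assumed two-character JSON escapes for common characters.
local short = {
  ["\""] = "\\\"", ["\\"] = "\\\\", ["\b"] = "\\b", ["\f"] = "\\f",
  ["\n"] = "\\n",  ["\r"] = "\\r",  ["\t"] = "\\t",
}

-- Sketch of the callback: decode one UTF-8 sequence of up to three bytes
-- and format the code point as \uXXXX. Input is assumed to be valid.
local function escapeutf8 (uchar)
  if short[uchar] then return short[uchar] end
  local a, b, c = uchar:byte (1, 3)
  local cp
  if a < 0x80 then        -- one byte:    0xxxxxxx
    cp = a
  elseif a < 0xE0 then    -- two bytes:   110xxxxx 10xxxxxx
    cp = (a - 0xC0) * 0x40 + (b - 0x80)
  else                    -- three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    cp = ((a - 0xE0) * 0x40 + (b - 0x80)) * 0x40 + (c - 0x80)
  end
  return string.format ("\\u%04x", cp)
end

-- Usage with plain gsub: escape U+2028 LINE SEPARATOR inside a string.
local escaped = ("a\226\128\168b"):gsub ("\226\128\168", escapeutf8)
```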
Or LPeg:
> local SpecialChars = (R"\0\31" + S"\"\\\127" +
>   P"\194" * (R"\128\159" + P"\173") +
>   P"\216" * R"\128\132" +
>   P"\220\143" +
>   P"\225\158" * S"\180\181" +
>   P"\226\128" * (R"\140\143" + R"\168\175") +
>   P"\226\129" * R"\160\175" +
>   P"\239\187\191" +
>   P"\239\191" * R"\176\191") / escapeutf8
> 
> local QuoteStr = g.Cs (g.Cc "\"" * (SpecialChars + 1)^0 * g.Cc "\"")
(I guess there are already libraries and Lua bindings to make this
easier, but the point of my JSON library was to stay independent and
easy to use in environments like MUD clients where you might not have
much more than pure Lua).
