Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
- From: Tim Mensch <tim-lua-l@...>
- Date: 2012-02-08 11:49:47 -0700

On 2/8/2012 11:01 AM, Dirk Laurie wrote:
> (1) Additional functions in "string" library, e.g. str:usub(3,6)
> extracts UTF8 characters 3 to 6 and throws an error if str is not
> valid UTF8. Pro: simplest. Con: requires a change in 'official' Lua,
> can't genuinely start mid-string.

Is there some reason I'm missing that we couldn't add functions to 
"string" ourselves? Is it just considered bad form?
Such an add-on couldn't include the optimizations I suggested in another 
message unless you modified Lua proper, though, so no, you couldn't 
start mid-string.
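
For illustration, here is a minimal pure-Lua sketch of option (1), added to the "string" table without touching Lua itself. The helper name char_starts, the error messages, and the restriction to positive indices are assumptions made for this example; only the structure of lead and continuation bytes is checked, not overlong forms or surrogate code points.

```lua
-- Hypothetical helper: byte offsets at which each UTF-8 character starts.
-- Raises an error if the byte structure is not valid UTF-8.
local function char_starts(s)
  local starts, i, n = {}, 1, #s
  while i <= n do
    local b = s:byte(i)
    local len
    if b < 0x80 then len = 1
    elseif b >= 0xC2 and b <= 0xDF then len = 2
    elseif b >= 0xE0 and b <= 0xEF then len = 3
    elseif b >= 0xF0 and b <= 0xF4 then len = 4
    else error("invalid UTF-8 lead byte at offset " .. i, 3) end
    if i + len - 1 > n then
      error("truncated UTF-8 sequence at offset " .. i, 3)
    end
    for k = i + 1, i + len - 1 do
      local c = s:byte(k)
      if c < 0x80 or c > 0xBF then
        error("invalid UTF-8 continuation byte at offset " .. k, 3)
      end
    end
    starts[#starts + 1] = i
    i = i + len
  end
  return starts
end

-- str:usub(i, j): characters i..j, 1-based and inclusive.
-- Only positive indices are handled in this sketch.
function string.usub(s, i, j)
  local starts = char_starts(s)
  local count = #starts
  j = j or count
  if i < 1 then i = 1 end
  if j > count then j = count end
  if i > j then return "" end
  local first = starts[i]
  local last = (j < count) and (starts[j + 1] - 1) or #s
  return s:sub(first, last)
end

local s = "na\195\175ve caf\195\169"   -- the UTF-8 bytes of "naïve café"
print(s:usub(3, 6))                    -- characters 3 to 6
```

Because strings share a metatable whose __index is the "string" table, s:usub(3, 6) works as a method call. Every call rescans the string from its first byte, which is exactly the "can't genuinely start mid-string" cost.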

> (3) Another standard library, say "utf8", but operating on userdata,
> e.g. ustr:sub(3,6). ustr:type() is 'utf8'. Creates a private code
> point address list. Pro: avoids cons of (1) and (2). Con: requires
> conversion to-from string.

One of the key advantages of using UTF-8 is that you're just 
manipulating strings, so not being able to convert trivially is 
annoying. Obviously it could have a __tostring function, though, so 
converting in that direction doesn't need to be painful. If only 
concatenation (..) would run __tostring on a table parameter, we'd be 
set with the userdata approach. And to convert the other direction, a 
short function name like "_u" could make that easy: _u"string to make a 
UTF-8 object out of".
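
As a rough sketch of the wrapper idea, the following uses a plain table in place of userdata (pure Lua cannot create userdata). The names UStr and the field bytes are hypothetical. It also shows that Lua's existing __concat metamethod can give the effect wished for above: it is invoked whenever either operand of .. is not a string or number, so the wrapper can convert itself.

```lua
-- Hypothetical wrapper type; a table stands in for the userdata.
local UStr = {}
UStr.__index = UStr

-- _u "text" wraps a plain Lua string (assumed to be valid UTF-8).
local function _u(s)
  return setmetatable({ bytes = s }, UStr)
end

-- tostring(ustr) gives back the underlying byte string.
UStr.__tostring = function(self)
  return self.bytes
end

-- Called when either side of .. is a wrapper; the result is a wrapper too.
UStr.__concat = function(a, b)
  return _u(tostring(a) .. tostring(b))
end

function UStr:type()
  return "utf8"
end

local greeting = "Hello, " .. _u"w\195\182rld" .. "!"
print(tostring(greeting))   -- the UTF-8 bytes of "Hello, wörld!"
```

Whether the result of a mixed concatenation should be a wrapper or a plain string is a design choice; this sketch keeps it wrapped.
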
I pretty much need such a function anyway, since the sane way to do 
internationalization is to put your strings in a table somewhere which 
gets switched based on your locale, so in the code I'll have:
_t "English version of the string."
...and then elsewhere I'll have a table:
{
 ["English version of the string."] = "A translation of the string to another language.",
}
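
A minimal sketch of how such a lookup could be wired up. The table layout, the locale key "xx", and the variable current_locale are assumptions for illustration; falling back to the English text when no translation exists is one possible policy, not the only one.

```lua
-- Hypothetical per-locale tables, keyed by the English source string.
local translations = {
  xx = {
    ["English version of the string."] =
      "A translation of the string to another language.",
  },
}

local current_locale = "xx"   -- assumed to be chosen once at startup

-- _t "text" returns the translation for the current locale, or the
-- English text itself when none is available.
local function _t(s)
  local tbl = translations[current_locale]
  return (tbl and tbl[s]) or s
end

print(_t "English version of the string.")
print(_t "A string with no translation yet.")
```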

> But your item [2] really kills all of these ideas. If we can't have
> ustr:match, we may as well compile Lua with 16-bit Unicode strings if
> our locale is fundamentally non-ASCII.

Yuck. I would suggest that 16-bit Unicode was NEVER a good idea. Even 
leaving combining characters aside, you can't fit all of the Unicode 
code points in 16 bits (over 110,000 now [1]), so some of them take two 
16-bit words to store ("surrogate pairs"). This means that you can't 
reliably index a UTF-16 string by character offset, and direct indexing 
of characters is the only argument I've heard in favor of UTF-16.
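
The surrogate-pair arithmetic can be sketched as follows. The function name utf16_units is hypothetical, and the input is assumed to be a valid code point (it is not checked against the surrogate range or the U+10FFFF limit).

```lua
-- Returns the UTF-16 code units for a Unicode code point: one unit inside
-- the Basic Multilingual Plane, two (a surrogate pair) above U+FFFF.
local function utf16_units(cp)
  if cp < 0x10000 then
    return { cp }
  end
  local v = cp - 0x10000                        -- 20-bit value
  local high = 0xD800 + math.floor(v / 0x400)   -- top 10 bits
  local low  = 0xDC00 + v % 0x400               -- bottom 10 bits
  return { high, low }
end

local units = utf16_units(0x1F600)
print(string.format("%04X %04X", units[1], units[2]))   -- D83D DE00
```

Since a character may occupy one or two units, the n-th unit of a UTF-16 string is not necessarily the n-th character.
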
Besides, outside of Windows, the rest of the world seems to be moving 
toward UTF-8 as a standard encoding. (I know that it's more complicated 
in some countries, but it still seems to be the general trend.)
Making the pattern matching work for UTF-8 strings wouldn't be rocket 
science. As was pointed out in another message, MOST of the patterns 
would work MOSTLY as-is. I bet it wouldn't take more than a few minor 
patches to make a version of match() that would work fine for UTF-8.
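
To make "mostly as-is" concrete, here is a small sketch using only the stock string library. The name UCHAR and the helper ulen are assumptions for this example, and the input is assumed to be valid UTF-8. Literal searches work unchanged because they compare bytes; the "." class is where the byte orientation shows, since it matches a single byte rather than a whole character.

```lua
-- One UTF-8 character: a non-continuation byte followed by any number of
-- continuation bytes (0x80-0xBF). Assumes valid UTF-8 input.
local UCHAR = "[^\128-\191][\128-\191]*"

local s = "caf\195\169 na\195\175ve"   -- the UTF-8 bytes of "café naïve"

-- Hypothetical helper: count characters by counting UCHAR matches.
local function ulen(str)
  local _, count = str:gsub(UCHAR, "")
  return count
end

print(#s, ulen(s))                       -- 12 bytes, 10 characters

-- A plain (literal) search works as-is; positions are byte offsets.
print(s:find("na\195\175ve", 1, true))   -- 7  12

-- "." captures only the first byte of the two-byte character ...
print(#s:match("caf(.)"))                -- 1
-- ... while UCHAR captures the whole character.
print(#s:match("caf(" .. UCHAR .. ")"))  -- 2
```
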
Tim
[1] http://en.wikipedia.org/wiki/Unicode