
Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)



On Wed, Feb 8, 2012 at 4:16 AM, Miles Bader <miles@gnu.org> wrote:
> Jay Carlson <nop@nop.com> writes:
>>  But computers are already pervasive in Japan and the Republic of
>> Korea too, and the People's Republic of China is on its way--and the
>> CJK written languages are difficult to handle in unextended Lua.
>
> Wait... how are CJK "hard to handle in unextended Lua" ...?
>   >    for file in lfs.dir (os.getenv ("HOME").."/"..tanka_dir) do
>   >       author = string.match (file, "^(.*)[.]txt$")
Fails when the filesystem encoding is not UTF-8. That won't affect you,
because you know yours is UTF-8, and that is a fairly safe assumption
in your environment.
>   >          for line in io.lines (tanka_dir.."/"..file) do
>   >             print ("   "..line)
Oh dear, who put that SJIS file in that directory?

Given how many of the string.* operations are closed over the formal
language of valid UTF-8[1], it should not be surprising that a lot of
us are already using UTF-8 in small apps without a smidgen of magic.
This works because a) we're pretty fanatic about keeping *everything*
in UTF-8 and b) we know which string operations are not closed and
avoid them or arrange for other preconditions. The question then is
whether errors should be silent or loud, and when. It's part b that
really worries me, because I'm going to blow it at some point; I'm not
perfect and I still think in ASCII.
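
A minimal sketch of what "closed over valid UTF-8" means in practice, assuming stock Lua 5.1 or later with no extensions. `is_valid_utf8` is a hypothetical helper written for this illustration (it is not a standard library function), and the sample string is built with string.char so the example does not depend on the encoding of the source file.

```lua
-- Hypothetical helper (not a standard function): returns true if s is
-- well-formed UTF-8, otherwise false plus the byte offset of the bad sequence.
local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local c = s:byte(i)
    local len
    if c < 0x80 then len = 1
    elseif c >= 0xC2 and c <= 0xDF then len = 2
    elseif c >= 0xE0 and c <= 0xEF then len = 3
    elseif c >= 0xF0 and c <= 0xF4 then len = 4
    else return false, i end                     -- stray continuation or illegal lead byte
    if i + len - 1 > n then return false, i end  -- truncated sequence
    for j = i + 1, i + len - 1 do
      local t = s:byte(j)
      if t < 0x80 or t > 0xBF then return false, i end
    end
    -- Reject overlong forms, surrogates, and code points above U+10FFFF.
    local c2 = s:byte(i + 1)
    if (c == 0xE0 and c2 < 0xA0) or (c == 0xED and c2 > 0x9F)
       or (c == 0xF0 and c2 < 0x90) or (c == 0xF4 and c2 > 0x8F) then
      return false, i
    end
    i = i + len
  end
  return true
end

local nihon = string.char(230, 151, 165, 230, 156, 172)  -- "日本" as UTF-8 bytes
local file  = nihon .. ".txt"

-- Operations that keep valid UTF-8 valid.
local closed = {
  concat = nihon .. "/" .. file,
  rep    = string.rep(nihon, 3),
  match  = string.match(file, "^(.*)[.]txt$"),  -- anchors on ASCII bytes only
}

-- Operations that can cut or reorder bytes inside a character.
local not_closed = {
  sub     = nihon:sub(1, 2),    -- stops in the middle of the first character
  reverse = nihon:reverse(),    -- continuation bytes now come first
}

for name, s in pairs(closed) do assert(is_valid_utf8(s), name) end
for name, s in pairs(not_closed) do assert(not is_valid_utf8(s), name) end
```

The pattern from the quoted example lands in the "closed" group only because it anchors on ASCII bytes, which never occur inside a multi-byte UTF-8 sequence.
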
In your example (and nobody expects throwaway examples on mailing
lists to be comprehensive) you do not significantly destructure text
aside from splitting on newlines. I agree this is a good example of
"you don't do the complicated stuff much" but at some point this
begins to shift to "you don't do the complicated stuff much because
you restructure your problems to avoid non-ASCII/non-literal-match
manipulation". In my copious free time, I'd like to look through PiL's
string manipulation and see what does and doesn't work, and what works
with which fixups and assumptions. Uh, and maybe look at email not
related to this. :-)

I avoid programming in C because I find it an anxious experience. So
much of C programming failure ends in the form "...behavior is
undefined, and usually brings down the runtime in the near future. If
you're lucky." This is perhaps the strongest argument against C and
C++ as teaching languages: programming mistakes result in
unpredictable failure modes.

Lua does not have that failure mode, but propagation of encoding
errors can work out that way with increasing levels of "what the heck
happened?" I would rather not have Lua UTF-8 handling be the same way.

Manual localization of errors is the easiest, but the assert_utf8()
tool is missing. Is it a battery? Marketing and educational issues
aside, there seems to be a very efficient memoization implementation
available only in core, attaching the tristate { unknown, valid,
invalid } to the string itself; perhaps then wrap string.* in asserts
on the Lua side?[2] And stepping back from the specifics of UTF-8, a
validity marker would be useful in EUC as well.
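
One way the Lua-side half of this could look, as a sketch only. The names `assert_utf8`, `is_valid_utf8`, `checked`, `check_first` and `validity` are all hypothetical, and the memo table merely stands in for the in-core tristate described above. A plain Lua table keeps every checked string alive (strings are never removed from weak tables), which is part of why the flag seems to belong in core.

```lua
-- Hypothetical validator: true if s is well-formed UTF-8, else false.
local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local c = s:byte(i)
    local len
    if c < 0x80 then len = 1
    elseif c >= 0xC2 and c <= 0xDF then len = 2
    elseif c >= 0xE0 and c <= 0xEF then len = 3
    elseif c >= 0xF0 and c <= 0xF4 then len = 4
    else return false end
    if i + len - 1 > n then return false end
    for j = i + 1, i + len - 1 do
      local t = s:byte(j)
      if t < 0x80 or t > 0xBF then return false end
    end
    local c2 = s:byte(i + 1)
    if (c == 0xE0 and c2 < 0xA0) or (c == 0xED and c2 > 0x9F)
       or (c == 0xF0 and c2 < 0x90) or (c == 0xF4 and c2 > 0x8F) then
      return false
    end
    i = i + len
  end
  return true
end

-- Memo standing in for the in-core flag: true = valid, false = invalid,
-- absent = unknown. Assumption: unbounded growth is acceptable in a sketch.
local validity = {}

local function assert_utf8(s, what)
  local ok = validity[s]
  if ok == nil then
    ok = is_valid_utf8(s)
    validity[s] = ok
  end
  if not ok then
    error("invalid UTF-8 in " .. (what or "string"), 2)
  end
  return s
end

-- Check the first result if it is a string, then pass all results through.
local function check_first(what, r, ...)
  if type(r) == "string" then assert_utf8(r, what) end
  return r, ...
end

-- Wrap one string.* function so bad input or bad output fails loudly.
local function checked(name)
  local f = string[name]
  return function(s, ...)
    assert_utf8(s, "argument to string." .. name)
    return check_first("result of string." .. name, f(s, ...))
  end
end

local sub   = checked("sub")
local nihon = string.char(230, 151, 165, 230, 156, 172)  -- "日本" as UTF-8 bytes
local first = sub(nihon, 1, 3)            -- whole first character: passes
local ok, err = pcall(sub, nihon, 1, 2)   -- cuts a character in half: raises
```

The error is raised at the call that produced the bad string, so the failure is localized instead of surfacing several steps later.
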
I hope it is apparent I am exploring what a minimal implementation
might be. There are all kinds of other doodads I contemplate
associated with strings because this is an instance of a general
problem I wish my programming languages would help me with. I'm trying
to focus on the UTF-8 problem since it is most likely to be
generally seen as a problem. The "no changes needed" alternative is
also to say that ANSI C already has UTF-8 support, and as a result so
does Lua. I find this dissatisfying.

Jay

[1]: Why yes, if UTF-8 processing is how we do Unicode processing, and
we don't have the character property tables, we've reduced this to a
trivial case of the whole "strings have types; will your language help
you?" question. It's just a very simple language.
[2]: Patterns look very difficult to fix up on the Lua side though.
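
A small illustration of why patterns are the hard case, assuming stock Lua, where patterns operate on bytes. The `one_char` pattern below is a commonly used Lua-side workaround, not a fix: it assumes the input is already valid UTF-8 and contains no NUL bytes, and the variable names are made up for this example.

```lua
local nihon = string.char(230, 151, 165, 230, 156, 172)  -- "日本" as UTF-8 bytes
local ni    = string.char(230, 151, 165)                  -- "日"
local hon   = string.char(230, 156, 172)                  -- "本"

-- "." matches one byte, so this capture is a lone lead byte, not a character.
local first_byte = nihon:match("^(.)")

-- A character class is a set of bytes. The class built from "日" contains
-- the byte 230, which is also the lead byte of "本", so it matches there.
local false_hit = hon:find("[" .. ni .. "]")

-- Workaround: one lead (or ASCII) byte followed by its continuation bytes.
local one_char   = "[\1-\127\194-\244][\128-\191]*"
local first_char = nihon:match("^(" .. one_char .. ")")
local _, count   = nihon:gsub(one_char, "")   -- counts characters, not bytes
```

Every pattern item that means "one character" (".", classes, quantifiers applied to them) would need this kind of rewriting, which is what makes a Lua-side fixup awkward.
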
