
Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)



Roberto Ierusalimschy <roberto@inf.puc-rio.br> writes:
> A very basic support for UTF-8, in the lines suggested by Miles Bader,
> seems a good start. Something more or less like this:
Oooh, nice to see something real!
Maybe I'm missing something, but there doesn't seem to be a way to
efficiently compute "incremental" character byte-offsets in a string,
which might be used when iterating over the utf8 characters in a string
(possibly starting from some deep interior point).
[In my prev message I called this "char_offset" (maybe not such a good name):
 utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX]
Your utf8.byteoffsets seems the closest in spirit, but won't be
efficient in many cases because it always has to scan the string from
the beginning.
Maybe if you added an optional "start_offset" parameter to
utf8.byteoffsets:
 utf8.byteoffset(s, l, [start_offset])
 -> offset (in bytes) where 'l'-th code point from START_OFFSET (in
 bytes, default 1) starts
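
To make the suggested semantics concrete, here is a minimal pure-Lua sketch of such a function. It is only an illustration, not the proposed library code: it assumes the string is valid UTF-8, that START_OFFSET lies on a character boundary, and that l = 1 means "the character at START_OFFSET itself" (one possible reading of the description above). Returning nil when the request runs past the end of the string is also an assumption.

```lua
-- Hypothetical sketch of byteoffset(s, l, [start_offset]).
-- Returns the byte offset at which the l-th code point starts, counting
-- from start_offset (default 1), so l = 1 yields start_offset itself.
-- Assumes s is valid UTF-8 and start_offset is on a character boundary.
function byteoffset(s, l, start_offset)
  local i = start_offset or 1
  local len = #s
  for _ = 2, l do
    -- nothing left to step over: the requested character does not exist
    if i > len then return nil end
    i = i + 1
    -- skip continuation bytes (10xxxxxx, i.e. 0x80..0xBF)
    while i <= len do
      local b = s:byte(i)
      if b < 0x80 or b >= 0xC0 then break end
      i = i + 1
    end
  end
  if i > len + 1 then return nil end
  return i
end
```

The cost is proportional to the number of characters skipped from START_OFFSET, not to the distance from the beginning of the string, which is the point of the extra parameter.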
I think many higher-level utf8-aware interfaces will probably tend to
be written in terms of string byte-offsets, so having an efficient way to
operate on interior string segments is important.
Consider an "output unicode characters to MUMBLE" function:
 function output_unicode_chars_to_mumble (mumble, string, start, finish)
    start = start or 1
    finish = finish or #string
    -- iterate over STRING, outputting a single character at a time
    while start <= finish do
       local codepoint = utf8.codepoints (string, start)
       output_unicode_codepoint_to_mumble (mumble, codepoint)
       start = utf8.byteoffset (string, 1, start) -- increment START
    end
 end
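
The same iteration pattern can be tried without any utf8 library by decoding by hand. The sketch below is self-contained and makes some assumptions: decode_at and collect_codepoints are hypothetical names, the input is taken to be valid UTF-8, and START is taken to lie on a character boundary. Instead of writing to MUMBLE it collects the code points into a table so the result is easy to inspect.

```lua
-- Hypothetical helper: decode the code point starting at byte offset i.
-- Returns the code point and the byte offset of the next character.
-- Assumes s is valid UTF-8 and i is on a character boundary.
function decode_at(s, i)
  local b = s:byte(i)
  local n, cp
  if b < 0x80 then n, cp = 0, b              -- 0xxxxxxx: 1-byte sequence
  elseif b < 0xE0 then n, cp = 1, b - 0xC0   -- 110xxxxx: 2-byte sequence
  elseif b < 0xF0 then n, cp = 2, b - 0xE0   -- 1110xxxx: 3-byte sequence
  else n, cp = 3, b - 0xF0 end               -- 11110xxx: 4-byte sequence
  for k = 1, n do
    -- each continuation byte contributes its low 6 bits
    cp = cp * 64 + (s:byte(i + k) - 0x80)
  end
  return cp, i + n + 1
end

-- Walk s from byte offset start to finish, one character per step.
function collect_codepoints(s, start, finish)
  start = start or 1
  finish = finish or #s
  local out = {}
  while start <= finish do
    local cp, next_start = decode_at(s, start)
    out[#out + 1] = cp
    start = next_start   -- incremental advance, no rescan from offset 1
  end
  return out
end
```

For example, collect_codepoints("a\195\169\226\130\172", 2) starts at the second character and visits only the bytes from offset 2 onward.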
Thanks,
-miles
-- 
"She looks like the wax version of herself."
 	 	 		 [Comment under a Paris Hilton fashion pic]
