Re: Clearing up misconceptions about characters vs bytes in the manual
- Subject: Re: Clearing up misconceptions about characters vs bytes in the manual
- From: spir <denis.spir@...>
- Date: 2012-11-02 19:35:27 +0100
On 02/11/2012 14:27, Rob Hoelz wrote:
> Hi list,
> A user came on the IRC channel today asking about Unicode support.
> When I tried to explain that string.sub(2, 2) and io.read(1) wouldn't
> work as expected on UTF-8 data by explaining that Lua only uses 8-bit
> clean strings and doesn't understand UTF-8 data beyond a string of
> bytes, the user pointed out that the manual speaks in terms of
> characters, not bytes.
> Would it be a good idea to make a distinction between characters and
> bytes, or do you guys feel that this is already clear in the manual
> (and PiL)?
> -Rob
I think it would not hurt to repeat, where relevant, that a Lua character is a byte.
However, the manuals should probably not go any further into Unicode, or else
they may introduce the *very* common misconception that confuses characters with
so-called "abstract characters", which are in fact an intermediate stage between
bytes (or other code units) and characters proper.
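To make "a Lua character is a byte" concrete, here is a minimal sketch. It assumes the string holds UTF-8 data; the sample text "héllo" and the variable names are only illustrative, and the é is written with decimal byte escapes so the example does not depend on the encoding of the source file.

```lua
-- "héllo" spelled out as bytes: é is the two-byte UTF-8 sequence 195, 169.
local s = "h\195\169llo"

-- The length operator counts bytes, so this is 6, not 5.
local byte_len = #s

-- sub(2, 2) returns a single byte: the first half of é,
-- which is not valid UTF-8 on its own.
local half = s:sub(2, 2)

-- Both bytes are needed to get the whole character back.
local whole = s:sub(2, 3)

print(byte_len, #half, #whole)
```

The indices 2 and 3 are only known here because the string was built by hand; on arbitrary input they have to come from somewhere else.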
A Lua text is a string of bytes. To get a safe index or pair of indices, run a
proper search function (*), et voilà!
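As a rough illustration of getting indices from a search, string.find returns byte positions that can be passed straight to string.sub. This sketch assumes the text and the search term are both valid UTF-8 in the same normalization form; the sample text "naïve café" is made up for the example.

```lua
-- "naïve café" as UTF-8 bytes: ï is 195, 175 and é is 195, 169.
local s = "na\195\175ve caf\195\169"

-- Plain find (fourth argument true) so the needle is not treated as a pattern.
-- The results are byte positions, here 8 and 12.
local i, j = s:find("caf\195\169", 1, true)

-- Because the needle is itself whole characters, the indices fall on
-- character boundaries, and slicing with them never splits a character.
local found = s:sub(i, j)
local before = s:sub(1, i - 1)

print(i, j, #found, #before)
```

Note that j - i + 1 is 5 bytes for a 4-character word, so arithmetic on these indices still counts bytes.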
Denis
(*) unless you're dealing with machine-generated and user-inaccessible plain
ASCII source