Re: Clearing up misconceptions about characters vs bytes in the manual


Subject: Re: Clearing up misconceptions about characters vs bytes in the manual
From: spir <denis.spir@...>
Date: 2012-11-02 19:35:27 +0100

On 02/11/2012 14:27, Rob Hoelz wrote:

> Hi list,
> A user came on the IRC channel today asking about Unicode support.
> When I tried to explain that string.sub(2, 2) and io.read(1) wouldn't
> work as expected on UTF-8 data by explaining that Lua only uses 8-bit
> clean strings and doesn't understand UTF-8 data beyond a string of
> bytes, the user pointed out that the manual speaks in terms of
> characters, not bytes.
> Would it be a good idea to make a distinction between characters and
> bytes, or do you guys feel that this is already clear in the manual
> (and PiL)?
> -Rob

I think it would not hurt to repeat, where relevant, Lua-char = byte.

However, manuals probably should not go any further in mentioning Unicode,
or else they may introduce the *very* common misconception that confuses
characters with so-called "abstract characters", which are in fact an
intermediate state between bytes (or other code units) and characters proper.

A Lua text is a string of bytes; to get a safe index or pair of indices,
run a proper search function (*), et voilà!
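
A minimal sketch of the point, assuming Lua 5.1 or later and a UTF-8 encoded
string. The sample string and the variable names are illustrative only; they
do not come from the manual. It shows byte-based slicing cutting a character
in half, and a plain search returning byte indices that can be handed back to
string.sub.

```lua
-- "é" is U+00E9, encoded in UTF-8 as the two bytes 0xC3 0xA9 (195, 169).
local s = "h\195\169llo"   -- "héllo": 5 characters, but 6 bytes

-- The length operator counts bytes, not characters.
local len = #s

-- Byte-based slicing cuts the "é" in half: the result is a lone lead
-- byte, which is not valid UTF-8 on its own.
local broken = s:sub(2, 2)

-- A plain (non-pattern) search returns byte indices. Assuming both the
-- subject and the needle are valid UTF-8, a match starts and ends on
-- character boundaries, so the indices are safe to pass to string.sub.
local needle = "\195\169"
local first, last = s:find(needle, 1, true)
local whole = s:sub(first, last)

print(len, first, last, whole == needle)
```

The search works here because UTF-8 is self-synchronizing: a valid needle
cannot match starting in the middle of another character. That guarantee is
a property of the encoding, not of Lua, which still sees only bytes.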

Denis

(*) unless you're dealing with machine-generated and user-inaccessible
plain ASCII source
