
Re: Should Lua be more strict about Unicode errors?



2015-09-02 18:03 GMT+02:00 Jay Carlson <nop@nop.com>:
> It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return?
> ("%X%X%X"):format(string.byte(utf8.char(0xd800),1,-1))
EDA080
This string is translated back to 55296 (i.e. 0xd800) by e.g. the
'utf8' pattern in the LPeg manual.
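
For readers who want to see where EDA080 comes from, here is a minimal
sketch of the three-byte UTF-8 arithmetic, done by hand. The helper names
(encode3, decode3, hex) are made up for illustration and are not part of
Lua's utf8 library or of LPeg; like the call above, the encoder makes no
attempt to reject surrogates.

```lua
-- Hypothetical helpers, for illustration only.

-- Encode a code point in U+0800..U+FFFF as three UTF-8 bytes.
-- Deliberately no surrogate check, so 0xD800 is accepted.
local function encode3(cp)
  assert(cp >= 0x800 and cp <= 0xFFFF, "three-byte range only")
  local b1 = 0xE0 + math.floor(cp / 0x1000)        -- 1110xxxx: top 4 bits
  local b2 = 0x80 + math.floor(cp / 0x40) % 0x40   -- 10xxxxxx: middle 6 bits
  local b3 = 0x80 + cp % 0x40                      -- 10xxxxxx: low 6 bits
  return string.char(b1, b2, b3)
end

-- Inverse of encode3; assumes s starts with a three-byte sequence.
local function decode3(s)
  local b1, b2, b3 = s:byte(1, 3)
  return (b1 - 0xE0) * 0x1000 + (b2 - 0x80) * 0x40 + (b3 - 0x80)
end

-- Render a byte string as upper-case hex digits.
local function hex(s)
  return (s:gsub(".", function(c) return ("%02X"):format(c:byte()) end))
end

print(hex(encode3(0xD800)))      -- EDA080
print(decode3(encode3(0xD800)))  -- 55296
```

The arithmetic is the same for every code point in the three-byte range,
which is why a surrogate round-trips like any other value unless the
encoder or decoder checks for it explicitly.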
> The way I read the Lua manual, you should be able to understand
> Lua's approach to UTF-8 by just reading the RFC.
On the contrary, the three-letter sequence RFC does not occur
in the manual. I estimate that not more than 1% of people who
have read the Lua manual have also read RFC 3629. Quite a
few more have read the Wikipedia page, though, which says on
this topic:
~~~
According to the UTF-8 definition (RFC 3629) the high and low
surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal
Unicode values, and their UTF-8 encoding should be treated as an
invalid byte sequence.
Whether an actual application should do this is debatable, as it makes
it impossible to store invalid UTF-16 (that is, UTF-16 with unpaired
surrogate halves) in a UTF-8 string. This is necessary to store
unchecked UTF-16 such as Windows filenames as UTF-8. It is also
incompatible with CESU encoding (described below).
~~~
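
The two policies in that quote can be put side by side in a few lines.
This is only a sketch: find_surrogate and accept are hypothetical names,
and the scan assumes the input is otherwise well-formed UTF-8. It relies
on the fact that U+D800 through U+DFFF encode as the byte 0xED followed
by a byte in 0xA0..0xBF and one continuation byte.

```lua
-- Hypothetical helpers, for illustration only.

-- Return the byte position of the first encoded surrogate, or nil.
-- \237 = 0xED, \160-\191 = 0xA0..0xBF, \128-\191 = 0x80..0xBF.
local function find_surrogate(s)
  return (s:find("\237[\160-\191][\128-\191]"))
end

-- Strict policy (RFC 3629): reject strings with encoded surrogates.
-- Lax policy: let them through, e.g. to carry unpaired UTF-16 halves.
local function accept(s, lax)
  if lax then return true end
  return find_surrogate(s) == nil
end

local s = "a\237\160\128"        -- "a" followed by the bytes ED A0 80
print(find_surrogate(s))         -- 2
print(accept(s, false))          -- false
print(accept(s, true))           -- true
```

Which default is right is exactly the debatable part; the check itself
is cheap either way.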
