Re: Changes in the validation of UTF-8

Subject: Re: Changes in the validation of UTF-8
From: Roberto Ierusalimschy <roberto@...>
Date: 20 Mar 2019 10:58:58 -0300

> What I think is a backwards step is the lexer accepting "\u{110000}".
> Unicode escapes >10FFFF should really be an error IMO.
If you write "\u{110000}", you are explicitly asking for an
invalid code. If you want invalid codes, you might as well write
"\xf4\x90\x80\x80" or "\244\144\128\128". "\u{110000}" just makes
it easier.
(But "\u{110000}" is hardly as useful as utf8.char(0x110000).
I was wondering about removing this laxity in \u; but then surrogates
should be invalid too. Is that good?)
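
As a minimal sketch of why these spellings are interchangeable (assuming
Lua 5.2 or later for the \x escapes), the snippet below builds the four
bytes for 0x110000 by hand and compares them with both literals. `encode4`
is a hypothetical helper written for this illustration, not part of Lua,
and it only handles the four-byte form.

```lua
-- Hypothetical helper: encode a code in the four-byte UTF-8 layout
-- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, with no validity checks at all.
local function encode4(cp)
  local b1 = 0xF0 + math.floor(cp / 0x40000)        -- top 3 bits
  local b2 = 0x80 + math.floor(cp / 0x1000) % 0x40  -- next 6 bits
  local b3 = 0x80 + math.floor(cp / 0x40) % 0x40    -- next 6 bits
  local b4 = 0x80 + cp % 0x40                       -- low 6 bits
  return string.char(b1, b2, b3, b4)
end

local bytes = encode4(0x110000)

-- The hex and decimal escapes spell the same four bytes: 244, 144, 128, 128.
local same_as_hex = (bytes == "\xf4\x90\x80\x80")
local same_as_dec = (bytes == "\244\144\128\128")
print(same_as_hex, same_as_dec)
```

The "\u{110000}" spelling is left out of the code on purpose, since whether
the lexer accepts it depends on the Lua version, which is the point under
discussion.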
> UTF8PATT accepting deprecated 5 and 6 byte sequences is a similarly
> undesirable change.
UTF8PATT already accepted all kinds of wrong stuff, including
overlong sequences. 5 and 6 byte sequences are the least of the
problems here. The documentation is (and was) clear that you should use
it only on valid strings.
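
A small sketch of that point, assuming Lua 5.3 or later (where the utf8
library exists) and taking utf8.charpattern as the Lua-level name for
UTF8PATT: the pattern only checks byte ranges, so an overlong sequence
matches it, while utf8.len, which actually decodes, reports a failure.

```lua
-- Overlong three-byte encoding of U+0000 (the valid encoding is one byte).
local overlong = "\xE0\x80\x80"

-- The pattern is a lead byte followed by any number of continuation bytes,
-- so it happily matches the whole overlong sequence as one "character".
local matched = overlong:match("^" .. utf8.charpattern .. "$")

-- utf8.len decodes the sequence and returns nil plus the position of the
-- first invalid byte.
local len, badpos = utf8.len(overlong)

print(matched == overlong, len, badpos)
```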
> Accepting unpaired surrogates isn't odd, and is unfortunately required
> when working with many badly designed APIs (e.g. windows file paths,
> javascript). utf-8 with unpaired surrogates allowed is often called
> "wtf-8". https://simonsapin.github.io/wtf-8/
That's the whole point: It is useful to be able to work with invalid
codes. Why is 110000 "more invalid" than a surrogate? If you are going
to accept surrogates, why not go the whole way and accept what UTF-8
was designed for?
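
To make the categories in this question concrete, here is a minimal sketch
of the ranges involved. `classify` is a hypothetical helper, not part of
Lua; the boundaries it uses are the surrogate block D800..DFFF, the Unicode
limit 10FFFF, and the 31-bit limit 7FFFFFFF of the original six-byte UTF-8
design.

```lua
-- Hypothetical helper: say which kind of code a given integer is.
local function classify(cp)
  if cp < 0 or cp > 0x7FFFFFFF then
    return "unencodable"      -- outside even the original 6-byte design
  elseif cp > 0x10FFFF then
    return "beyond-unicode"   -- encodable, but above the Unicode limit
  elseif cp >= 0xD800 and cp <= 0xDFFF then
    return "surrogate"        -- encodable, but reserved for UTF-16 pairs
  else
    return "scalar"           -- a valid Unicode scalar value
  end
end

print(classify(0x110000), classify(0xD800), classify(0x10FFFF))
```

A strict validator rejects both middle categories; a lax one that accepts
surrogates has no encoding-level reason to stop at 10FFFF.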
-- Roberto
