Re: question about Unicode
- Subject: Re: question about Unicode
- From: Rici Lake <lua@...>
- Date: Thu, 7 Dec 2006 17:10:47 -0500
On 7-Dec-06, at 4:56 PM, Glenn Maynard wrote:
> UTF-32 at least does away with the last: a single data element
> (wchar_t) always represents a single codepoint. That codepoint may
> not represent the entire glyph, but that's a separate problem -- in
> UTF-16, you have to cope with both decoding codepoints and combining
> multiple codepoints into one glyph, which are different issues
> causing different problems.
Actually, I think you could solve both of those problems with the same
code. And you're not out of the woods with UTF-32 in terms of decoding,
unless you skip validation: in UTF-32 the surrogate codes are illegal
(as are codes >= 2^20 + 2^16, i.e. >= 0x110000).
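For concreteness, the validity check that rules those out is tiny; a
minimal sketch in C (the function name is my own, not from any
particular library):

    #include <stdint.h>

    /* A valid Unicode scalar value is below 0x110000 (= 2^20 + 2^16)
       and outside the surrogate range U+D800..U+DFFF. */
    static int is_valid_utf32(uint32_t cp)
    {
        return cp < 0x110000 && (cp < 0xD800 || cp > 0xDFFF);
    }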
> (I suspect that a lot of application-level UTF-16 code simply
> ignores surrogate pairs, turning it into UCS-2, though.)
Yes. When such code is combined with more modern libraries, it can
cause ugly things to happen -- I think that is why some ncurses
installations crash when given characters outside of the BMP.
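To make the pairing step concrete, here is roughly what a UTF-16
decoder has to do beyond UCS-2 -- a sketch with names of my own
invention, not any particular library's API:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one codepoint from s (n units long), starting at s[*i].
       Advances *i past the unit(s) consumed; yields U+FFFD for an
       unpaired surrogate. */
    static uint32_t utf16_decode(const uint16_t *s, size_t n, size_t *i)
    {
        uint16_t hi = s[(*i)++];
        if (hi < 0xD800 || hi > 0xDFFF)
            return hi;                          /* ordinary BMP unit */
        if (hi <= 0xDBFF && *i < n) {           /* high surrogate */
            uint16_t lo = s[*i];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                (*i)++;                         /* consume low half */
                return 0x10000 +
                    (((uint32_t)(hi - 0xD800) << 10) | (lo - 0xDC00));
            }
        }
        return 0xFFFD;                          /* unpaired surrogate */
    }

UCS-2-only code is what you get if you stop after the first test.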
I completely agree that UTF-16 is not appropriate as an exchange
format. UTF-8 has the advantage of being resynchronizable, for example,
even if it is sometimes bulkier -- and, in any event, if you use
compression you'll get roughly the same transmission length for any
Unicode format.
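Resynchronization works because UTF-8 continuation bytes are
self-marking (10xxxxxx): after a bad byte you can always skip to the
next byte that can start a character. A sketch (hypothetical helper):

    /* Skip past 10xxxxxx continuation bytes so decoding can restart
       at the next lead byte (or ASCII byte). */
    static const unsigned char *utf8_resync(const unsigned char *p,
                                            const unsigned char *end)
    {
        while (p < end && (*p & 0xC0) == 0x80)
            p++;
        return p;
    }

A raw UTF-16 or UTF-32 byte stream has no comparable in-band marker,
so a dropped byte can throw off everything that follows.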