Russ Cox wrote:
> These are my opinions, but they are the result of lots of time
> working with these issues.

Eek! I didn't mean to start such a debate... I appear to have struck a
nerve!

[...]
> 1. You should give up on trying to write an identifier name
> in one character set in one file and referring to it using
> a different character set in another source file.

I think this is reasonable. It fits the Lua philosophy to declare a
simple mechanism (the one I proposed) that *allows* source files to be
in whatever ASCII-compatible encoding you like, but doesn't require one.
This allows users to use UTF-8 or Latin-1 or Shift-JIS or whatever they
want.

It doesn't solve the issue of what happens if the user wants to do
something complicated, like mix encodings --- I think it's fair to
require the user to think first when doing that. If all else fails, it's
easy enough to just run your source through iconv first.

It also doesn't solve the normalisation problem, which is potentially
quite serious, but I don't think that's solvable without introducing
UTF-8 specific behaviour.

It will also fail on any encoding that uses low-bit bytes as part of an
extended sequence. If there's an encoding that uses <high> <low1> <low2>
as part of a single character, then <low1> and <low2> may potentially
confuse the parser. This scheme would only work reliably on encodings
where *all* bytes of an extended character have the top bit set. I
believe that includes UTF-8 and EUC-JP. Shift-JIS only partly qualifies:
its second byte can fall in the ASCII range (0x40-0x7E), so some
characters would still trip the parser.

And it doesn't make the string library support anything other than
ASCII, but then I don't think it's the default string library's *job* to
do that.

I agree with everything else you say, BTW, except that I usually like
processing strings as UTF-8. It's slower than UTF-16, but it does force
you to get it right in order to get it done at all, it's much less
memory-hungry (particularly for western languages), and in most cases
it's fast enough.

...
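To make the top-bit condition concrete, here is a rough Python sketch of
the idea. It is not Lua's lexer, and the names is_ident_byte,
lex_identifier and stray_low_bytes are made up for illustration. It
assumes the rule is "any byte >= 0x80 counts as an identifier character",
and that the encoding is stateless, so each character can be encoded on
its own.

```python
def is_ident_byte(b):
    """Assumed lexer rule: ASCII letters, digits, '_', or any byte >= 0x80."""
    return b >= 0x80 or b == 0x5F or chr(b).isalnum()


def lex_identifier(data):
    """Return the longest prefix of `data` made only of identifier bytes."""
    end = 0
    while end < len(data) and is_ident_byte(data[end]):
        end += 1
    return data[:end]


def stray_low_bytes(text, encoding):
    """List the bytes < 0x80 that appear inside multi-byte characters.

    An empty list means every extended character in `text` is made only
    of top-bit-set bytes, which is what the scheme needs.
    """
    stray = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1:
            stray.extend(b for b in encoded if b < 0x80)
    return stray


if __name__ == "__main__":
    so = "\u30bd"  # KATAKANA LETTER SO, a well-known awkward case
    for enc in ("utf-8", "euc_jp", "shift_jis"):
        source = (so + " = 1").encode(enc)
        print(enc, lex_identifier(source), stray_low_bytes(so, enc))
    # utf-8 b'\xe3\x82\xbd' []
    # euc_jp b'\xa5\xbd' []
    # shift_jis b'\x83' [92]
```

In the Shift-JIS case the second byte of the character is 0x5C (a
backslash in ASCII), so the identifier is cut off after one byte. Running
stray_low_bytes over a sample of the text you care about is a cheap way
to check whether a given encoding is safe under this rule.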
BTW, if you want to see true madness, check out the other UTF forms.
UTF-7 is bizarre enough. There were plans for UTF-5 for legacy teletype
systems and radio (it's compatible with Baudot code). And as for
UTF-EBCDIC...

-- 
╭─┈David Given┈──McQ─╮ "...electrons, nuclei and other particles are good
│┈ dg@cowlark.com┈┈┈┈│ approximations to perfectly elastic spherical
│┈(dg@tao-group.com)┈│ cows." --- David M. Palmer on r.a.sf.c
╰─┈www.cowlark.com┈──╯