Universal Character Names, v2

Sun Dec 1 16:24:00 GMT 2002

Zack Weinberg wrote:-
> modulo the fact that we may not support binary encodings yet.

I've had more thoughts about arbitrary charsets. Rather than converting
to UTF-8 on a per-character basis, the obvious approach is to convert
a line at a time from the new-line handler (plus a call when starting
a buffer to get the process started). This should remove most of the
conversion overhead. We're best using our own converters, and adding
them one-by-one on demand (a la GNAT), rather than relying on host
implementations of mbtowc or iconv, IMO. Since the converters are
already scanning the line, they may as well do trigraph conversion at
the same time, and possibly splice lines.
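
To make the idea concrete, here is a rough sketch of what one such
converter might look like, taking ISO-8859-1 as the source charset.
The names convert_line and trigraph_char are invented for this
illustration and are not existing cpplib functions; buffer management
and the hook into the new-line handler are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given the third character of a trigraph (the
   one after the two question marks), return its replacement, or 0 if
   the sequence is not a trigraph.  */
static unsigned char
trigraph_char (unsigned char c)
{
  switch (c)
    {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical converter: turn one ISO-8859-1 line IN of LEN bytes
   (without its new-line) into UTF-8 in OUT, replacing trigraphs in
   the same pass.  OUT needs room for 2 * LEN + 1 bytes.  Returns the
   number of bytes written, not counting the terminating NUL.  *SPLICE
   is set to 1 if the converted line ended in a backslash (which is
   dropped), meaning the caller should append the next line.  */
size_t
convert_line (const unsigned char *in, size_t len,
              unsigned char *out, int *splice)
{
  size_t i = 0, o = 0;

  assert (in != NULL && out != NULL && splice != NULL);
  while (i < len)
    {
      unsigned char c = in[i];
      unsigned char t;

      if (c == '?' && i + 2 < len && in[i + 1] == '?'
          && (t = trigraph_char (in[i + 2])) != 0)
        {
          out[o++] = t;         /* Three input bytes become one.  */
          i += 3;
        }
      else if (c < 0x80)
        {
          out[o++] = c;         /* ASCII passes through unchanged.  */
          i++;
        }
      else
        {
          /* U+0080..U+00FF need two bytes in UTF-8.  */
          out[o++] = 0xC0 | (c >> 6);
          out[o++] = 0x80 | (c & 0x3F);
          i++;
        }
    }

  *splice = (o > 0 && out[o - 1] == '\\');
  if (*splice)
    o--;                        /* Drop the backslash before joining.  */
  out[o] = '\0';
  return o;
}
```

Because trigraphs are replaced before the backslash check, a line
ending in the trigraph for backslash is spliced too, which matches the
order of the translation phases.
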
That would leave the question of whether we also run such a scan in
the normal case, where no charset conversion is needed (and thereby
stop mmapping and drop our NUL trick).
Good caret diagnostics in this situation are best handled, I think,
by changing from line/col location via line-map to a single "unsigned
int" representing the position in the translation unit in logical
characters.
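
As a rough illustration of how line and column could still be
recovered from such a position when printing a diagnostic, here is a
minimal sketch. It assumes we record the position at which each
logical line starts as it is converted; struct line_table and both
functions are hypothetical and are not the existing line-map
interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINES 1024          /* Arbitrary limit for this sketch.  */

/* Hypothetical table: start[i] is the position, in logical
   characters from the start of the translation unit, of the first
   character of line i + 1.  */
struct line_table
{
  unsigned int start[MAX_LINES];
  size_t count;
};

/* Record that a new line begins at START_POS.  Lines must be added
   in increasing order of position.  */
void
line_table_add (struct line_table *t, unsigned int start_pos)
{
  assert (t->count < MAX_LINES);
  assert (t->count == 0 || t->start[t->count - 1] <= start_pos);
  t->start[t->count++] = start_pos;
}

/* Recover the 1-based line and column of POS by binary search for
   the last line that starts at or before POS.  */
void
line_table_lookup (const struct line_table *t, unsigned int pos,
                   unsigned int *line, unsigned int *col)
{
  size_t lo = 0, hi = t->count;

  assert (t->count > 0 && t->start[0] <= pos);
  /* Invariant: start[lo] <= pos, and every line from hi on starts
     after pos.  */
  while (hi - lo > 1)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (t->start[mid] <= pos)
        lo = mid;
      else
        hi = mid;
    }
  *line = (unsigned int) lo + 1;
  *col = pos - t->start[lo] + 1;
}
```

With line starts recorded at positions 0, 10 and 25, position 30
comes back as line 3, column 6. Tokens then only need to carry the
single integer, and the lookup cost is paid only when a diagnostic is
actually issued.
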
Thoughts?
Neil.