[Python-Dev] PEP 263 considered faulty (for some Japanese)
Stephen J. Turnbull
stephen@xemacs.org
13 Mar 2002 18:11:42 +0900
>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:
Martin> Reliable detection of encodings is a good thing, though,
I would think that UTF-8 can be quite reliably detected without the
"BOM".
I suppose you could construct short ambiguous sequences easily for
ISO-8859-[678] (which are meaningful in the corresponding natural
language), but it seems that even a couple dozen such characters would
make the odds astronomical that text found "in the wild" which is
syntactically valid UTF-8 really is intended to be UTF-8 Unicode
(assuming you're expecting a text file, such as Python source).
Is that wrong? Have you any examples? I'd be
interested to see them; we (XEmacs) have some ideas about
"statistical" autodetection of encodings, and they'd be useful test
cases.
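Roughly, the kind of check I mean is nothing fancier than a strict
decode; a sketch only (the helper name is illustrative, not anything
PEP 263 specifies):

    def looks_like_utf8(data):
        # True if the byte string is syntactically valid UTF-8.
        # Random ISO-8859-x text rarely lines up the lead-byte /
        # continuation-byte patterns by accident, so a clean decode
        # of more than a handful of non-ASCII characters strongly
        # suggests UTF-8 was intended.
        try:
            data.decode("utf-8")
        except UnicodeError:
            return False
        return True

Statistical detection only has to take over when that check fails, or
when the data is pure ASCII and the question is moot anyway.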
Martin> as the Web has demonstrated.
But the Web in general provides (mandatory) protocols for identifying
content-type, yet I regularly see HTML files with incorrect http-equiv
meta elements, and XHTML containing Shift JIS with no encoding
declaration at all. Microsoft software for Japanese apparently ignores
Content-Type
headers and the like in favor of autodetection (probably because the
same MS software regularly relies on users to set things like charset
parameters in MIME Content-Type).
I can't tell my boss that his mail is ill-formed (well, not to any
effect). So I'd really love to be able to watch his face when Python
2.3 tells him his program is not legally encoded.
But I guess that's not a convincing enough reason for Guido to mandate
UTF-8. <wink>
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.