Python unicode and Windows cmd.exe

Alf P. Steinbach alfps at start.no
Sun Mar 14 20:37:04 EDT 2010


* Mark Tolonen:
> "Terry Reedy" <tjreedy at udel.edu> wrote in message 
> news:hnjkuo$n16$1 at dough.gmane.org...
> On 3/14/2010 4:40 PM, Guillermo wrote:
>> Adding the byte that some call a 'utf-8 bom' makes the file an invalid 
>> utf-8 file.
> Not true. From http://unicode.org/faq/utf_bom.html:
> Q: When a BOM is used, is it only in 16-bit Unicode text?
> A: No, a BOM can be used as a signature no matter how the Unicode text 
> is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising 
> the BOM will be whatever the Unicode character FEFF is converted into by 
> that transformation format. In that form, the BOM serves to indicate 
> both that it is a Unicode file, and which of the formats it is in. 
> Examples:
> Bytes         Encoding Form
> 00 00 FE FF   UTF-32, big-endian
> FF FE 00 00   UTF-32, little-endian
> FE FF         UTF-16, big-endian
> FF FE         UTF-16, little-endian
> EF BB BF      UTF-8
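
The table quoted above can be checked from Python. This is only a minimal sketch: the helper name hex_bytes and the dictionary signatures are made up for illustration, and the only library facts it relies on are the standard str.encode codecs and the BOM constants in the codecs module. It encodes U+FEFF with each explicit-endian codec and prints the resulting bytes.

```python
import codecs

# The BOM is just the character U+FEFF run through each encoding form.
BOM_CHAR = "\ufeff"

def hex_bytes(data):
    """Format a bytes object as space-separated upper-case hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)

# Explicit-endian codecs do not prepend a BOM of their own, so encoding
# U+FEFF with them yields exactly the signature bytes.
signatures = {
    "UTF-32, big-endian": BOM_CHAR.encode("utf-32-be"),
    "UTF-32, little-endian": BOM_CHAR.encode("utf-32-le"),
    "UTF-16, big-endian": BOM_CHAR.encode("utf-16-be"),
    "UTF-16, little-endian": BOM_CHAR.encode("utf-16-le"),
    "UTF-8": BOM_CHAR.encode("utf-8"),
}

for form, raw in signatures.items():
    print(f"{hex_bytes(raw):<13} {form}")

# The codecs module exposes the same byte strings as constants.
assert signatures["UTF-8"] == codecs.BOM_UTF8
assert signatures["UTF-16, little-endian"] == codecs.BOM_UTF16_LE
assert signatures["UTF-32, big-endian"] == codecs.BOM_UTF32_BE
```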

Well, technically true, and Terry was wrong about "There is no such thing as a 
utf-8 'byte order mark'. The concept is an oxymoron.". It's true that as a 
descriptive term "byte order mark" is an oxymoron for UTF-8. But in this 
particular context it's not a descriptive term, and it's not only technically 
allowed, as you point out, but sometimes required.

However, some tools are unable to process UTF-8 files with a BOM.

The most annoying example is the GCC compiler suite, in particular g++, which in 
its Windows MinGW manifestation insists on UTF-8 source code without a BOM, while 
Microsoft's compiler needs the BOM to recognize the file as UTF-8. The only 
way I found to satisfy both compilers, apart from a restriction to ASCII or 
perhaps Windows ANSI with wide character literals restricted to ASCII 
(exploiting a bug in g++ that lets it handle narrow character literals with 
non-ASCII chars), is to preprocess the source code. But that's not a general 
solution, since the g++ preprocessor, via another bug, accepts some constructs 
(which then compile nicely) that the compiler doesn't accept when explicit 
preprocessing isn't used. So it's a mess.
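
For tools that disagree about the UTF-8 signature, one possible workaround (not the preprocessing approach described above, and not claimed to work for any particular compiler version) is to add or strip the three BOM bytes in a build step. The following is a minimal sketch in Python; the function names strip_utf8_bom and add_utf8_bom and the sample source line are hypothetical, and only codecs.BOM_UTF8 and the "utf-8-sig" codec come from the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def add_utf8_bom(data: bytes) -> bytes:
    """Return data with exactly one leading UTF-8 BOM (hypothetical helper)."""
    return codecs.BOM_UTF8 + strip_utf8_bom(data)

# Illustrative source line containing non-ASCII characters.
source = 'const wchar_t* s = L"blåbær";\n'.encode("utf-8")

with_bom = add_utf8_bom(source)          # variant for a tool that wants the signature
without_bom = strip_utf8_bom(with_bom)   # variant for a tool that rejects it

# On the Python side, "utf-8-sig" drops a leading BOM when decoding,
# while plain "utf-8" keeps it as the character U+FEFF.
assert with_bom.decode("utf-8-sig") == source.decode("utf-8")
assert with_bom.decode("utf-8")[0] == "\ufeff"
```

Both helpers are idempotent, so running them over a file that is already in the wanted form leaves it unchanged.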

Cheers,

- Alf

