[1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

Jason Pyeron jpyeron@pdinc.us
Wed May 13 15:39:00 GMT 2009


Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead
>>> of SO/UTF-8. 
>>>
>>> There are three reasons:
>>
>> That's an interesting thought. Do you have a patch and, if so, did
>> you try it? Does it, for instance, help for the issue reported in
>> the thread starting at
>> http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread,
> I'm at a loss how to solve this problem in a generic way.
>
I may be dense, as all of my internationalization experience was from the late
'90s. But in my experience the only solution for this is a conscious effort on
the part of the user (or admin).
> The problem is that the filename changes dependent on the
> character set used in $LANG. The reason is that every time a
> multibyte filename has to be generated, it has to be
> converted from UTF-16 to multibyte.
>
> For instance, taking one of the filenames from Lenik's
> example. It's stored on the filesystem as the UTF-16
> sequence \u684c \u9762. If I set LANG to en_US.UTF-8, a
> readdir(2) call returns the multibyte sequence
>
>   0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>   0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>   0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the
> application, the idea of the filename differs. That's not
> exactly helpful for interoperability between different applications.
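
The three byte sequences quoted above can be reproduced with a short Python sketch. The helper name so_utf8 is made up for illustration: it prefixes each non-ASCII character's UTF-8 bytes with 0x0e, which matches the LANG=C bytes in this one example but is an assumption about the scheme, not Cygwin's actual conversion code. Python's codecs stand in for the UTF-16 to multibyte conversion.

```python
# Sketch only: reproduce the byte sequences quoted above for U+684C U+9762.
name = "\u684c\u9762"

def so_utf8(text):
    """Hypothetical helper: prefix each non-ASCII character's UTF-8 bytes with SO (0x0e)."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:      # multi-byte means non-ASCII
            out.append(0x0E)
        out += encoded
    return bytes(out)

def hexdump(data):
    """Format bytes the way they are written in the message above."""
    return " ".join(f"0x{b:02x}" for b in data)

utf8_bytes = name.encode("utf-8")   # LANG=en_US.UTF-8
gbk_bytes = name.encode("gbk")      # LANG=en_US.GBK
c_bytes = so_utf8(name)             # LANG=C (assumed scheme)

print("UTF-8:", hexdump(utf8_bytes))
print("GBK:  ", hexdump(gbk_bytes))
print("C:    ", hexdump(c_bytes))
```

The same two characters come out as three different byte strings, which is exactly the interoperability problem being described.
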
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>     is the way files are stored on disk. That results in unchangeable
>     filenames which are always valid.
>
>     But what if an application sets LANG="xxxx.SJIS" and tries to
>     create a file using SJIS character encoding? Should the file be
>     created using the SJIS->UTF-16 conversion or should open fail
>     with EILSEQ? That's not good.
>
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>     Cygwin uses the LC_CTYPE setting which corresponds to the current
>     codepage. If one of $LC_ALL/$LC_CTYPE/$LANG is set in the
>     environment,

If nothing is set, use UTF-8, as it will work with existing code.
> Cygwin uses that to convert pathnames. If the application uses
> setlocale, Cygwin uses that setting to convert pathnames.
>
> One problem can't be solved this way: If an application fetches
> and stores a filename, then switches the locale, and then tries
> to use the filename in another system call, the filename is
> potentially broken.

This is the user's problem to resolve.
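
To make the failure mode concrete, here is a minimal Python sketch of a filename fetched under one character set and reused under another. The function name survives_locale_switch is hypothetical, and Python's codecs are only a stand-in for whatever conversion Cygwin would perform.

```python
# Sketch only: a filename saved under one charset and reused under another.
name = "\u684c\u9762"

def survives_locale_switch(text, old_charset, new_charset):
    """Return True if bytes produced under old_charset still mean text under new_charset."""
    stored = text.encode(old_charset)        # what the application saved earlier
    try:
        return stored.decode(new_charset) == text
    except UnicodeDecodeError:
        return False                         # the saved bytes are no longer valid at all

print(survives_locale_switch(name, "gbk", "utf-8"))
print(survives_locale_switch(name, "utf-8", "utf-8"))
```

Only the application knows that it changed the locale between the two calls, which is why I think the fix has to live there.
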
>
> Any better ideas?
>
Not necessarily better, but here is a chart:

Sys    App    function expects/returns
NULL   NULL   UTF-8
C/UA   NULL   UTF-8
NULL   C/UA   UTF-8
C/UA   C/UA   UTF-8
SPEC   NULL   System Locale
SPEC   C/UA   UTF-8
NULL   SPEC   Application Locale
C/UA   SPEC   Application Locale
SPEC   SPEC   Application Locale

Key:
Sys  = System's current locale
App  = Application's current locale
NULL = No setting
C/UA = C or any Unicode aware locale
SPEC = Some other locale (e.g. SJIS)
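
The chart collapses to two rules, sketched below in Python. Both function names are hypothetical, and the way classify sorts a locale string into NULL, C/UA or SPEC (C, POSIX or a UTF-8 codeset count as C/UA, everything else as SPEC) is an assumption made for illustration, not an existing Cygwin API.

```python
# Sketch only: the chart above expressed as a lookup.
def classify(locale_name):
    """Sort a locale setting into the chart's NULL, C/UA or SPEC category (assumed rule)."""
    if not locale_name:
        return "NULL"
    codeset = locale_name.partition(".")[2].upper().replace("-", "")
    if locale_name in ("C", "POSIX") or codeset == "UTF8":
        return "C/UA"
    return "SPEC"

def filename_encoding(sys_locale, app_locale):
    """Return which encoding filename functions expect/return, per the chart."""
    sys_kind = classify(sys_locale)
    app_kind = classify(app_locale)
    if app_kind == "SPEC":
        return "Application Locale"   # an explicit application choice always wins
    if sys_kind == "SPEC" and app_kind == "NULL":
        return "System Locale"        # application is silent, system is specific
    return "UTF-8"                    # every remaining row of the chart

print(filename_encoding(None, None))
print(filename_encoding("ja_JP.SJIS", None))
print(filename_encoding("ja_JP.SJIS", "en_US.UTF-8"))
print(filename_encoding("C", "zh_CN.GBK"))
```
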
-jason
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
- -
- Jason Pyeron PD Inc. http://www.pdinc.us -
- Principal Consultant 10 West 24th Street #100 -
- +1 (443) 269-1555 x333 Baltimore, Maryland 21218 -
- -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
--
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
Problem reports: http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ: http://cygwin.com/faq/

