Filenames with accented characters

Tue Nov 18 14:47:00 GMT 2003

Hi Ranjit,
please do not take what I am about to write as a personal attack. Rest
assured that I am only trying to help (although I may be failing to do so...).
IMHO your post is quite useful for this discussion. And BTW I very much
appreciate everything you have done for gcj/mingw...
I also hope this post will shed some light on certain things... If
someone is willing to give me such light (even if it is in solid form
;-) -- that would be better than simply getting no reply at all about
those things)...
As you might all have noticed, I am not a native English speaker/writer
(and I am not good at it). Sometimes I get the impression that my posts
are misinterpreted (my fault, I apologize for that)... And I wonder if
this is the reason why something as simple as character conversion of
filenames has taken many months to get solved in gcj/mingw (and is not
yet implemented in the FSF sources AFAIK). But that cannot be the reason,
since there are several simple things related to mingw that do not make
it into the FSF sources quickly...
Many people have said in past posts: "Solve it first, optimize later". 
And I agree with them! But I must be missing something here...
Ranjit Mathew wrote:
>Windows
>=======
>
>This is, as usual, an ugly beast. The primary
>issue here is that Windows NT based OSs (NT4/2K/XP)
>have the notion of both a System Locale and a
>User Locale, which are almost, *but not quite*,
>of the same status.
As far as character conversions are concerned, you must be referring to:
1 - CP_ACP (for "old" win32 applications).
2 - CP_OEMCP (for "ancient" DOS applications).
If your compiler honors the runtime values of CP_ACP and CP_OEMCP, you 
can use them (it does not seem to be the case for gcc). Otherwise please 
use GetACP() or GetOEMCP().
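To see which values a given machine actually reports, a minimal sketch along these lines may help. It is written in Python with ctypes purely for illustration (the language and the helper name active_code_pages are my assumptions, not something from this thread). It calls the kernel32 functions GetACP() and GetOEMCP() and simply returns None on non-Windows systems:

```python
import ctypes
import sys


def active_code_pages():
    """Return (ANSI code page, OEM code page) on Windows, else None.

    Hypothetical helper, for illustration only.
    """
    if sys.platform != "win32":
        return None  # GetACP()/GetOEMCP() exist only in the Win32 API
    kernel32 = ctypes.windll.kernel32
    return kernel32.GetACP(), kernel32.GetOEMCP()


print(active_code_pages())
```

On a Western European Windows installation the two numbers typically differ (for example 1252 and 850), which is where the console trouble comes from.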
The native character support in the Win32 NT-branch is based on the 
wchar_t C/C++ type using UTF-16 (LE) encoding. This includes the API 
W-functions...
(And AFAIK gcj Java-Strings are also based on the wchar_t C/C++ type
using UTF-16 (LE) encoding).
>Specifically, console applications can only
>display those glyphs that are supported by
>the character set of the System Locale, irrespective
>of what the User Locale is set to. GUI applications
>fortunately do not share this problem.
The rxvt terminal emulator used with Msys seems to have a "bug": it
appears to use CP_ACP (it should use CP_OEMCP to be equivalent to the
Windows console).
I would guess that, if your application implemented the same "bug",
your console problem would be solved.
>This problem is not visible for Western European
>languages, but is quite prominent for East Asian
>languages like Japanese, Chinese and Korean.
If you are referring to the CP_OEMCP problem, it is visible for Western
European languages too (I am Portuguese so I am well aware of it)...
only less visible (because these languages share the first 128 codes
with cp437 -- these first 128 codes are also shared with the Unicode
standard). But if you are using English then the problem is not visible
at all...
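A small sketch of that mismatch, assuming Windows-1252 as the CP_ACP value and cp437 as the CP_OEMCP value (both are assumptions; a Portuguese system may well report a different OEM code page such as 850 or 860). Python is used only because it ships both tables, and the function name shown_by_console is made up for the example:

```python
ANSI_CP = "cp1252"  # assumed CP_ACP value
OEM_CP = "cp437"    # assumed CP_OEMCP value


def shown_by_console(name):
    """Encode with the ANSI code page, then decode as an OEM console would."""
    raw = name.encode(ANSI_CP)
    return raw.decode(OEM_CP)


for name in ("readme.txt", "João.txt"):
    # ascii() keeps the output readable whatever the terminal encoding is
    print(ascii(name), "->", ascii(shown_by_console(name)))
```

The pure ASCII name survives, the accented one does not: byte 0xE3 is "ã" in Windows-1252 but "π" in cp437.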
>In spite of the above, applications must still honour
>the User Locale and list the above as a known
>limitation of the OS itself.
Does this situation result from an implementation choice or from a
*real* implementation limitation?
How much control do you have over your application and over the
available console?
(see suggestion above)
>To get the user locale, you have to call GetLocaleInfo( )
>Win32 method with LOCALE_USER_DEFAULT as the first
>parameter.
This is overkill for character conversions... You only need to use
CP_ACP (or GetACP()) for character conversions.
>Note that we haven't used wchar_t at all - on Solaris,
>this seems to be of an unknown encoding (UCS-4) that
>also seems to vary with Solaris releases, on Windows it
>is very likely UTF-16 (LE).
I have to call your attention to some facts here... again, please do
not take this personally. ;-)
1 - wchar_t is NOT an encoding. It is a C/C++ type (I am sure you are
aware of this, but please do not compare it to UTF-16 because the latter
is an encoding -- one could even store UTF-8 data in wchar_t units,
although this is not a great idea...).
2 - all windows API W-functions that I can recall use wchar_t at some 
point (this must include WriteConsoleW())...
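The first point can be checked quickly. In the sketch below (Python's ctypes, chosen only for brevity and not something used in this thread), the size of wchar_t comes from the platform's C ABI, while the encoding is a separate choice made when text is turned into bytes:

```python
import ctypes

# wchar_t is a storage type: its width is fixed by the platform's C ABI
# (2 bytes on Windows, 4 bytes on most Unix-like systems).
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("sizeof(wchar_t) =", wchar_size)

# UTF-16 and UTF-8 are encodings: the same character yields different
# byte sequences, whatever type is used to hold them.
text = "ã"
utf16_le = text.encode("utf-16-le")
utf8 = text.encode("utf-8")
print(utf16_le.hex(), utf8.hex())
```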
It is also a fact that you should be able to convert directly from
char-type UTF-8 to (and from) wchar_t-type UTF-16 using W-functions (or
your own functions). You should not need to know CP_ACP or CP_OEMCP to
do this in the win32 NT-branch!!! Search the win32 documentation at MS
and you should find the W-functions you need (but IMHO there is nothing
wrong with your previous strategy -- it can even be useful in some cases)...
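As an illustration of the "your own functions" route, here is a minimal sketch of a direct UTF-8 to UTF-16 (LE) conversion that never consults a code page. It is in Python for brevity and the function names are hypothetical. On Windows the usual API calls for this are MultiByteToWideChar() and WideCharToMultiByte() with CP_UTF8, which this sketch does not use:

```python
import struct


def utf8_to_utf16le(data):
    """Convert UTF-8 bytes to UTF-16 (LE) bytes without any code page."""
    out = bytearray()
    for ch in data.decode("utf-8"):       # decode UTF-8 into code points
        cp = ord(ch)
        if cp < 0x10000:
            out += struct.pack("<H", cp)  # fits in one 16-bit unit
        else:
            cp -= 0x10000                 # outside the BMP: surrogate pair
            out += struct.pack("<HH", 0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF))
    return bytes(out)


def utf16le_to_utf8(data):
    """Convert UTF-16 (LE) bytes back to UTF-8 bytes."""
    return data.decode("utf-16-le").encode("utf-8")


name = "João.txt".encode("utf-8")
wide = utf8_to_utf16le(name)
print(wide.hex())
print(utf16le_to_utf8(wide) == name)
```

The round trip is lossless for any valid UTF-8 input, which is the reason neither CP_ACP nor CP_OEMCP has to enter the picture here.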
And now trying to return to the topics of this mailing list (or should I 
say: "for something completely different" ;-) )...
GCJ would benefit from a *full* wchar_t implementation in the IO
library in only two ways:
1- speed (negligible IMHO, because this is not the limiting factor).
2- support for new languages that are only available in UTF-16 (LE) 
(this is also negligible at the moment IMHO).
(I also do not see why Win95/98/Me should be excluded from support if
it can easily be done -- it is also the easiest way to support both
branches -- 9X/Me and NT -- as things stand right now).
But Mohan has the "Binary Power" in this matter as far as I know, so 
it's Mohan's rules... ;-)
Unless someone with "Source-code Power" (I don't know who s/he might be) 
is willing to accept a simple char-based patch to the FSF sources... or 
maybe someone in the mingw project is willing to accept the patch for 
their binary distribution.
I CANNOT understand why something that can be so simple gets so
complicated...
I rest my case.
João