readdir() returns inaccessible name if file was created with invalid UTF-8

Thomas Wolff towo@towo.net
Thu Jul 24 17:45:16 GMT 2025


On 24.07.2025 at 17:35, Corinna Vinschen wrote:
> Thomas,
>
> On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
>>>> On 22.07.2025 at 15:05, Corinna Vinschen wrote:
>>>>> mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
>>>>> function which only works really correctly for the unicode base plane,
>>>>> or if wchar_t is big enough.
>>>>>
>>>>> It's the reason we don't use mbrtowc() if possible.  It's better to call
>>>>> mbstowcs() or friends and allow at least 3 chars in the wchar_t buffer.
>>>>> You can't change that in mintty by any chance?
>>> [...]
>> OK, suppose I'd consider to switch to mbs[[n]r]towcs, collecting bytes until
>> the function gives me a result.
>> This would work fine as long as I receive only valid sequences. But look at
>> input string test case
>> char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence
>> followed by a valid char
>> The functions only return -1 and (in the case of mbsnrtowcs) do not advance
>> the input pointer.
>> So how am I supposed to recognize that the invalid sequence has ended and a
>> valid character has arrived?
> Apart from that, you probably still have a problem in mintty: GB18030.
>
> The problem with GB18030 is that you need all four bytes to generate
> the high surrogate.
>
> Consider the following GB18030 string: 0x90 0x30 0x81 0x30
>
> This string translates into a UTF-16 surrogate pair: 0xd800 0xdc00.
>
> If you run a tweaked version of your test application from
> https://cygwin.com/pipermail/cygwin/2025-July/258513.html:
>
>   setlocale (LC_CTYPE, "zh_CN.gb18030");
>   mb (0x90);
>   mb (0x30);
>   mb (0x81);
>   mb (0x30);
>
> Then the output is:
>
>   90 -> 0000 : -2
>   30 -> 0000 : -2
>   81 -> 0000 : -2
>   30 -> D800 : 0
>
> However, if you notice this situation...
>
>   if (ret_from_mbrtowc == 0 && codeset == gb18030
>       && (pwc & 0xfc00) == 0xd800)
>
> ...you can just add a fake NUL byte:
>
>   mbrtowc (&wc, '\0', 1, &mbstate);
>
> If you do that, the above sequence becomes:
>
>   90 -> 0000 : -2
>   30 -> 0000 : -2
>   81 -> 0000 : -2
>   30 -> D800 : 0
>   00 -> DC00 : 1
>
> I hope this helps, if you didn't already handle GB18030 differently
> in mintty.
Oooff. No, I didn't. So that is already before 3.6.4 (and again 3.6.5), 
right?
Thanks for the notice, I'll check and test your workaround.
Thomas
> Corinna
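
For reference, the workaround quoted above might be wrapped up roughly as
follows. This is only a sketch under assumptions: the names
is_high_surrogate(), needs_fake_nul() and feed_byte(), and the
codeset_is_gb18030 flag, are hypothetical and come from neither mintty nor
Cygwin. It relies on the behaviour described in the quoted message (16-bit
wchar_t, mbrtowc() returning 0 together with a high surrogate). Where
wchar_t is 32 bits wide, the extra branch should simply never be taken.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* True if wc is a UTF-16 high (leading) surrogate, 0xD800..0xDBFF.
   The shift also rules out larger values when wchar_t is 32 bits wide. */
static int is_high_surrogate(wchar_t wc)
{
    return ((unsigned long)wc >> 10) == 0x36UL;
}

/* The condition from the quoted message: mbrtowc() returned 0, the
   codeset is GB18030, and the unit delivered is a high surrogate. */
static int needs_fake_nul(size_t ret, wchar_t wc, int codeset_is_gb18030)
{
    return ret == 0 && codeset_is_gb18030 && is_high_surrogate(wc);
}

/* Hypothetical helper: feed one input byte to mbrtowc() and store the
   resulting wchar_t units in out[]. Returns the number of units stored:
   0 (incomplete or invalid input), 1, or 2 (surrogate pair). */
static size_t feed_byte(char byte, int codeset_is_gb18030,
                        mbstate_t *ps, wchar_t out[2])
{
    wchar_t wc = 0;
    size_t ret;
    size_t n = 0;

    assert(ps != NULL && out != NULL);

    ret = mbrtowc(&wc, &byte, 1, ps);
    if (ret == (size_t)-2)
        return 0;                     /* sequence still incomplete */
    if (ret == (size_t)-1) {
        memset(ps, 0, sizeof *ps);    /* assumption: reset state and drop the byte */
        return 0;
    }

    out[n++] = wc;
    if (needs_fake_nul(ret, wc, codeset_is_gb18030)) {
        /* Feed a fake NUL byte to fetch the pending low surrogate. */
        ret = mbrtowc(&wc, "", 1, ps);
        if (ret != (size_t)-1 && ret != (size_t)-2)
            out[n++] = wc;
    }
    return n;
}
```

A caller would zero-initialise an mbstate_t once, call feed_byte() for each
incoming byte, and forward whatever lands in out[]. What to do with invalid
sequences (the -1 case) remains open here; resetting the state is only one
possible choice.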


