readdir() returns inaccessible name if file was created with invalid UTF-8

Wed Jul 23 03:44:23 GMT 2025

On 23.07.2025 at 04:25, Thomas Wolff via Cygwin wrote:
> On 22.07.2025 at 17:09, Thomas Wolff via Cygwin wrote:
>> On 22.07.2025 at 15:05, Corinna Vinschen wrote:
>>> On Jul 22 05:38, Thomas Wolff via Cygwin wrote:
>>>> On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
>>>>> On Jun 26 19:07, Christian Franke via Cygwin wrote:
>>>>>> With some trial and error I found a testcase for this more
>>>>>> serious problem reported yesterday but not quoted above:
>>>>>>
>>>>>>>> In cases like file3-... above, the converted Windows path ends
>>>>>>>> with 0xF000. This suggests that this is an accidental conversion
>>>>>>>> of the terminating null to the 0xF0xx range.
>>>>>>>>
>>>>>>>> In some cases, the created Windows file name has random garbage
>>>>>>>> behind the 0xF000. Then even Cygwin is not able to access or
>>>>>>>> unlink the file after creation.
>>>>>>
>>>>>> Testcase (attached):
>>>>> Thanks for the testcase!
>>>>>
>>>>> I found the problem in the newlib core function creating wchar_t from
>>>>> UTF-8 input.  In case of 4 byte UTF-8 sequences, the code created the
>>>>> low surrogate already after reading byte 3, without checking if
>>>>> byte 4 of the UTF-8 sequence is a valid byte. Hilarity ensues.
>>>> I'm afraid the fix may have broken mbrtowc as I just reported to
>>>> the list, with a test case, thus also breaking mintty.
>>>> The low surrogate MUST be created after byte 3 because otherwise
>>>> the high surrogate cannot be delivered after byte 4 as it needs to.
>>>> I think it's a drawback of UTF-16 that must be swallowed, even if some
>>>> incorrect sequences slip through somehow.
>>> Bummer.  What bugs me most is that you might be right here. It's a bit
>>> late, but we should have defined wchar_t as a 4 byte type back when we
>>> worked on Cygwin 1.7.0... sigh.
>>>
>>> mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
>>> function which only works really correctly for the Unicode base plane,
>>> or if wchar_t is big enough.
>>>
>>> It's the reason we don't use mbrtowc() if possible.  It's better to
>>> call mbstowcs() or friends and allow at least 3 chars in the wchar_t
>>> buffer.  You can't change that in mintty by any chance?
>> Well, I've started to think about a workaround but it's code I've 
>> never touched before and I'd need to carefully ponder about all kinds 
>> of possible special situations, so my testing effort would be high. 
>> Also, I'd need to implement bytewise mbr collection which is right 
>> now done by that function.
>> Since not using mbrtowc anymore would leave it still broken (and what 
>> other software may fall into that trap...), I'd prefer a fix of that 
>> function anyway.
> I've checked whether to use the old version of mbrtowc from newlib 
> directly in mintty but it pulls too many dependencies...
> I've also checked whether to use _mbrtowc_r instead which is defined 
> in wchar.h but it does not link.
> By the way, discussion and commit log mix up the order: the high 
> surrogate comes first.
>

OK, suppose I switched to mbs[[n]r]towcs, collecting bytes until the
function gives me a result.
This would work fine as long as I receive only valid sequences. But
consider this input string as a test case:
// an invalid sequence followed by a valid char
char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0};
The functions only return -1 and (in the case of mbsnrtowcs) do not
advance the input pointer.
So how am I supposed to recognize that the invalid sequence has ended
and a valid character has arrived?
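
One conceivable resynchronization policy, given only as a rough sketch and not as what newlib, Cygwin, or mintty actually do: when the conversion function reports an error at the current input position, discard the offending byte together with any UTF-8 continuation bytes (0x80..0xBF) that directly follow it, and resume at the next byte that could start a character. The helper names `is_continuation` and `skip_invalid_run` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: the caller has already been told (e.g. by a -1
 * return from a conversion function) that s[0] starts an invalid
 * sequence.  Return how many of the n available bytes to discard:
 * the offending byte itself plus every continuation byte directly
 * after it.  This is one possible policy, assumed for illustration.
 */
size_t skip_invalid_run(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    if (n == 0)
        return 0;
    size_t i = 1;                       /* always drop the failing byte */
    while (i < n && is_continuation(s[i]))
        i++;                            /* drop trailing continuation bytes */
    return i;
}
```

For the test string above, this would discard the four bytes 0xF8 0x88 0x8A 0xAF and leave the input pointer at 0x2D. A caller would presumably emit a single replacement character for the discarded run, reset its mbstate_t, and then continue converting from the new position. Other policies (for instance, one replacement character per discarded byte) are equally defensible.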
>>>> Thomas
>>>>> Corinna
>>>>>>