readdir() returns inaccessible name if file was created with invalid UTF-8

Wed Jul 23 02:25:36 GMT 2025

On 22.07.2025 at 17:09, Thomas Wolff via Cygwin wrote:
> On 22.07.2025 at 15:05, Corinna Vinschen wrote:
>> On Jul 22 05:38, Thomas Wolff via Cygwin wrote:
>>> On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
>>>> On Jun 26 19:07, Christian Franke via Cygwin wrote:
>>>>> With some trial and error I found a testcase for this more serious
>>>>> problem reported yesterday but not quoted above:
>>>>>
>>>>>>> In cases like file3-... above, the converted Windows path ends with
>>>>>>> 0xF000. This suggests that this is an accidental conversion of the
>>>>>>> terminating null to the 0xF0xx range.
>>>>>>>
>>>>>>> In some cases, the created Windows file name has random garbage
>>>>>>> behind the 0xF000. Then even Cygwin is not able to access or unlink
>>>>>>> the file after creation.
>>>>> Testcase (attached):
>>>> Thanks for the testcase!
>>>>
>>>> I found the problem in the newlib core function creating wchar_t from
>>>> UTF-8 input.  In case of 4 byte UTF-8 sequences, the code created the
>>>> low surrogate already after reading byte 3, without checking if byte 4
>>>> of the UTF-8 sequence is a valid byte. Hilarity ensues.
>>> I'm afraid the fix may have broken mbrtowc, as I just reported to the
>>> list with a test case, thus also breaking mintty.
>>> The low surrogate MUST be created after byte 3 because otherwise the
>>> high surrogate cannot be delivered after byte 4 as it needs to.
>>> I think it's a drawback of UTF-16 that must be swallowed, even if some
>>> incorrect sequences slip through somehow.
>> Bummer.  What bugs me most is that you might be right here. It's a bit
>> late, but we should have defined wchar_t as a 4 byte type back when we
>> worked on Cygwin 1.7.0... sigh.
>>
>> mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
>> function which only works really correctly for the unicode base plane,
>> or if wchar_t is big enough.
>>
>> It's the reason we don't use mbrtowc() if possible.  It's better to call
>> mbstowcs() or friends and allow at least 3 chars in the wchar_t buffer.
>> You can't change that in mintty by any chance?
> Well, I've started to think about a workaround, but it's code I've never
> touched before and I'd need to carefully think through all kinds of
> possible special situations, so my testing effort would be high.
> Also, I'd need to implement bytewise mbr collection, which is right now
> done by that function.
> Since not using mbrtowc anymore would still leave it broken (and what
> other software may fall into that trap...), I'd prefer a fix of that
> function anyway.
I've checked whether to use the old version of mbrtowc from newlib
directly in mintty, but it pulls in too many dependencies...
I've also checked whether to use _mbrtowc_r instead, which is declared in
wchar.h, but it does not link.
By the way, the discussion and the commit log mix up the order: the high
surrogate comes first.
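
To illustrate the ordering, here is a minimal sketch in plain C (not the newlib code; the function names are made up for this example). It shows that the high surrogate of a 4-byte UTF-8 sequence depends only on bytes 1 to 3, while the low surrogate needs bits from bytes 3 and 4, so byte 4 should be checked as a valid continuation byte before the low surrogate is delivered.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true if b has the form 10xxxxxx. */
int is_continuation_byte(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: high surrogate from the first three bytes of a
   4-byte UTF-8 sequence (code points U+10000..U+10FFFF). */
uint16_t high_surrogate_after_byte3(uint8_t b1, uint8_t b2, uint8_t b3)
{
    /* Precondition: 11110xxx lead byte followed by two continuation bytes. */
    assert((b1 & 0xF8) == 0xF0);
    assert(is_continuation_byte(b2) && is_continuation_byte(b3));

    /* Bits 20..6 of the code point, i.e. cp >> 6. */
    uint32_t top = ((uint32_t)(b1 & 0x07) << 12)
                 | ((uint32_t)(b2 & 0x3F) << 6)
                 |  (uint32_t)(b3 & 0x3F);

    /* (cp - 0x10000) >> 10, computed without knowing byte 4:
       0x10000 >> 6 is 0x400, and the remaining shift is 4. */
    return (uint16_t)(0xD800 + ((top - 0x400) >> 4));
}

/* Hypothetical helper: low surrogate from bytes 3 and 4. The caller is
   expected to have validated b4 with is_continuation_byte() first. */
uint16_t low_surrogate_after_byte4(uint8_t b3, uint8_t b4)
{
    /* Low 10 bits of the code point: 4 bits from byte 3, 6 from byte 4. */
    return (uint16_t)(0xDC00 | ((uint32_t)(b3 & 0x0F) << 6)
                             |  (uint32_t)(b4 & 0x3F));
}
```

For U+1F600 (UTF-8 bytes F0 9F 98 80), the first function returns 0xD83D after byte 3 and the second returns 0xDE00 after byte 4.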
>> Thomas
>>> Corinna