readdir() returns inaccessible name if file was created with invalid UTF-8
Thomas Wolff
towo@towo.net
Wed Jul 23 03:44:23 GMT 2025
On 23.07.2025 at 04:25, Thomas Wolff via Cygwin wrote:
> On 22.07.2025 at 17:09, Thomas Wolff via Cygwin wrote:
>> On 22.07.2025 at 15:05, Corinna Vinschen wrote:
>>> On Jul 22 05:38, Thomas Wolff via Cygwin wrote:
>>>> On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
>>>>> On Jun 26 19:07, Christian Franke via Cygwin wrote:
>>>>>> With some trial and error I found a testcase for this more serious
>>>>>> problem reported yesterday but not quoted above:
>>>>>>
>>>>>>>> In cases like file3-... above, the converted Windows path ends
>>>>>>>> with 0xF000. This suggests that this is an accidental conversion
>>>>>>>> of the terminating null to the 0xF0xx range.
>>>>>>>>
>>>>>>>> In some cases, the created Windows file name has random garbage
>>>>>>>> behind the 0xF000. Then even Cygwin is not able to access or
>>>>>>>> unlink the file after creation.
>>>>>>
>>>>>> Testcase (attached):
>>>>> Thanks for the testcase!
>>>>>
>>>>> I found the problem in the newlib core function creating wchar_t from
>>>>> UTF-8 input. In case of 4 byte UTF-8 sequences, the code created the
>>>>> low surrogate already after reading byte 3, without checking if byte 4
>>>>> of the UTF-8 sequence is a valid byte. Hilarity ensues.
>>>> I'm afraid the fix may have broken mbrtowc as I just reported to the
>>>> list, with a test case, thus also breaking mintty.
>>>> The low surrogate MUST be created after byte 3 because otherwise the
>>>> high surrogate cannot be delivered after byte 4 as it needs to.
>>>> I think it's a drawback of UTF-16 that must be swallowed, even if some
>>>> incorrect sequences slip through somehow.
>>> Bummer. What bugs me most is that you might be right here. It's a bit
>>> late, but we should have defined wchar_t as a 4 byte type back when we
>>> worked on Cygwin 1.7.0... sigh.
>>>
>>> mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
>>> function which only works really correctly for the Unicode base plane,
>>> or if wchar_t is big enough.
>>>
>>> It's the reason we don't use mbrtowc() if possible. It's better to call
>>> mbstowcs() or friends and allow at least 3 chars in the wchar_t buffer.
>>> You can't change that in mintty by any chance?
>> Well, I've started to think about a workaround but it's code I've
>> never touched before and I'd need to carefully ponder about all kinds
>> of possible special situations, so my testing effort would be high.
>> Also, I'd need to implement bytewise mbr collection which is right
>> now done by that function.
>> Since not using mbrtowc anymore would leave it still broken (and what
>> other software may fall into that trap...), I'd prefer a fix of that
>> function anyway.
> I've checked whether to use the old version of mbrtowc from newlib
> directly in mintty but it pulls too many dependencies...
> I've also checked whether to use _mbrtowc_r instead which is defined
> in wchar.h but it does not link.
> By the way, discussion and commit log mix up the order: the high
> surrogate comes first.
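
To make the surrogate order concrete, here is a minimal sketch of how a code point above U+FFFF splits into its UTF-16 pair. The helper name to_surrogates is hypothetical and is not taken from newlib; it only illustrates the arithmetic, with the high (leading) surrogate produced first and the low (trailing) one second.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only (not newlib code).
   Splits a supplementary-plane code point into a UTF-16 pair:
   *hi is the high (leading) surrogate, 0xD800..0xDBFF,
   *lo is the low (trailing) surrogate, 0xDC00..0xDFFF. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                            /* leaves a 20-bit value */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* upper 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* lower 10 bits */
}
```

For U+1F600 this yields 0xD83D followed by 0xDE00. The lower 10 bits of the pair include bits from byte 4 of the UTF-8 sequence, so the complete low surrogate is only known once byte 4 has been read.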

OK, suppose I'd consider switching to mbs[[n]r]towcs, collecting bytes
until the function gives me a result.
This would work fine as long as I receive only valid sequences. But look
at this input string test case:
char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence followed by a valid char
The functions only return -1 and (in the case of mbsnrtowcs) do not
advance the input pointer.
So how am I supposed to recognize that the invalid sequence has ended
and a valid character has arrived?
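
For illustration, one possible way to resynchronize is to classify the bytes by hand rather than rely on the libc functions. The following is only a sketch under that assumption: decode_one and decode_all are hypothetical names, the policy of emitting one U+FFFD per rejected lead or stray byte is an arbitrary choice, and a sequence cut off at the end of the buffer is treated as invalid, whereas a terminal would wait for more input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* Hypothetical helper: decode one code point from s[0..n-1], n > 0.
   Always consumes at least one byte, so the caller makes progress
   even on garbage. Invalid input yields U+FFFD. */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(n > 0);
    unsigned char b = s[0];
    size_t len;
    uint32_t c, min;

    if (b < 0x80) { *cp = b; return 1; }
    else if (b >= 0xC2 && b <= 0xDF) { len = 2; c = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0)     { len = 3; c = b & 0x0F; min = 0x800; }
    else if (b >= 0xF0 && b <= 0xF4) { len = 4; c = b & 0x07; min = 0x10000; }
    else { *cp = REPLACEMENT; return 1; }  /* stray continuation or bad lead such as 0xF8 */

    for (size_t i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            *cp = REPLACEMENT;
            return i;  /* the offending byte is examined again on the next call */
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *cp = REPLACEMENT;
        return len;
    }
    *cp = c;
    return len;
}

/* Hypothetical driver: decode a whole buffer into out[0..cap-1]
   and return the number of code points written. */
static size_t decode_all(const unsigned char *s, size_t n,
                         uint32_t *out, size_t cap)
{
    size_t i = 0, k = 0;
    while (i < n && k < cap)
        i += decode_one(s + i, n - i, &out[k++]);
    return k;
}
```

Fed the five bytes 0xF8 0x88 0x8A 0xAF 0x2D, this produces four U+FFFD followed by U+002D, so the valid '-' is recognized as soon as it arrives. A 4-byte sequence whose last byte is not a continuation byte gives one U+FFFD, after which that byte is decoded on its own.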
>>>> Thomas
>>>>> Corinna
>>>>>>