readdir() returns inaccessible name if file was created with invalid UTF-8

Tue Jul 22 03:38:10 GMT 2025

On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
> Hi Christian,
>
> On Jun 26 19:07, Christian Franke via Cygwin wrote:
>> Corinna Vinschen via Cygwin wrote:
>>> On Jun 25 16:59, Christian Franke via Cygwin wrote:
>>>> On 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>> does not refuse to create the file. Later readdir() returns a different
>>>>> name which could not be used to access the file.
>>>>>
>>>>> Testcase with U+1F321 (Thermometer):
>>>>>
>>>>> $ uname -r
>>>>> 3.5.4-1.x86_64
>>>>>
>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>  f0 9f 8c a1
>>>>>
>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>
>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>
>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>
>>>>> $ ls -1
>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>> ls: cannot access 'file3-': No such file or directory
>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>> file2-.?ext
>>>>> file3-
>>>>> [...]
>>> I don't know exactly where this happens, but the input of the
>>> conversion is invalid UTF-8 because it's missing the 4th byte.
>>> There's no way to represent these filenames on Windows
>>> filesystems storing filenames as UTF-16 values.
>>>
>>> So the problem here is that the conversion somehow misses that
>>> the 4th byte is invalid and just plods forward and converts the
>>> leading three bytes into the matching high surrogate value and
>>> then stumbles over the conversion for the low surrogate.
>>>
>>> It would be really helpful to have an STC for this problem.
>> With some trial and error I found a testcase for this more serious problem
>> reported yesterday but not quoted above:
>>
>>>> In cases like file3-... above, the converted Windows path ends with
>>>> 0xF000. This suggests that this is an accidental conversion of the
>>>> terminating null to the 0xF0xx range.
>>>>
>>>> In some cases, the created Windows file name has random garbage
>>>> behind the 0xF000. Then even Cygwin is not able to access or unlink
>>>> the file after creation.
>> Testcase (attached):
> Thanks for the testcase!
>
> I found the problem in the newlib core function creating wchar_t from
> UTF-8 input. In case of 4 byte UTF-8 sequences, the code created the
> low surrogate already after reading byte 3, without checking if byte 4
> of the UTF-8 sequence is a valid byte. Hilarity ensues.
I'm afraid the fix may have broken mbrtowc, and with it mintty, as I
just reported to the list with a test case.
The low surrogate MUST be created after byte 3 because otherwise the 
high surrogate cannot be delivered after byte 4 as it needs to.
I think it's a drawback of UTF-16 that must be swallowed, even if some 
incorrect sequences slip through somehow.
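
To make the surrogate arithmetic concrete, here is a minimal sketch of how
the four bytes of the thermometer testcase map onto a UTF-16 surrogate
pair. The helper name utf8_4_to_surrogates is hypothetical and this is not
the newlib code; it assumes a well-formed 4-byte sequence and says nothing
about which mbrtowc call delivers which half.

```c
#include <stdint.h>

/* Hypothetical helper, not newlib code: split a well-formed 4-byte
   UTF-8 sequence into its UTF-16 surrogate pair. */
void utf8_4_to_surrogates(const unsigned char b[4],
                          uint16_t *hi, uint16_t *lo)
{
    /* Payload bits: 3 from the lead byte, 6 from each continuation byte. */
    uint32_t cp = ((uint32_t)(b[0] & 0x07) << 18)
                | ((uint32_t)(b[1] & 0x3f) << 12)
                | ((uint32_t)(b[2] & 0x3f) << 6)
                |  (uint32_t)(b[3] & 0x3f);
    uint32_t v = cp - 0x10000;               /* 20-bit offset value */

    *hi = (uint16_t)(0xd800 | (v >> 10));    /* upper 10 bits: bytes 1 to 3 */
    *lo = (uint16_t)(0xdc00 | (v & 0x3ff));  /* lower 10 bits: bytes 3 and 4 */
}
```

For f0 9f 8c a1 this yields 0xd83c and 0xdf21. The upper ten bits come
entirely from the first three bytes, while the lower ten bits also need
the payload of byte 4.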
Thomas
> Fortunately this bug has only been introduced very recently, to wit, on
> 2009-03-24, a mere 16 years ago. And it is my bug and mine alone :}
>
> I'm just prep'ing a fix which I'll push in a minute or two.
>
>
> Thanks,
> Corinna
>
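
For illustration only, the kind of check the thread says was missing (is
every expected continuation byte really there?) can be sketched as below.
The function name utf8_well_formed is hypothetical, and the check is
deliberately simplified: it looks at lead bytes and continuation bytes but
does not reject overlong forms or encoded surrogates. It is not the fix
that went into newlib.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if s[0..n) consists only of complete
   UTF-8 sequences, else 0. Simplified: no overlong or surrogate checks. */
int utf8_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i++];
        size_t need;                    /* continuation bytes expected */

        if (c < 0x80)
            need = 0;                   /* ASCII */
        else if (c >= 0xc2 && c <= 0xdf)
            need = 1;                   /* 2-byte sequence */
        else if (c >= 0xe0 && c <= 0xef)
            need = 2;                   /* 3-byte sequence */
        else if (c >= 0xf0 && c <= 0xf4)
            need = 3;                   /* 4-byte sequence */
        else
            return 0;                   /* stray continuation or bad lead */

        for (; need > 0; need--, i++) {
            /* Truncated input or a non-continuation byte is malformed. */
            if (i >= n || (s[i] & 0xc0) != 0x80)
                return 0;
        }
    }
    return 1;
}
```

Applied to the three names from the testcase, this accepts the file1 name
and rejects the file2 and file3 names, whose f0 9f 8c sequence lacks its
fourth byte.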