readdir() returns inaccessible name if file was created with invalid UTF-8

Thomas Wolff towo@towo.net
Thu Jul 24 17:43:22 GMT 2025


On 24.07.2025 at 16:08, Corinna Vinschen wrote:
> On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
>> On 24.07.2025 at 12:30, Corinna Vinschen wrote:
>>> What does that mean? Consider this UTF-8 input string:
>>>
>>>   0xf0 0x90 0x80 0x2e
>>>
>>>   mbstowcs: returns -1
>>>   sys_mbstowcs: f0f0 f090 f080 002e
>>>
>>> Let's convert it back to multibyte:
>>>
>>>   sys_wcstombs: 0xf0 0x90 0x80 0x2e
>>>   wcstombs: 0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
>>>
>>> So while sys_wcstombs has special code converting the string back to its
>>> original MB string, wcstombs converts to the CESU-8 representation.
>>>
>>> This is transparent. If we convert this CESU-8 string back to
>>> wide-char, the resulting wide-char strings are the same:
>>>
>>>   mbstowcs: f0f0 f090 f080 002e
>>>   sys_mbstowcs: f0f0 f090 f080 002e
>>>
>>> So the question here is, shall we keep the special case converting
>>> private use area bytes back to their original byte encoding?
>>>
>>> Or shall we simply go along with CESU-8 when converting back to multibyte
>>> to keep the string the same as with wcstombs?
>>>
>>> Exempt from this are the characters not valid in a DOS filename.
>>> These will always be converted if we create wide-char filenames.
>> Sounds like a fair solution with only minor glitches. Poor 4th byte but
>> thanks a lot anyway.
>> About the latter decision, if there's no strong bias otherwise, I'd prefer
>> to drop special handling (but don't take my vote, I don't care so much about
>> that).
> Thanks for your input.
>
> As another datapoint we have to consider how sys_wcstombs is used.
>
> wcstombs on a filename will be used by the application only, and only if
> the filename is incoming application level data or has been converted to a
> wide char by the application itself.
>
> sys_wcstombs will be used to generate a readable multi-byte filename from
> UTF-16 filenames read from the filesystem. So its major use in terms of
> filenames is by readdir().
>
> Knowing that, the question boils down to this:
>
> Do we want readdir() returning the same name as given to open(), or is
> CESU-8 sufficient?
You mean for "normal" cases (i.e. proper non-BMP characters, not invalid
stuff or specially handled or private-range characters)?
In that case, as an application developer, I'd neither expect nor wish to
handle CESU-8.
Thomas
>
> Corinna
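
For illustration, the round trip discussed in the quoted example can be
modelled in a few lines of Python. This is only a sketch, not Cygwin's
implementation: the names model_sys_mbstowcs, model_sys_wcstombs,
model_wcstombs and cesu8_encode are hypothetical, and the mapping of an
undecodable byte b to the code unit 0xf000 + b (and back again for
0xf080..0xf0ff) is an assumption read off the "f0f0 f090 f080" output quoted
above. The last helper shows what CESU-8 would mean for a proper non-BMP
character, which is the "normal" case asked about.

```python
PUA_BASE = 0xF000  # assumption: invalid byte b becomes code unit 0xF000 + b


def model_sys_mbstowcs(data: bytes) -> str:
    """Decode UTF-8; map every undecodable byte into the private use area."""
    # surrogateescape turns each undecodable byte b into U+DC00 + b,
    # which is then shifted to PUA_BASE + b to mimic the quoted output.
    text = data.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(PUA_BASE + ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


def model_wcstombs(wide: str) -> bytes:
    """Plain encoder: private use characters are encoded like any other."""
    return wide.encode("utf-8")


def model_sys_wcstombs(wide: str) -> bytes:
    """Encoder with the special case: escaped bytes become raw bytes again."""
    out = bytearray()
    for ch in wide:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)  # restore the original byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit on its own, as CESU-8 does."""
    units = text.encode("utf-16-be", errors="surrogatepass")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        out += chr(unit).encode("utf-8", errors="surrogatepass")
    return bytes(out)


raw = bytes([0xF0, 0x90, 0x80, 0x2E])      # truncated 4-byte sequence, then "."
wide = model_sys_mbstowcs(raw)
print([f"{ord(c):04x}" for c in wide])     # ['f0f0', 'f090', 'f080', '002e']
print(model_sys_wcstombs(wide).hex(" "))   # f0 90 80 2e
print(model_wcstombs(wide).hex(" "))       # ef 83 b0 ef 82 90 ef 82 80 2e
print("\U00010000".encode("utf-8").hex(" "))  # f0 90 80 80
print(cesu8_encode("\U00010000").hex(" "))    # ed a0 80 ed b0 80
```

In this model only model_sys_wcstombs reproduces the bytes originally passed
in, while both encoders decode back to the same wide string, which is the
trade-off behind the readdir()/open() question.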


