readdir() returns inaccessible name if file was created with invalid UTF-8
Corinna Vinschen
corinna-cygwin@cygwin.com
Thu Jul 24 14:08:49 GMT 2025
On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
> On 24.07.2025 at 12:30, Corinna Vinschen wrote:
> > What does that mean? Consider this UTF8 input string:
> >
> > 0xf0 0x90 0x80 0x2e
> >
> > mbstowcs: returns -1
> > sys_mbstowcs: f0f0 f090 f080 002e
> >
> > Let's convert it back to multibyte:
> >
> > sys_wcstombs: 0xf0 0x90 0x80 0x2e
> > wcstombs: 0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
> >
> > So while sys_wcstombs has special code converting the string back to its
> > original MB string, wcstombs converts to the CESU-8 representation.
> >
> > This is transparent. If we convert this CESU-8 string back to
> > wide-char, the resulting wide-char strings are the same:
> >
> > mbstowcs: f0f0 f090 f080 002e
> > sys_mbstowcs: f0f0 f090 f080 002e
> >
> > So the question here is, shall we keep the special case converting
> > private use area bytes back to their original byte encoding?
> >
> > Or shall we simply go along with CESU-8 when converting back to multibyte
> > to keep the string the same as with wcstombs?
> >
> > Exempt from this are the characters not valid in a DOS filename.
> > These will always be converted if we create wide-char filenames.
> Sounds like a fair solution with only minor glitches. Poor 4th byte but
> thanks a lot anyway.
> About the latter decision, if there's no strong bias otherwise, I'd prefer
> to drop special handling (but don't take my vote, I don't care so much about
> that).
Thanks for your input.
As another data point, we have to consider how sys_wcstombs is used.

wcstombs on a filename will be used by the application only, and only if
the filename is incoming application-level data or has been converted to
wide-char by the application itself.

sys_wcstombs will be used to generate a readable multibyte filename from
UTF-16 filenames read from the filesystem. So its major use in terms of
filenames is by readdir().
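
To make the two ways of converting back to multibyte concrete, here is a
minimal sketch in C. It is not the actual Cygwin code: demo_plain and
demo_restore are hypothetical names, and the rule that every wide char in
0xf000..0xf0ff stands for exactly one escaped byte is an assumption taken
from the example quoted above. Surrogate pairs and the characters not valid
in a DOS filename are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP code unit as UTF-8 and return the number of bytes written.
   Surrogate pairs are deliberately not handled in this sketch. */
static size_t put_utf8(uint16_t wc, unsigned char *out)
{
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        out[0] = (unsigned char)(0xc0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3f));
        return 2;
    }
    out[0] = (unsigned char)(0xe0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (wc & 0x3f));
    return 3;
}

/* Hypothetical stand-in for the wcstombs result: every code unit,
   including the private use area ones, is encoded as UTF-8.
   `out` must have room for 3 * n bytes. */
size_t demo_plain(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += put_utf8(in[i], out + len);
    return len;
}

/* Hypothetical stand-in for the special case: a code unit in
   0xf000..0xf0ff (assumed range) is turned back into the single
   byte stored in its low 8 bits; everything else is UTF-8 encoded. */
size_t demo_restore(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if ((in[i] & 0xff00) == 0xf000)
            out[len++] = (unsigned char)(in[i] & 0xff);
        else
            len += put_utf8(in[i], out + len);
    }
    return len;
}
```

Fed with the wide-char string f0f0 f090 f080 002e, demo_restore yields the
four original bytes 0xf0 0x90 0x80 0x2e, while demo_plain yields the ten
bytes 0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e.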
Knowing that, the question boils down to this:

Do we want readdir() to return the same name as given to open(), or is
CESU-8 sufficient?
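
For anyone who wants to check what a given system does, a small test along
these lines might help. It is only a sketch: readdir_returns_name is a
hypothetical helper, not an existing function, and it uses nothing beyond
plain POSIX calls.

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create NAME inside directory DIR, then scan DIR
   with readdir().  Returns 1 if readdir() reports exactly the bytes that
   were passed to open(), 0 if no entry matches, -1 on error. */
int readdir_returns_name(const char *dir, const char *name)
{
    char path[4096];
    int len = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (len < 0 || (size_t)len >= sizeof path)
        return -1;

    /* Create the file under the name given by the caller. */
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* Look for a directory entry with byte-identical name. */
    int found = 0;
    DIR *d = opendir(dir);
    if (d != NULL) {
        struct dirent *e;
        while ((e = readdir(d)) != NULL)
            if (strcmp(e->d_name, name) == 0)
                found = 1;
        closedir(d);
    } else {
        found = -1;
    }

    /* Clean up using the same name that was used for open(). */
    unlink(path);
    return found;
}
```

Calling readdir_returns_name("/tmp", "\xf0\x90\x80.") uses the byte sequence
from the example above. A return value of 1 means readdir() handed back the
name as given to open(); 0 means it handed back something else.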
Corinna