readdir() returns inaccessible name if file was created with invalid UTF-8

Corinna Vinschen corinna-cygwin@cygwin.com
Wed Jun 25 19:43:56 GMT 2025


On Jun 25 16:59, Christian Franke via Cygwin wrote:
> On Sep 15, 2024 19:47:11 +0200, Christian Franke wrote:
> > If a file name contains an invalid (truncated) UTF-8 sequence, open()
> > does not refuse to create the file. Later readdir() returns a different
> > name which could not be used to access the file.
> > 
> > Testcase with U+1F321 (Thermometer):
> > 
> > $ uname -r
> > 3.5.4-1.x86_64
> > 
> > $ printf $'\U0001F321' | od -A none -t x1
> >  f0 9f 8c a1
> > 
> > $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> > 
> > $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> > 
> > $ touch 'file3-'$'\xf0\x9f\x8c'
> > 
> > $ ls -1
> > ls: cannot access 'file2-.?ext': No such file or directory
> > ls: cannot access 'file3-': No such file or directory
> > 'file1-'$'\360\237\214\241''.ext'
> > file2-.?ext
> > file3-
> > 
> > 
> > Name mapping according to "fhandler_disk_file::readdir" strace lines:
> > 
> > "file1-\xF0\x9F\x8C\xA1.ext" -(open)-> L"file1-\xD83C\xDF21.ext"
> > -(readdir)->
> > "file1-\xF0\x9F\x8C\xA1.ext"
> > 
> > "file2-\xF0\x9f\x8C.ext" -(open)-> L"file2-\xD83C\xF02Eext" -(readdir)->
> > "file2-.\xE1\x9E\xB3ext"
> > 
> > "file3-\xF0\x9F\x8C" -(open)-> L"file3-\xD83C\xF000" -(readdir)->
> > "file3-"

I don't know exactly where this happens, but the input of the
conversion is invalid UTF-8 because it's missing the 4th byte.
There's no way to represent these filenames on Windows
filesystems storing filenames as UTF-16 values.
So the problem here is that the conversion somehow misses that
the 4th byte is missing or invalid. It just plods forward, converts
the leading three bytes into the matching high surrogate value, and
then stumbles over the conversion for the low surrogate.
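
To make the surrogate arithmetic concrete, here is a minimal Python sketch, not the actual Cygwin code. The function names are hypothetical. The lenient variant only models what the strace lines in the report suggest: the high surrogate is emitted after three bytes, and a bad 4th byte ends up as 0xF000 | byte (0xF000 if the string already ended). That mapping is an assumption read off the quoted output.

```python
def is_continuation(byte: int) -> bool:
    """True for a UTF-8 continuation byte (10xxxxxx)."""
    return 0x80 <= byte <= 0xBF


def utf16_units_strict(name: bytes) -> list:
    """UTF-8 -> UTF-16 code units; raises UnicodeDecodeError on bad input."""
    raw = name.decode("utf-8").encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def utf16_units_lenient(name: bytes) -> list:
    """Hypothetical model of the faulty path (ASCII and 4-byte sequences only)."""
    units = []
    i = 0
    while i < len(name):
        b0 = name[i]
        starts_four_byte = (
            0xF0 <= b0 <= 0xF4
            and i + 2 < len(name)
            and is_continuation(name[i + 1])
            and is_continuation(name[i + 2])
        )
        if not starts_four_byte:
            if b0 >= 0x80:
                raise NotImplementedError("outside the scope of this sketch")
            units.append(b0)  # plain ASCII passes through unchanged
            i += 1
            continue
        b1, b2 = name[i + 1], name[i + 2]
        # The first three bytes already fix the upper bits of the code point,
        # which is all the high surrogate needs.
        partial = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6)
        units.append(0xD800 + ((partial - 0x10000) >> 10))
        # Assumption: a missing 4th byte is seen as the NUL terminator (0).
        b3 = name[i + 3] if i + 3 < len(name) else 0
        if is_continuation(b3):
            units.append(0xDC00 | ((b2 & 0x0F) << 6) | (b3 & 0x3F))
        else:
            # Assumption inferred from the strace output: 0xF000 | byte.
            units.append(0xF000 | b3)
        i += 4
    return units
```

For the three names from the report, the lenient model gives D83C DF21 for file1, D83C F02E for file2 and D83C F000 for file3, which matches the L"..." strings in the strace lines. The strict variant rejects file2 and file3 outright.
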
It would be really helpful to have a simple testcase (STC) for this problem.
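
One possible shape for such a testcase, as a rough Python sketch rather than a definitive reproducer: it creates the three names from the report using byte paths, lists the directory, and collects every returned name that cannot be looked up again. The helper names are hypothetical, and it assumes the Python build passes byte paths through to open() and readdir() unchanged.

```python
import os
import tempfile

# The three names from the report: complete, truncated mid-name, truncated at end.
NAMES = [
    b"file1-\xf0\x9f\x8c\xa1.ext",
    b"file2-\xf0\x9f\x8c.ext",
    b"file3-\xf0\x9f\x8c",
]


def find_inaccessible(directory: bytes, names=NAMES) -> list:
    """Create the files, then return the listed names that lstat() cannot find."""
    for name in names:
        try:
            fd = os.open(os.path.join(directory, name),
                         os.O_CREAT | os.O_WRONLY, 0o644)
        except OSError:
            continue  # creation was refused, so there is nothing to look up
        os.close(fd)

    problems = []
    for entry in os.listdir(directory):  # bytes in, bytes out
        try:
            os.lstat(os.path.join(directory, entry))
        except OSError:
            problems.append(entry)  # listed, but not reachable by that name
    return problems


def run_testcase() -> list:
    """Run the check in a fresh temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        return find_inaccessible(os.fsencode(tmp))
```

Calling `run_testcase()` returns a list of byte strings. An empty list means every name returned by the directory listing could be accessed again; any entry in the list is a name that was returned but could not be accessed, which is the behaviour described in the report.
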
Thanks,
Corinna

