(If you can already answer the question, feel free to skip the details and jump to the last couple of paragraphs :) )
I'm on Ubuntu 12.04. I'm trying to resolve an old issue that I've posted about in the past (if you're curious: https://superuser.com/questions/339877/trouble-viewing-files-with-non-english-names-on-hard-disk/339895#339895). There is a known compatibility issue between Linux, Mac, HFS+ and Korean-named files, and I spent all day today trying to finally find some kind of workaround.
Basically, I've mounted my HFS+ drive on Linux. Normal ls and cd have trouble accessing the files, because their names are in Korean. So I wrote a C program to try to access these files at the lowest level, so I can be more sure that nothing is happening behind my back:
    // fragment from main(); needs <dirent.h>, <stdio.h> and <string.h>
    DIR* dp;
    struct dirent *ep;
    char* parent = "/media/external/Movies";
    dp = opendir( parent );
    if( dp != NULL )
    {
        while( (ep = readdir(dp)) != NULL )
        {
            // d_ino is an ino_t, so cast it to match the format specifier
            printf( "%lu %s %X\t", (unsigned long) ep->d_ino, ep->d_name, (unsigned) ep->d_type );
            // now print out the filenames in hex
            for( size_t i = 0; i != strlen( ep->d_name ); i++ )
            {
                printf( "0x%X ", (unsigned) (ep->d_name[i] & 0xff) );
            }
            printf("\n");
        }
        closedir(dp);
    }
    else
    {
        perror("Couldn't open the directory! ");
    }
Here's a sample of the output I get for this:
433949 밀양 4 0xEB 0xB0 0x80 0xEC 0x96 0x91
413680 박쥐 4 0xEB 0xB0 0x95 0xEC 0xA5 0x90
434033 박하사탕 4 0xEB 0xB0 0x95 0xED 0x95 0x98 0xEC 0x82 0xAC 0xED 0x83 0x95
So on the surface, it looks like readdir has no problem viewing the directory entries. The inode numbers are there, the entries are correctly marked as directories (4 means directory), and it appears that the filenames are stored as UTF-8, since those hexadecimal values are the correct UTF-8 bytes for the Korean filenames. But now if I try to opendir one of these directories (and I'll spell out the filename in hex to be extra careful that nothing's happening behind my back):
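    // "/" followed by the UTF-8 bytes of the first entry in the listing above
    char new_dirname[] = { '/', (char)0xEB, (char)0xB0, (char)0x80, (char)0xEC, (char)0x96, (char)0x91, '\0' };
    char final[ strlen(parent) + strlen(new_dirname) + 1 ];
    memcpy( final, parent, strlen(parent) );
    strcpy( final + strlen(parent), new_dirname ); // also writes the terminating '\0'
    dp = opendir( final ); // dp == NULL here!!!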
It is not able to open the directory. This befuddles me because if readdir was just reporting the raw bytes of the file name in the directory entry, and opendir was just taking my given filename and matching it against the correct directory entry, then I would've thought there should be no problem finding the inode and opening the directory. This seems to suggest that readdir is not being completely honest about the filenames.
Are the file names in the directory entries reported by readdir not what's actually on disk (i.e. are they being re-encoded)? If so, is there any way that I can either control how opendir and readdir encode names, or use some other system calls that work with raw bytes instead of encoding things behind my back? In general, I find it very confusing at what level encoding happens, and I'd appreciate any explanation or, better yet, a reference for understanding this! Thanks!
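One way to narrow this down, sketched here under assumptions: feed each name back to the kernel exactly as readdir returned it, with no retyping of bytes, and print errno for any lookup that fails. probe_entries is a made-up helper name, and lstat stands in for opendir so that regular files get checked as well.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: for each entry in `parent`, pass the exact bytes
 * returned by readdir() back to lstat() and report any failure.
 * Returns the number of entries that could not be looked up, or -1 if
 * `parent` itself could not be opened. */
int probe_entries(const char *parent)
{
    DIR *dp = opendir(parent);
    if (dp == NULL) {
        perror("opendir");
        return -1;
    }

    int failures = 0;
    struct dirent *ep;
    while ((ep = readdir(dp)) != NULL) {
        char path[4096];
        int n = snprintf(path, sizeof path, "%s/%s", parent, ep->d_name);
        if (n < 0 || (size_t)n >= sizeof path) {
            fprintf(stderr, "path too long, skipping: %s\n", ep->d_name);
            continue;
        }

        struct stat st;
        if (lstat(path, &st) != 0) {
            /* errno tells "no such entry" (ENOENT) apart from other errors */
            printf("%s: lookup failed: %s\n", ep->d_name, strerror(errno));
            failures++;
        }
    }
    closedir(dp);
    return failures;
}
```

Calling probe_entries("/media/external/Movies") from main would show whether every Korean name fails the same way or only some of them do, and with which error.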
1 Answer
opendir and readdir themselves work on bytes. They do not perform any reencoding.
Some filesystem drivers may impose constraints on the byte sequences. For example, HFS+ normalizes file names using a proprietary Unicode normalization scheme. I would expect the form returned by readdir to work when passed to opendir, however, so like the OP in the Ubuntu forum thread that jw013 mentioned, I suspect a bug in the HFS+ driver. It is not the only program that is tripped up by Hangul on HFS+. Even OS X seems to have trouble with Unicode normalization.
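To make the normalization point concrete: the bytes in the question's listing decode to precomposed Hangul syllables (0xEB 0xB0 0x80 is U+BC00), whereas HFS+ is documented as keeping names in a decomposed form, so the driver presumably converts between the two somewhere. The sketch below applies the standard Unicode arithmetic for splitting a precomposed syllable into conjoining jamo. The function names are made up for illustration, and this is not a reproduction of what the hfsplus driver actually does.

```c
#include <stdio.h>

/* Standard Unicode constants for algorithmic Hangul syllable decomposition. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,   /* 588 syllables per leading consonant */
    S_COUNT = 19 * N_COUNT         /* 11172 precomposed syllables in total */
};

/* Decompose one precomposed Hangul syllable into 2 or 3 conjoining jamo.
 * Writes the jamo code points to out[] and returns how many were written;
 * returns 0 if `syllable` is not a precomposed Hangul syllable. */
int decompose_hangul(unsigned syllable, unsigned out[3])
{
    if (syllable < S_BASE || syllable >= S_BASE + S_COUNT)
        return 0;

    unsigned s_index = syllable - S_BASE;
    out[0] = L_BASE + s_index / N_COUNT;              /* leading consonant */
    out[1] = V_BASE + (s_index % N_COUNT) / T_COUNT;  /* vowel */

    unsigned t = s_index % T_COUNT;                   /* trailing consonant, if any */
    if (t == 0)
        return 2;
    out[2] = T_BASE + t;
    return 3;
}

/* Encode a code point in the range U+0800..U+FFFF as 3 UTF-8 bytes. */
void utf8_encode3(unsigned cp, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
}
```

For U+BC00 this yields U+1106, U+1175 and U+11AF, which encode as 0xE1 0x84 0x86, 0xE1 0x85 0xB5 and 0xE1 0x86 0xAF. Passing a name spelled that way to opendir is one more thing to try, though nothing in this thread establishes whether the driver accepts it.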
-
Thanks for the answer. You say that some drivers are imposing constraints on the byte sequences. This is happening at a lower level than opendir and readdir, and so it seems like what you're saying is that the filename bytes in the struct dirents reported by readdir are not identical to what's actually on disk. Either that or the bytes that we give opendir are not the actual bytes that are checked on disk. In other words, something is still being done behind our backs, perhaps at the lower driver level. Am I right about this? – bhh1988, Jul 20, 2012 at 9:15
-
@bhh1988 Yes, something is done at the driver level, because the HFS+ filesystem doesn't accept arbitrary byte sequences and has a mandatory way of converting Unicode sequences into a canonical representation. It looks like the driver isn't doing this correctly, but I don't understand the details; I'm not familiar with HFS+. – Gilles 'SO- stop being evil', Jul 20, 2012 at 9:25
-
Gilles, I'd like to know how you were able to find out that opendir and readdir do not perform any reencoding? – bhh1988, Jul 22, 2012 at 13:03
-
@bhh1988 You can trace it through the C library source and the kernel code. If you run strace ls, you can directly start from the kernel entry point: the open syscall. The generic filesystem support code passes all bytes other than null and / along unmodified. It's only some filesystem drivers, including hfsplus, that transform file names. – Gilles 'SO- stop being evil', Jul 22, 2012 at 17:09
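The claim about bytes being passed along unmodified can be checked empirically on a filesystem that does not normalize names (ext4 or tmpfs, for example). The following is a minimal sketch, with name_round_trips being a made-up helper: it creates a file with a given name and then checks that readdir hands back an entry with exactly the same bytes.

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create an empty file called `name` inside the
 * existing directory `dir`, then scan the directory with readdir() and
 * check that an entry with exactly the same bytes comes back.
 * Returns 1 if found, 0 if not found, -1 on error. The file is removed
 * again before returning. */
int name_round_trips(const char *dir, const char *name)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    int found = -1;
    DIR *dp = opendir(dir);
    if (dp != NULL) {
        found = 0;
        struct dirent *ep;
        while ((ep = readdir(dp)) != NULL) {
            /* strcmp compares raw bytes; no encoding is involved */
            if (strcmp(ep->d_name, name) == 0) {
                found = 1;
                break;
            }
        }
        closedir(dp);
    }

    unlink(path);
    return found;
}
```

Running the same check inside a directory on the HFS+ mount would show whether the name that comes back differs from the name that went in, which is the transformation the driver is suspected of.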