I have accidentally deleted part of a folder (before stopping the rm command). However, the backup I restored was around 2 weeks old, and unfortunately I had renamed and restructured the directories between the time of the backup and the deletion. I have manually restored what I know was missing, but I'm not sure I managed to catch everything.
Is there a fast method of showing file differences that does not include their parent directories, only the file name and modification or creation date? For example, I have the file
data/output/test1/file1.mha
which I might have moved/renamed to
data/results/mhas/first_test/file1.mha
Using diff -rq did not work for this and is also rather slow. The directory has a size of around 2TB and a fairly large number of files, so checking the MD5 of every file is barely an option.
To clarify a little bit, after restoring the backup, I have:
/data_backup_restore/output/test1/file1.mha
and
/data/results/mhas/first_test/file1.mha
since the restored backup still uses the 'old' directory structure. I've changed it because it was a mess, but I haven't written down all changes/renames I've done, since there were a lot of them.
I would consider both of the above the same if filesize, modification date and filename match.
3 Answers
If I understand correctly, you want to compare the two directories recursively, but ignoring the directory structure: if you find two files in the two trees with the same filename, creation/modification time and size (you don't mention size, but I guess it will also be useful), treat them as the same, even if they are at different positions in the two directory trees.
If this is correct, you can create a list of files with size, time and filename like this:
ls -lR --time-style=long-iso /data/output/ | grep ^- | tr -s ' ' | cut -d' ' -f5- | sort -k 4 >files_output.txt
ls -lR --time-style=long-iso /data/results/ | grep ^- | tr -s ' ' | cut -d' ' -f5- | sort -k 4 >files_results.txt
And then compare the two lists, either with diff or some GUI like meld.
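For example, a plain diff of the two listings should already show the differences:
diff files_output.txt files_results.txt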
Details:
- Using --time-style=long-iso to avoid locale-specific peculiarities that might break the following pipes.
- grep ^- to only select the actual files and ignore directories and possibly other special files. Depending on your use case, you might want to add more here, e.g. symbolic links...
- tr -s ' ' will squeeze multiple consecutive spaces for the following cut to work correctly in all cases.
- cut keeps the columns beginning from column 5 (the file size).
- sort to make the comparison later work. -k 4 is not really necessary, as long as you are consistent in the two commands; -k 4 will sort by filename, which may be useful.
After you compare the two files and find differences, you will of course have to locate the files in the original directory tree; you can use find for this.
Update
Based on your comments, if you want to find the full paths for filenames that appear many times, you can do the following:
First get the list of files that are missing in your second directory, e.g. like this:
comm -2 -3 files_output.txt files_results.txt >missing_files.txt
Then, for each missing file use find to find the full path of the specific file:
# for each missing entry (size, date, time, name), look for files with the same
# name and size whose modification time falls within that minute
while read -r size date time name
do
    find . -name "$name" -size ${size}c -newermt "$date $time" ! -newermt "$date $time +0000 +1 minutes"
done <missing_files.txt
Now notice that this is just a simple example and not optimal: it calls find once per missing file, which can be slow if the directories are as big as you indicated. In that case you should try to optimize it somehow (e.g. make a list of all files similar to the ls -lR one but containing the full paths, and try to match that list against the list in missing_files.txt).
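For example, a rough sketch of that optimization, assuming GNU find (for -printf) and the missing_files.txt generated above; the search path and the name full_paths.txt are only illustrative, and the -printf format is chosen to match the long-iso columns of ls:
# one listing of the first tree containing size, date, time and the full path
find /data/output/ -type f -printf '%s %TY-%Tm-%Td %TH:%TM %p\n' >full_paths.txt
# keep the entries whose "size date time" prefix also occurs in missing_files.txt
cut -d' ' -f1-3 missing_files.txt | sort -u | grep -F -f - full_paths.txt
This matches only on size and timestamp, not on the filename, so it may occasionally print a few extra candidates that you can filter out by eye.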
- And what about same filename, same size (you do not compare sizes) but different content? – Romeo Ninov, Mar 6, 2023 at 10:38
- Definitely, this will not catch these. My understanding is that the OP did not want to compare content because of the big size: "The directory has a size of around 2TB and a fairly large number of files, so checking the MD5 of every file is barely an option". Even without doing MD5, comparing the content would be slow (that's why the diff -rq from the OP was also slow, since even with the -q it has to compare the whole files as long as they are identical, which they will mostly be in this use case). – gepa, Mar 6, 2023 at 10:43
- IMHO the situation is the opposite: hashing will work faster. And it will be applicable to binary files too. – Romeo Ninov, Mar 6, 2023 at 10:46
- Faster than diff or faster than ls? I doubt it will be faster than diff -q, which only compares whether files differ or not, but that will depend on the actual implementation of diff -q (which btw also compares binary files). But my point was that both diff and md5 would be too slow for the OP, since they would have to scan through the whole 2TB disk. – gepa, Mar 6, 2023 at 10:55
- This worked well and really fast; my only problem is that now I don't know where the missing files are located, since a lot of files have the same name. Is there a way to also print the full path to each file but ignore it while comparing the original outputs? – muffinname, Mar 6, 2023 at 12:55
To compare file contents, you could use the following commands:
find FolderA -type f -print0 | xargs -0 cksum > FoldA.cksum
find FolderB -type f -print0 | xargs -0 cksum > FoldB.cksum
You may sort the two files together. As the first two fields are checksum and size, you may ignore groups of two that have the same checksum and size; groups of one denote a file that is missing from one of the folders.
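For example, one way of doing that grouping, assuming the FoldA.cksum/FoldB.cksum files from above and that neither folder contains internal duplicates; unmatched.txt is only an illustrative name:
# keep only the "checksum size" fields of both lists, then print the entries occurring exactly once
cat FoldA.cksum FoldB.cksum | cut -d' ' -f1-2 | sort | uniq -u >unmatched.txt
# show the full cksum lines (and thus the paths) of those unmatched entries
grep -F -f unmatched.txt FoldA.cksum FoldB.cksum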
- IMHO CRC (the default for the above command) is quite a weak hash and you may see a good amount of collisions. – Romeo Ninov, Mar 6, 2023 at 12:42
- @RomeoNinov: It's easy enough to test. It would be interesting if the poster could do some comparative tests. I never saw checksum collisions, but I haven't used it extensively. – harrymc, Mar 6, 2023 at 13:32
- I've tried both this and the sha1sum version, but both seem to be far too slow and were hogging too many resources, so I had to stop them. However, it's 'only' 155k files in total, so a collision seems unlikely to me? I'm sorry if my original post made it seem like much more; it's just that most things I had tried took surprisingly much longer than ls and I didn't know how many files there were myself. – muffinname, Mar 6, 2023 at 17:07
One possible way is to use hashes:
cd /directory1
shopt -s globstar   # enable ** so the glob below also matches files in subdirectories
sha1sum * **/* >/tmp/sum
cd /directory2
sha1sum -c /tmp/sum
The odd construction **/* is to search in subdirectories (the globstar option must be enabled for this). This will generate hashes of the files in the first directory and check them against the second directory, with an indication of which files are OK and which have a missing or mismatching hash:
#a/aa: OK
rr: OK
zzz: FAILED
sha1sum: WARNING: 1 of 3 computed checksums did NOT match
P.S. Do not be afraid of using hash functions; they are quite fast.
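If the directory layout differs between the two trees, as in the question, a variation that ignores the paths completely could look like the following sketch (assuming bash and GNU coreutils; /tmp/sums1 and /tmp/sums2 are only illustrative names):
# hash every file in both trees; sha1sum output is "hash  path"
find /directory1 -type f -exec sha1sum {} + >/tmp/sums1
find /directory2 -type f -exec sha1sum {} + >/tmp/sums2
# list the hashes present only in the first tree, then show their paths there
comm -23 <(cut -d' ' -f1 /tmp/sums1 | sort -u) <(cut -d' ' -f1 /tmp/sums2 | sort -u) | grep -F -f - /tmp/sums1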
- Thanks, something like this should work. I assume sha1sum is faster than the cksum suggested by harrymc? – muffinname, Mar 6, 2023 at 12:47
- @muffinname, I am not sure, but AFAIK the default hash of cksum is CRC, which creates a higher probability of collisions (different files with the same hash). W/o tests I am 99% sure cksum with CRC is faster than sha1sum :) – Romeo Ninov, Mar 6, 2023 at 12:52
- ... rsync -navi, and while it was really fast, I could not get it to compare just the files while ignoring the path to the file
- ... file1 with file? And do you want to check if the file exists in the second directory, or if the same file exists in the second directory?