I have accidentally deleted part of a folder (before stopping the rm command). However, the backup I restored was around 2 weeks old, and unfortunately I had renamed and restructured the directories between the time of the backup and the deletion. I have manually restored what I know was missing, but I'm not sure I managed to catch everything.
Is there a fast method of showing file differences that does not include their parent directories, only the file name and modification or creation date? For example, I have the file
data/output/test1/file1.mha
which I might have moved/renamed to
data/results/mhas/first_test/file1.mha
Using diff -rq did not work for this and is also rather slow. The directory has a size of around 2TB and a fairly large number of files, so checking the MD5 of every file is barely an option.
To clarify a little bit, after restoring the backup, I have:
/data_backup_restore/output/test1/file1.mha
and
/data/results/mhas/first_test/file1.mha
since the restored backup still uses the 'old' directory structure. I've changed it because it was a mess, but I haven't written down all changes/renames I've done, since there were a lot of them.
I would consider both of the above the same if filesize, modification date and filename match.
3 Answers
If I understand correctly, you want to compare the two directories recursively, but ignoring the directory structure: if you find two files in the two trees with the same filename, creation/modification time and size (you don't mention size, but I guess it will also be useful), treat them as the same, even if they are at different positions in the two directory trees.
If this is correct, you can create a list of files with size, time and filename like this:
ls -lR --time-style=long-iso /data/output/ | grep ^- | tr -s ' ' | cut -d' ' -f5- | sort -k 4 >files_output.txt
ls -lR --time-style=long-iso /data/results/ | grep ^- | tr -s ' ' | cut -d' ' -f5- | sort -k 4 >files_results.txt
And then compare the two lists, either with diff or some GUI like meld.
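For example, a plain diff of the two listings should already show the differences:
diff files_output.txt files_results.txt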
Details:
- Using --time-style=long-iso to avoid locale-specific peculiarities that might break the following pipes.
- grep ^- to only select the actual files and ignore directories and possibly other special files. Depending on your use case, you might want to add more here, e.g. symbolic links...
- tr -s ' ' will squeeze multiple consecutive spaces for the following cut to work correctly in all cases.
- cut keeps the columns beginning from column 5 (the file size).
- sort to make the comparison later work. -k 4 is not really necessary, as long as you are consistent in the two commands; -k 4 will sort by filename, which may be useful.
After you compare the two files and find differences, you will of course have to locate the files in the original directory tree; you can use find for this.
Update
Based on your comments, if you want to find the full paths for filenames that appear many times, you can do the following:
First get the list of files that are missing in your second directory, e.g. like this:
comm -2 -3 files_output.txt files_results.txt >missing_files.txt
Then, for each missing file use find to find the full path of the specific file:
# for each missing entry (size, date, time, name), look for files with the same
# name and size whose modification time falls within that minute
while read -r size date time name
do
    find . -name "$name" -size ${size}c -newermt "$date $time" ! -newermt "$date $time +0000 +1 minutes"
done <missing_files.txt
Now notice that this is just a simple example and not optimal: it calls find once per missing file, which can be slow if the directories are as big as you indicated. In that case you should try to optimize it somehow (e.g. make a list of all files similar to the ls -lR one but containing the full paths, and try to match that list against the list in missing_files.txt).
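For example, a rough sketch of that optimization, assuming GNU find (for -printf) and the missing_files.txt generated above; the search path and the name full_paths.txt are only illustrative, and the -printf format is chosen to match the long-iso columns of ls:
# one listing of the first tree containing size, date, time and the full path
find /data/output/ -type f -printf '%s %TY-%Tm-%Td %TH:%TM %p\n' >full_paths.txt
# keep the entries whose "size date time" prefix also occurs in missing_files.txt
cut -d' ' -f1-3 missing_files.txt | sort -u | grep -F -f - full_paths.txt
This matches only on size and timestamp, not on the filename, so it may occasionally print a few extra candidates that you can filter out by eye.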
- And what about same filename, same size (you do not compare sizes) but different content? – Romeo Ninov, Mar 6, 2023 at 10:38
- Definitely, this will not catch these. My understanding is that the OP did not want to compare content because of the big size: "The directory has a size of around 2TB and a fairly large number of files, so checking the MD5 of every file is barely an option". Even without doing MD5, comparing the content would be slow (that's why the diff -rq from the OP was also slow, since even with the -q it has to compare the whole files as long as they are identical, which they will mostly be in this use case). – gepa, Mar 6, 2023 at 10:43
- IMHO the situation is the opposite: hashing will work faster. And it will be applicable to binary files too. – Romeo Ninov, Mar 6, 2023 at 10:46
- Faster than diff or faster than ls? I doubt it will be faster than diff -q, which only compares whether files differ or not, but that will depend on the actual implementation of diff -q (which btw also compares binary files). But my point was that both diff and md5 would be too slow for the OP, since they would have to scan through the whole 2TB disk. – gepa, Mar 6, 2023 at 10:55
- This worked well and really fast; my only problem is that now I don't know where the missing files are located, since a lot of files have the same name. Is there a way to also print the full path to each file but ignore it while comparing the original outputs? – muffinname, Mar 6, 2023 at 12:55
To compare file contents, you could use the following commands:
find FolderA -type f -print0 | xargs -0 cksum > FoldA.cksum
find FolderB -type f -print0 | xargs -0 cksum > FoldB.cksum
You may sort the two files together. As the first two fields are checksum and size, you may ignore groups of two that have the same checksum and size; groups of one denote a file that is missing from one of the folders.
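For example, one way of doing that grouping, assuming the FoldA.cksum/FoldB.cksum files from above and that neither folder contains internal duplicates; unmatched.txt is only an illustrative name:
# keep only the "checksum size" fields of both lists, then print the entries occurring exactly once
cat FoldA.cksum FoldB.cksum | cut -d' ' -f1-2 | sort | uniq -u >unmatched.txt
# show the full cksum lines (and thus the paths) of those unmatched entries
grep -F -f unmatched.txt FoldA.cksum FoldB.cksum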
- IMHO CRC (the default for the above command) is quite a weak hash and you may see a good amount of collisions. – Romeo Ninov, Mar 6, 2023 at 12:42
- @RomeoNinov: It's easy enough to test. It would be interesting if the poster could do some comparative tests. I never saw checksum collisions, but I haven't used it extensively. – harrymc, Mar 6, 2023 at 13:32
- I've tried both this and the sha1sum version, but both seem to be far too slow and were hogging too many resources, so I had to stop them. However, it's 'only' 155k files in total, so a collision seems unlikely to me? I'm sorry if my original post made it seem like much more; it's just that most things I had tried took surprisingly much longer than ls and I didn't know how many files there were myself. – muffinname, Mar 6, 2023 at 17:07
One possible way is to use hashes:
cd /directory1
shopt -s globstar   # enable ** so the glob below also matches files in subdirectories
sha1sum * **/* >/tmp/sum
cd /directory2
sha1sum -c /tmp/sum
The odd construction **/* is to search in subdirectories (the globstar option must be enabled for this). This will generate hashes of the files in the first directory and check them against the second directory, with an indication of which files are OK and which have a missing or mismatching hash:
#a/aa: OK
rr: OK
zzz: FAILED
sha1sum: WARNING: 1 of 3 computed checksums did NOT match
P.S. Do not be afraid of using hash functions; they are quite fast.
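If the directory layout differs between the two trees, as in the question, a variation that ignores the paths completely could look like the following sketch (assuming bash and GNU coreutils; /tmp/sums1 and /tmp/sums2 are only illustrative names):
# hash every file in both trees; sha1sum output is "hash  path"
find /directory1 -type f -exec sha1sum {} + >/tmp/sums1
find /directory2 -type f -exec sha1sum {} + >/tmp/sums2
# list the hashes present only in the first tree, then show their paths there
comm -23 <(cut -d' ' -f1 /tmp/sums1 | sort -u) <(cut -d' ' -f1 /tmp/sums2 | sort -u) | grep -F -f - /tmp/sums1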
- Thanks, something like this should work. I assume sha1sum is faster than the cksum suggested by harrymc? – muffinname, Mar 6, 2023 at 12:47
- @muffinname, I am not sure, but AFAIK the default hash of cksum is CRC, which creates a higher probability of collisions (different files with the same hash). W/o tests I am 99% sure cksum with CRC is faster than sha1sum :) – Romeo Ninov, Mar 6, 2023 at 12:52
- ... rsync -navi, and while it was really fast, I could not get it to compare just the files while ignoring the path to the file
- ... file1 with file? And do you want to check if the file exists in the second directory, or if the same file exists in the second directory?