I have a directory with multiple image files, and some of them are identical but have different names. I need to remove the duplicates using only a bash script, with no external tools. I'm a beginner in Linux. I tried a nested for loop to compare md5 sums and remove a file depending on the result, but something is wrong with the syntax and it doesn't work. Any help?
What I've tried is:
for i in directory_path; do
    sum1='find $i -type f -iname "*.jpg" -exec md5sum '{}' \;'
    for j in directory_path; do
        sum2='find $j -type f -iname "*.jpg" -exec md5sum '{}' \;'
        if test $sum1=$sum2 ; then rm $j ; fi
    done
done
I get: test: too many arguments
- Please also include any error messages you get in your question. – terdon ♦ Commented Nov 24, 2013 at 17:16
- Why can't you use external tools like fdupes? @terdon's answer is amazing, but it really highlights why using a good tool is the way to go if possible. If it's some kind of dedicated hardware or server, you may still be able to access it over a network, etc. from a machine that does have tools like fdupes available. – Joe Commented Nov 30, 2013 at 8:36
2 Answers
There are quite a few problems in your script.
First, in order to assign the result of a command to a variable you need to enclose it either in backticks (`command`) or, preferably, $(command). You have it in single quotes ('command'), which, instead of assigning the result of your command to your variable, assigns the command itself as a string. Therefore, your test is actually:

$ echo "test $sum1=$sum2"
test find $i -type f -iname "*.jpg" -exec md5sum {} \;=find $j -type f -iname "*.jpg" -exec md5sum {} \;

That is also where your "too many arguments" error comes from: test receives every word of those unquoted strings as a separate argument instead of two hashes.
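To see the difference between the two kinds of assignment, here is a minimal sketch (not from the original answer, just an illustration):

```shell
# Single quotes store the literal text of the command, not its output:
cmd='echo hello'
echo "$cmd"        # prints: echo hello

# Command substitution runs the command and captures its output:
out=$(echo hello)
echo "$out"        # prints: hello
```

The same applies to backticks, but $(...) nests more cleanly and is easier to read.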
The next issue is that the command md5sum returns more than just the hash:

$ md5sum /etc/fstab
46f065563c9e88143fa6fb4d3e42a252  /etc/fstab
You only want to compare the first field, so you should parse the md5sum output by passing it through a command that only prints the first field:

find $i -type f -iname "*.jpg" -exec md5sum '{}' \; | cut -f 1 -d ' '

or

find $i -type f -iname "*.jpg" -exec md5sum '{}' \; | awk '{print 1ドル}'
Also, the find command will return many matches, not just one, and each of those matches will be duplicated by the second find. This means that at some point you will be comparing the same file to itself, the md5sums will be identical, and you will end up deleting all your files (I ran this on a test dir containing a.jpg and b.jpg):

for i in $(find . -iname "*.jpg"); do
    for j in $(find . -iname "*.jpg"); do
        echo "i is: $i and j is: $j"
    done
done

i is: ./a.jpg and j is: ./a.jpg  ## BAD, will delete a.jpg
i is: ./a.jpg and j is: ./b.jpg
i is: ./b.jpg and j is: ./a.jpg
i is: ./b.jpg and j is: ./b.jpg  ## BAD, will delete b.jpg
You don't want to run for i in directory_path unless you are passing an array of directories. If all these files are in the same directory, you want to run for i in $(find directory_path -iname "*.jpg") to go through all the files.

It is a bad idea to use for loops with the output of find. You should use while loops or globbing:

find . -iname "*.jpg" | while read i; do [...] ; done

or, if all your files are in the same directory:

for i in *jpg; do [...]; done

Depending on your shell and the options you have set, you can use globbing even for files in subdirectories, but let's not get into that here.
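For the curious, here is a minimal sketch of that subdirectory globbing, assuming bash 4+ (it is not portable to other shells):

```shell
#!/usr/bin/env bash
# Match *.jpg in the current directory and all subdirectories
# without calling find (bash 4+ only).
shopt -s globstar nullglob   # ** recurses; nullglob makes empty matches expand to nothing

for i in ./**/*.jpg; do
    echo "found: $i"
done
```

Unlike the $(find ...) loop, this handles file names with spaces correctly, since each match stays one word.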
Finally, you should also quote your variables, else directory paths with spaces will break your script.

File names can contain spaces, newlines, backslashes and other weird characters; to deal with those correctly in a while loop you'll need to add some more options. What you want to write is something like:
find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' i; do
    find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' j; do
        if [ "$i" != "$j" ]
        then
            sum1=$(md5sum "$i" | cut -f 1 -d ' ')
            sum2=$(md5sum "$j" | cut -f 1 -d ' ')
            [ "$sum1" = "$sum2" ] && rm "$j"
        fi
    done
done
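Note that the nested loops above hash every file once per comparison. A single-pass variant hashes each file exactly once by remembering checksums in an associative array. This is a sketch (assuming bash 4+ for declare -A, and the same dir_path placeholder as above), not part of the original answer:

```shell
#!/usr/bin/env bash
# One pass over the files: hash each file once, remember the first
# file seen for every checksum, and delete any later duplicates.
declare -A seen    # checksum -> first file with that checksum

while IFS= read -r -d '' f; do
    sum=$(md5sum "$f" | cut -f 1 -d ' ')
    if [[ -n ${seen[$sum]} ]]; then
        rm -- "$f"          # same content as "${seen[$sum]}"
    else
        seen[$sum]=$f
    fi
done < <(find dir_path -type f -iname '*.jpg' -print0)
```

The process substitution (< <(find ...)) instead of a pipe keeps the while loop in the current shell, so the seen array is not lost in a subshell.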
An even simpler way would be:
find directory_path -name "*.jpg" -exec md5sum '{}' + |
perl -ane '$k{$F[0]}++; system("rm $F[1]") if $k{$F[0]}>1'
A better version that can deal with spaces in file names:
find directory_path -name "*.jpg" -exec md5sum '{}' + |
perl -ane '$k{$F[0]}++; system("rm \"@F[1 .. $#F]\"") if $k{$F[0]}>1'
This little Perl script will run through the results of the find command (i.e. the md5sum and file name). The -a option for perl splits input lines at whitespace and saves them in the F array, so $F[0] will be the md5sum and $F[1] the file name. The md5sum is saved in the hash k, and the script checks if the hash has already been seen (if $k{$F[0]}>1) and deletes the file if it has (system("rm $F[1]")).
While that will work, it will be very slow for large image collections, and you cannot choose which files to keep. There are many programs that handle this in a more elegant way, fdupes (covered in the other answer) being one of them.
- +1 for the Perl snippet. Really elegant! You can also use Perl's own unlink instead of making a system call. – Joseph R. Commented Nov 24, 2013 at 18:05
- @JosephR. thanks :). Had a bug though, it would fail for file names with spaces since only the first characters of a name up to the first space would be in $F[1]. Fixed it using array slices. As for unlink(), I know, but wanted to keep the perlisms to a minimum and the system call is easier to understand if you don't know Perl. – Commented Nov 24, 2013 at 18:11
There is a nifty program called fdupes that simplifies the whole process and prompts the user for deleting duplicates. I think it is worth checking:
$ fdupes --delete DIRECTORY_WITH_DUPLICATES
[1] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
[2] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1
Set 1 of 1, preserve files [1 - 2, all]: 1
[+] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
[-] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1
Basically, it prompted me for which file to keep, I typed 1, and it removed the second.
Other interesting options are:
-r --recurse
for every directory given follow subdirectories encountered within
-N --noprompt
when used together with --delete, preserve the first file in each set of duplicates and delete the others without prompting the user
From your example, you probably want to run it as:
fdupes --recurse --delete --noprompt DIRECTORY_WITH_DUPLICATES
See man fdupes for all the available options.