I have a directory with multiple image files, and some of them are identical but have different names. I need to remove the duplicates using only a bash script, with no external tools. I'm a beginner in Linux. I tried a nested for loop to compare md5 sums and remove a file depending on the result, but something is wrong with the syntax and it doesn't work. Any help?
What I've tried is:
for i in directory_path; do
    sum1='find $i -type f -iname "*.jpg" -exec md5sum '{}' \;'
    for j in directory_path; do
        sum2='find $j -type f -iname "*.jpg" -exec md5sum '{}' \;'
        if test $sum1=$sum2 ; then rm $j ; fi
    done
done
I get: test: too many arguments
- Please also include any error messages you get in your question. – terdon ♦ Commented Nov 24, 2013 at 17:16
- Why can't you use external tools like fdupes? @terdon's answer is amazing, but it really highlights why using a good tool is the way to go if possible. If it's some kind of dedicated hardware or server, you may still be able to access it over a network, etc. from a machine that does have tools like fdupes available. – Joe Commented Nov 30, 2013 at 8:36
2 Answers
There are quite a few problems in your script.
First, in order to assign the result of a command to a variable you need to enclose it either in backticks (`command`) or, preferably, $(command). You have it in single quotes ('command'), which, instead of assigning the result of your command to your variable, assigns the command itself as a string. Therefore, your test is actually:

$ echo "test $sum1=$sum2"
test find $i -type f -iname "*.jpg" -exec md5sum {} \;=find $j -type f -iname "*.jpg" -exec md5sum {} \;

That is also where your "too many arguments" error comes from: test receives every word of those unquoted strings as a separate argument instead of two hashes.
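To see the difference between the two kinds of assignment, here is a minimal sketch (not from the original answer, just an illustration):

```shell
# Single quotes store the literal text of the command, not its output:
cmd='echo hello'
echo "$cmd"        # prints: echo hello

# Command substitution runs the command and captures its output:
out=$(echo hello)
echo "$out"        # prints: hello
```

The same applies to backticks, but $(...) nests more cleanly and is easier to read.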
The next issue is that the command md5sum returns more than just the hash:

$ md5sum /etc/fstab
46f065563c9e88143fa6fb4d3e42a252  /etc/fstab
You only want to compare the first field, so you should parse the md5sum output by passing it through a command that only prints the first field:

find $i -type f -iname "*.jpg" -exec md5sum '{}' \; | cut -f 1 -d ' '

or

find $i -type f -iname "*.jpg" -exec md5sum '{}' \; | awk '{print 1ドル}'
Also, the find command will return many matches, not just one, and each of those matches will be duplicated by the second find. This means that at some point you will be comparing the same file to itself, the md5sums will be identical, and you will end up deleting all your files (I ran this on a test dir containing a.jpg and b.jpg):

for i in $(find . -iname "*.jpg"); do
    for j in $(find . -iname "*.jpg"); do
        echo "i is: $i and j is: $j"
    done
done

i is: ./a.jpg and j is: ./a.jpg  ## BAD, will delete a.jpg
i is: ./a.jpg and j is: ./b.jpg
i is: ./b.jpg and j is: ./a.jpg
i is: ./b.jpg and j is: ./b.jpg  ## BAD, will delete b.jpg
You don't want to run for i in directory_path unless you are passing an array of directories. If all these files are in the same directory, you want to run for i in $(find directory_path -iname "*.jpg") to go through all the files.

It is a bad idea to use for loops with the output of find. You should use while loops or globbing:

find . -iname "*.jpg" | while read i; do [...] ; done

or, if all your files are in the same directory:

for i in *jpg; do [...]; done

Depending on your shell and the options you have set, you can use globbing even for files in subdirectories, but let's not get into that here.
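For the curious, here is a minimal sketch of that subdirectory globbing, assuming bash 4+ (it is not portable to other shells):

```shell
#!/usr/bin/env bash
# Match *.jpg in the current directory and all subdirectories
# without calling find (bash 4+ only).
shopt -s globstar nullglob   # ** recurses; nullglob makes empty matches expand to nothing

for i in ./**/*.jpg; do
    echo "found: $i"
done
```

Unlike the $(find ...) loop, this handles file names with spaces correctly, since each match stays one word.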
Finally, you should also quote your variables, else directory paths with spaces will break your script.

File names can contain spaces, newlines, backslashes and other weird characters; to deal with those correctly in a while loop you'll need to add some more options. What you want to write is something like:
find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' i; do
    find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' j; do
        if [ "$i" != "$j" ]
        then
            sum1=$(md5sum "$i" | cut -f 1 -d ' ')
            sum2=$(md5sum "$j" | cut -f 1 -d ' ')
            [ "$sum1" = "$sum2" ] && rm "$j"
        fi
    done
done
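Note that the nested loops above hash every file once per comparison. A single-pass variant hashes each file exactly once by remembering checksums in an associative array. This is a sketch (assuming bash 4+ for declare -A, and the same dir_path placeholder as above), not part of the original answer:

```shell
#!/usr/bin/env bash
# One pass over the files: hash each file once, remember the first
# file seen for every checksum, and delete any later duplicates.
declare -A seen    # checksum -> first file with that checksum

while IFS= read -r -d '' f; do
    sum=$(md5sum "$f" | cut -f 1 -d ' ')
    if [[ -n ${seen[$sum]} ]]; then
        rm -- "$f"          # same content as "${seen[$sum]}"
    else
        seen[$sum]=$f
    fi
done < <(find dir_path -type f -iname '*.jpg' -print0)
```

The process substitution (< <(find ...)) instead of a pipe keeps the while loop in the current shell, so the seen array is not lost in a subshell.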
An even simpler way would be:
find directory_path -name "*.jpg" -exec md5sum '{}' + |
perl -ane '$k{$F[0]}++; system("rm $F[1]") if $k{$F[0]}>1'
A better version that can deal with spaces in file names:
find directory_path -name "*.jpg" -exec md5sum '{}' + |
perl -ane '$k{$F[0]}++; system("rm \"@F[1 .. $#F]\"") if $k{$F[0]}>1'
This little Perl script will run through the results of the find command (i.e. the md5sum and file name). The -a option for perl splits input lines at whitespace and saves them in the F array, so $F[0] will be the md5sum and $F[1] the file name. The md5sum is saved in the hash k, and the script checks if the hash has already been seen (if $k{$F[0]}>1) and deletes the file if it has (system("rm $F[1]")).
While that will work, it will be very slow for large image collections, and you cannot choose which files to keep. There are many programs that handle this in a more elegant way, fdupes (covered in the other answer) being one of them.
- +1 for the Perl snippet. Really elegant! You can also use Perl's own unlink instead of making a system call. – Joseph R. Commented Nov 24, 2013 at 18:05
- @JosephR. thanks :). Had a bug though, it would fail for file names with spaces since only the first characters of a name up to the first space would be in $F[1]. Fixed it using array slices. As for unlink(), I know, but wanted to keep the perlisms to a minimum and the system call is easier to understand if you don't know Perl. – Commented Nov 24, 2013 at 18:11
There is a nifty program called fdupes that simplifies the whole process and prompts the user for deleting duplicates. I think it is worth checking:
$ fdupes --delete DIRECTORY_WITH_DUPLICATES
[1] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
[2] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1
Set 1 of 1, preserve files [1 - 2, all]: 1
[+] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
[-] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1
Basically, it prompted me for which file to keep, I typed 1, and it removed the second.
Other interesting options are:
-r --recurse
for every directory given follow subdirectories encountered within
-N --noprompt
when used together with --delete, preserve the first file in each set of duplicates and delete the others without prompting the user
From your example, you probably want to run it as:
fdupes --recurse --delete --noprompt DIRECTORY_WITH_DUPLICATES
See man fdupes for all the available options.