Find duplicate files in Linux

Question 1

This code finds duplicate files in Linux. I assume that MD5 hash will return unique hash number. What can I improve in this code?

#!/usr/bin/python;
import sys
import subprocess
import os
import hashlib
FIND_CMD = "find . -name \*.* -print"
BUFFER = 65536
#Get path from the user
root_path = sys.argv[1]
#Checks that the user sent path
if not root_path:
 print("Error: No file specified.Please try again ")
 sys.exit(1)
#Chage working path
os.chdir(root_path)
#Get all the possible paths under the given directory
dirty_out = subprocess.run(FIND_CMD.split(" "), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
file_list = dirty_out.stdout.decode("utf-8").splitlines()
#Find files with the same size and add to dic
same_size = dict()
for file in file_list:
 path = root_path + file[1:]
 if os.path.isdir(path):
 continue
 if os.path.exists(path):
 file_size = os.path.getsize(path)
 #Add the path to dictionary by size
 if file_size not in same_size:
 same_size[file_size] = list([path])
 else:
 same_size[file_size].append(path)
#Find files with the same size and hash code
file_signatures = dict()
for size in same_size:
 if len (same_size[size]) > 1:
 i = 0
 while i < len (same_size[size]):
 # Hash file content with read buffer
 md5 = hashlib.md5()
 path = same_size[size][i]
 md5 = hashlib.md5()
 with open(path, 'rb') as f:
 while True:
 data = f.read(BUFFER)
 if not data:
 break
 md5.update(data)
 md5_sig = md5.hexdigest()
 # Add to dictionary only files with the same size and hash
 if md5_sig not in file_signatures:
 file_signatures[md5_sig] = list([path])
 else:
 file_signatures[md5_sig].append(path)
 i=i+1
#Prints the path of all the duplicate files separated with , 
for sig in file_signatures:
 if len(file_signatures[sig]) > 1:
 print("{0}\n".format(",".join(file_signatures[sig])))

Question 2

If you're not writing this for coding practice, you ought to be aware of the rdfind program, which does the same thing (and more).

Question 3

Please be careful with the indentation because it looks like there are different number of spaces for different indentation depths and it looks very weird.

Minor improvements:

Repetition

 md5 = hashlib.md5()
 path = same_size[size][i]
 md5 = hashlib.md5()

I am almost sure there is no reason to repeat md5 = hashlib.md5() twice.

"Native" iteration

while i < len (same_size[size]):
 # Hash file content with read buffer
 md5 = hashlib.md5()
 path = same_size[size][i]
 ...
 i=i+1

You are iterating over all the files with the given size:

for path in same_size[size]:
 ...

In general in Python you can use for x in list_ to iterate over all items in any collection.

Zero length automatically skipped

if len (same_size[size]) > 1:

If you ask python to iterate over a zero length collection nothing happens, so you can remove this line as it does nothing but increases indentation and line count.

edit: No @miracle173 correctly noted that this is an efficiency gain to skip files of which there is only one of a certain size so it cannot be skipped.

Default dict to simplify logic

 if file_size not in same_size:
 same_size[file_size] = list([path])
 else:
 same_size[file_size].append(path)

Not tested, but using a defaultdict you can avoid the branch and only use append.

Same size?

Are you sure you need to calculate also the size in order to check if two files are equal? Isn't the hash enough?

Question 4

"Same size" If there is only one file of a given size, this file cannot have a duplicate and you don't have to calculate the hash at all. This can save a lot of disk reads.

Question 5

if len (same_size[size]) > 1 does not mean to skip zero length lists of files of a given size but lists of length 1, so for this size exists only one file ans so there is no duplicate. so this if-clause cannot be skipped.

Question 6

@miracle173 You are correct, answer edited

Question 7

The execution time of the program is determined by number of disk reads it has to make. So avoiding unnecessary disk reads by checking first if there is a file with the same size is a good thing. Maybe it makes sense, at least for large files, to calculate the hash only for a small part of the file. e.g. for the first BUFFER bytes, and you can avoid reading more of a file if no other file starts with the same BUFFER bytes. If it is possible to read blocks at any position of a file without reading all the blocks preceding this block, you should also use blocks not only from the beginning.
```
FIND_CMD = "find . -name \*.* -print"
```
I don't understand why the first * is escaped by a \ but the 2nd * is not.
Python has a way to list all files of a directory, so it is not necessary to exectute a shell command:

os.walk(top, topdown=True, onerror=None, followlinks=False)

Generate the file names in a directory tree by walking the tree either top-down or bottom-up.
The following code will not behave as you might expect:
```
#Get path from the user
root_path = sys.argv[1]
#Checks that the user sent path
if not root_path:
 print("Error: No file specified.Please try again ")
 sys.exit(1)`
```
If no argument is supplied to the program then sys.argv[1] is not defined and an error is raised. So print("Error: No file specified.Please try again ") will never be printed.

Instead of

if os.path.isdir(path):
 continue
if os.path.exists(path):
 ...

you can do

if not os.path.isdir(path) and os.path.exists(path):
 ...

and save an ugly continue.

and a break used here

while True:
 data = f.read(BUFFER)
 if not data:
 break
 md5.update(data)

can be avoided this way

data = f.read(BUFFER)
while data:
 md5.update(data)
 data = f.read(BUFFER)

But this is a matter of taste, of course.

What happens with os.chdir(root_path), if root_path does not exist? You should not trust user input. You will get an exception. You should write an appropriate error message. Also for other OS-functions you should it may make sense to handle an error appropriate.
```
 i = 0
 while i < len (same_size[size]):
 ...
 i=i+1
```
isn't the way how such loops are handled in python.
```
for i in range(len (same_size[size])):
 ....
```
is the right way. In this special case Caridorc showed you an even better way.

In Unix like sytems error messages should be printed to stderr and not to stdout. So ths is not the proper way to do this:

if not root_path:
 print("Error: No file specified.Please try again ")
 sys.exit(1)

A possible way to achieve this is

if not root_path:
 sys.exit("Error: No file specified.Please try again ")

Your indentation is mixed. This is allowed but you should use alway the same indentation, preferable four spaces.

Caridorc CaridorcCaridorc 28k7 gold badges54 silver badges137 bronze badges · Accepted Answer · 2018-08-27 18:33:59Z

Please be careful with the indentation because it looks like there are different number of spaces for different indentation depths and it looks very weird.

Minor improvements:

Repetition

 md5 = hashlib.md5()
 path = same_size[size][i]
 md5 = hashlib.md5()

I am almost sure there is no reason to repeat md5 = hashlib.md5() twice.

"Native" iteration

while i < len (same_size[size]):
 # Hash file content with read buffer
 md5 = hashlib.md5()
 path = same_size[size][i]
 ...
 i=i+1

You are iterating over all the files with the given size:

for path in same_size[size]:
 ...

In general in Python you can use for x in list_ to iterate over all items in any collection.

Zero length automatically skipped

if len (same_size[size]) > 1:

If you ask python to iterate over a zero length collection nothing happens, so you can remove this line as it does nothing but increases indentation and line count.

edit: No @miracle173 correctly noted that this is an efficiency gain to skip files of which there is only one of a certain size so it cannot be skipped.

Default dict to simplify logic

 if file_size not in same_size:
 same_size[file_size] = list([path])
 else:
 same_size[file_size].append(path)

Not tested, but using a defaultdict you can avoid the branch and only use append.

Same size?

Are you sure you need to calculate also the size in order to check if two files are equal? Isn't the hash enough?

"Same size" If there is only one file of a given size, this file cannot have a duplicate and you don't have to calculate the hash at all. This can save a lot of disk reads.
if len (same_size[size]) > 1 does not mean to skip zero length lists of files of a given size but lists of length 1, so for this size exists only one file ans so there is no duplicate. so this if-clause cannot be skipped.

Stack Exchange Network

Find duplicate files in Linux

2 Answers 2

Repetition

"Native" iteration

Zero length automatically skipped

Default dict to simplify logic

Same size?

`os.walk(top, topdown=True, onerror=None, followlinks=False)`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find duplicate files in Linux

2 Answers 2

Repetition

"Native" iteration

Zero length automatically skipped

Default dict to simplify logic

Same size?

os.walk(top, topdown=True, onerror=None, followlinks=False)

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`os.walk(top, topdown=True, onerror=None, followlinks=False)`