This code finds duplicate files in Linux. I assume that MD5 hash will return unique hash number. What can I improve in this code?
#!/usr/bin/python;
import sys
import subprocess
import os
import hashlib
FIND_CMD = "find . -name \*.* -print"
BUFFER = 65536
#Get path from the user
root_path = sys.argv[1]
#Checks that the user sent path
if not root_path:
print("Error: No file specified.Please try again ")
sys.exit(1)
#Chage working path
os.chdir(root_path)
#Get all the possible paths under the given directory
dirty_out = subprocess.run(FIND_CMD.split(" "), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
file_list = dirty_out.stdout.decode("utf-8").splitlines()
#Find files with the same size and add to dic
same_size = dict()
for file in file_list:
path = root_path + file[1:]
if os.path.isdir(path):
continue
if os.path.exists(path):
file_size = os.path.getsize(path)
#Add the path to dictionary by size
if file_size not in same_size:
same_size[file_size] = list([path])
else:
same_size[file_size].append(path)
#Find files with the same size and hash code
file_signatures = dict()
for size in same_size:
if len (same_size[size]) > 1:
i = 0
while i < len (same_size[size]):
# Hash file content with read buffer
md5 = hashlib.md5()
path = same_size[size][i]
md5 = hashlib.md5()
with open(path, 'rb') as f:
while True:
data = f.read(BUFFER)
if not data:
break
md5.update(data)
md5_sig = md5.hexdigest()
# Add to dictionary only files with the same size and hash
if md5_sig not in file_signatures:
file_signatures[md5_sig] = list([path])
else:
file_signatures[md5_sig].append(path)
i=i+1
#Prints the path of all the duplicate files separated with ,
for sig in file_signatures:
if len(file_signatures[sig]) > 1:
print("{0}\n".format(",".join(file_signatures[sig])))
2 Answers 2
Please be careful with the indentation because it looks like there are different number of spaces for different indentation depths and it looks very weird.
Minor improvements:
Repetition
md5 = hashlib.md5()
path = same_size[size][i]
md5 = hashlib.md5()
I am almost sure there is no reason to repeat md5 = hashlib.md5()
twice.
"Native" iteration
while i < len (same_size[size]):
# Hash file content with read buffer
md5 = hashlib.md5()
path = same_size[size][i]
...
i=i+1
You are iterating over all the files with the given size:
for path in same_size[size]:
...
In general in Python you can use for x in list_
to iterate over all items in any collection.
Zero length automatically skipped
if len (same_size[size]) > 1:
If you ask python to iterate over a zero length collection nothing happens, so you can remove this line as it does nothing but increases indentation and line count.
edit: No @miracle173 correctly noted that this is an efficiency gain to skip files of which there is only one of a certain size so it cannot be skipped.
Default dict to simplify logic
if file_size not in same_size:
same_size[file_size] = list([path])
else:
same_size[file_size].append(path)
Not tested, but using a defaultdict
you can avoid the branch and only use append
.
Same size?
Are you sure you need to calculate also the size in order to check if two files are equal? Isn't the hash enough?
-
3\$\begingroup\$ "Same size" If there is only one file of a given size, this file cannot have a duplicate and you don't have to calculate the hash at all. This can save a lot of disk reads. \$\endgroup\$miracle173– miracle1732018年08月28日 07:25:06 +00:00Commented Aug 28, 2018 at 7:25
-
1\$\begingroup\$
if len (same_size[size]) > 1
does not mean to skip zero length lists of files of a given size but lists of length 1, so for this size exists only one file ans so there is no duplicate. so this if-clause cannot be skipped. \$\endgroup\$miracle173– miracle1732018年08月28日 21:30:55 +00:00Commented Aug 28, 2018 at 21:30 -
\$\begingroup\$ @miracle173 You are correct, answer edited \$\endgroup\$Caridorc– Caridorc2018年09月01日 10:53:06 +00:00Commented Sep 1, 2018 at 10:53
The execution time of the program is determined by number of disk reads it has to make. So avoiding unnecessary disk reads by checking first if there is a file with the same size is a good thing. Maybe it makes sense, at least for large files, to calculate the hash only for a small part of the file. e.g. for the first BUFFER bytes, and you can avoid reading more of a file if no other file starts with the same BUFFER bytes. If it is possible to read blocks at any position of a file without reading all the blocks preceding this block, you should also use blocks not only from the beginning.
FIND_CMD = "find . -name \*.* -print"
I don't understand why the first
*
is escaped by a\
but the 2nd*
is not.Python has a way to list all files of a directory, so it is not necessary to exectute a shell command:
os.walk(top, topdown=True, onerror=None, followlinks=False)
Generate the file names in a directory tree by walking the tree either top-down or bottom-up.
The following code will not behave as you might expect:
#Get path from the user root_path = sys.argv[1] #Checks that the user sent path if not root_path: print("Error: No file specified.Please try again ") sys.exit(1)`
If no argument is supplied to the program then
sys.argv[1]
is not defined and an error is raised. Soprint("Error: No file specified.Please try again ")
will never be printed.Instead of
if os.path.isdir(path): continue if os.path.exists(path): ...
you can do
if not os.path.isdir(path) and os.path.exists(path): ...
and save an ugly
continue
.and a break used here
while True: data = f.read(BUFFER) if not data: break md5.update(data)
can be avoided this way
data = f.read(BUFFER) while data: md5.update(data) data = f.read(BUFFER)
But this is a matter of taste, of course.
What happens with
os.chdir(root_path)
, if root_path does not exist? You should not trust user input. You will get an exception. You should write an appropriate error message. Also for other OS-functions you should it may make sense to handle an error appropriate.i = 0 while i < len (same_size[size]): ... i=i+1
isn't the way how such loops are handled in python.
for i in range(len (same_size[size])): ....
is the right way. In this special case Caridorc showed you an even better way.
In Unix like sytems error messages should be printed to
stderr
and not tostdout
. So ths is not the proper way to do this:if not root_path: print("Error: No file specified.Please try again ") sys.exit(1)
A possible way to achieve this is
if not root_path: sys.exit("Error: No file specified.Please try again ")
- Your indentation is mixed. This is allowed but you should use alway the same indentation, preferable four spaces.
Explore related questions
See similar questions with these tags.
rdfind
program, which does the same thing (and more). \$\endgroup\$