I'm a photographer who does a lot of backups. Over the years I've ended up with a lot of hard drives. I've now bought a NAS and copied all my pictures onto one 3 TB RAID 1 volume using rsync. According to my script, about 1 TB of those files are duplicates. That comes from making multiple backups before deleting files from my laptop, and from being very messy. I do have a backup of all those files on the old hard drives, but it would be a pain if my script messed things up.
Can you please have a look at my duplicate-finder script and tell me whether you think I can run it? I tried it on a test folder and it seems OK, but I don't want to mess things up on the NAS.
The script has three steps in three files. In this first part I find all image and metadata files and put them into a shelve database (datenbank) with their file size as the key.
In case it matters: it's a Synology 713+ with an ext3 or ext4 filesystem.
import os
import shelve

datenbank = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step1"), flag='c', protocol=None, writeback=False)

#path_to_search = os.path.join(os.path.dirname(__file__), "test")
path_to_search = "/volume1/backup_2tb_wd/"

file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]

walker = os.walk(path_to_search)

counter = 0
for dirpath, dirnames, filenames in walker:
    if filenames:
        for filename in filenames:
            counter += 1
            print str(counter)
            for file_ext in file_exts:
                if file_ext in filename:
                    filepath = os.path.join(dirpath, filename)
                    filesize = str(os.path.getsize(filepath))
                    if not filesize in datenbank:
                        datenbank[filesize] = []
                    tmp = datenbank[filesize]
                    if filepath not in tmp:
                        tmp.append(filepath)
                        datenbank[filesize] = tmp
                        datenbank.sync()

print "done"
datenbank.close()
This is the second part. Now I drop all file sizes that have only one file in their list, and create another shelve database with the MD5 hash as the key and a list of files as the value.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step1"), flag='c', protocol=None, writeback=False)
datenbank_step2 = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step2"), flag='c', protocol=None, writeback=False)

counter = 0
space = 0

def md5Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.md5()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for filesize in datenbank:
    filepaths = datenbank[filesize]
    filepath_count = len(filepaths)
    if filepath_count > 1:
        counter += filepath_count - 1
        space += (filepath_count - 1) * int(filesize)
        for filepath in filepaths:
            print counter
            checksum = md5Checksum(filepath)
            if checksum not in datenbank_step2:
                datenbank_step2[checksum] = []
            temp = datenbank_step2[checksum]
            if filepath not in temp:
                temp.append(filepath)
                datenbank_step2[checksum] = temp

print counter
print str(space)

datenbank_step2.sync()
datenbank_step2.close()

print "done"
And finally the most dangerous part. For every MD5 key I retrieve the file list and compute an additional SHA-1 checksum. If the SHA-1 also matches the first file's, I delete every other file in that list and replace each deleted file with a hard link to the first one.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step2"), flag='c', protocol=None, writeback=False)

def sha1Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.sha1()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for hashvalue in datenbank:
    switch = True
    for path in datenbank[hashvalue]:
        if switch:
            original = path
            original_checksum = sha1Checksum(path)
            switch = False
        else:
            if sha1Checksum(path) == original_checksum:
                os.unlink(path)
                os.link(original, path)
                print "delete: ", path

print "done"
Comment: "Python 2.7.5, right?" – root-11, Jun 24, 2013
1 Answer
Your code is risky because you use a weak checksum (MD5; see the Wikipedia article for further reading). Since an error here would be devastating, please use SHA-256 instead.
Let me quote this:
I strongly question your use of MD5. You should be at least using SHA1. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.
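For reference, switching the hashing over really is just a matter of asking hashlib for a different constructor. A minimal sketch (the function name and the 8 KiB block size are my own choices, not taken from your script):
import hashlib

def sha256_of_file(filepath, blocksize=8192):
    # read in blocks so large RAW files never have to fit in memory
    digest = hashlib.sha256()
    with open(filepath, 'rb') as fh:
        while True:
            block = fh.read(blocksize)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()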
Second, I have added an "inspection loop" to the main code so that it creates a CSV file you can play around with to check what the code will do (I checked the CSV data using a pivot table in Excel).
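If you would rather inspect that CSV without Excel, a small sketch along these lines (it assumes the column names written by writeCSVfile below) groups the rows by checksum:
import csv
from collections import defaultdict

def duplicate_groups(csvpath):
    # collect every filepath per sha256; groups with more than one entry
    # are exactly the files that would be replaced by hard links
    groups = defaultdict(list)
    with open(csvpath) as f:
        for row in csv.DictReader(f):
            groups[row['sha256']].append(row['filepath'])
    return dict((h, paths) for h, paths in groups.items() if len(paths) > 1)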
So in summary I have rewritten your code as follows (Python 2.7.4):
import os
import os.path
import hashlib
import csv
"""
Recipe:
1. We identify all files on your system and store those with the wanted
extensions in a table with this structure:
sha256 | filename.ext | keep | link | size | filepath
----------+--------------+-------+-------+------+------
23eadf3ed | summer.jpg | True | False | 1234 | /volume1/backup_2tb_wd/randomStuff/
23eadf3ed | summer.jpg | False | False | 1234 | /volume1/backup_2tb_wd/Stuff/
23eadf3ed | summer.jpg | False | False | 1234 | /volume1/backup_2tb_wd/Holiday/
To spot a link: os.path.islink('path+filename') # returns True if link.
To get filesize: os.path.getsize(join(root, name)) # returns bytes as integer.
Why links? Because os.link doesn't like soft link. The hard links will
survive, but any soft links will leave you in a mess.
Then we select 1 record from the distinct list of sha256s and update the
value of the column "keep" to True. To make sure that we do not catch a
symlink, we check that it is not a link.
2. Now we cycle through the records: every row whose sha256 already has a
   kept original (keep is False and link is False) is deleted and replaced
   with a hard link to that original.
3. Now I would like to know how much space you saved, so we create a summary
   of the bytes freed by the relinking.
"""
def hashfile(afile, blocksize=2*1024*1024):  # read 2 MB at a time
    with open(afile, 'rb') as f:
        buf = [1]
        shasum = hashlib.sha256()
        while len(buf) > 0:
            buf = f.read(blocksize)
            shasum.update(buf)
    return str(shasum.hexdigest())  # hashlib.sha256('foo').hexdigest()

def convert_to_a_lowercase_set(alist):
    for item in alist:
        alist[alist.index(item)] = item.lower()
    aset = set(alist)
    return aset
def get_the_data(path_to_search, file_exts):
    file_exts = convert_to_a_lowercase_set(file_exts)
    data = []
    shas = set()
    for root, dirs, files in os.walk(path_to_search):
        for name in files:
            # use the real extension so 4-letter extensions like "tiff" match too
            if os.path.splitext(name)[1][1:].lower() in file_exts:
                filepath = os.path.join(root, name)
                filename = name
                link = os.path.islink(filepath)  # returns True or False
                if link == False:
                    size = os.path.getsize(filepath)  # returns int
                    sha256 = hashfile(filepath)  # returns hexadecimal string
                    if sha256 not in shas:
                        shas.add(sha256)
                        keep = True   # we keep the first found original file.
                    else:
                        keep = False  # we overwrite soft links with hard links.
                else:
                    size = 0
                    sha256 = 'aaaaaaaaaaaaaaaaaaa'  # placeholder checksum for symlinks
                    keep = False
                data.append((sha256, filename, keep, link, size, filepath))  #! order matters!
    return data
def writeCSVfile(data, datafile):
    with open(datafile, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(('sha256', 'filename', 'keep', 'link', 'size', 'filepath'))
        writer.writerows(data)

def spaceSaved(data):
    return sum([row[4] for row in data if row[2] == False])
def relinkDuplicateFiles(data):
    sha256s = (row for row in data if row[2] == True)  # the kept rows: one per unique sha256
    for sha in sha256s:
        original_file = sha[5]
        redundant_copies = [row[5] for row in data if row[0] == sha[0] and row[2] == False and row[3] == False]
        for record in redundant_copies:
            os.remove(record)
            os.link(original_file, record)
def main():
    # (0) Loading your starting values.
    path_to_search = r'/volume1/backup_2tb_wd/'
    datafile = path_to_search + 'data.csv'
    file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]

    # (1) Get the data.
    print "getting the data...\nThis might take a while..."
    data = get_the_data(path_to_search, file_exts)

    # (2) Hard link duplicates instead of having redundant files.
    msg = """
    --------------------
    Data captured. Initiate relinking of redundant files...?
    Options:
    Press D + enter to view data file and exit
    Press N + enter to exit
    Press Y + enter to clean up...
    --------------------
    Choice: """

    # (3) Providing a panic button...
    while True:
        print msg
        response = raw_input("Response: ")
        if response == "D":
            print "writing CSV file..."
            writeCSVfile(data, datafile)
            print "file written: " + datafile
        elif response == "N":
            print "exiting...."
            data = None
            break
        elif response == "Y":
            print "relinking duplicate files..."
            relinkDuplicateFiles(data)
            print "space saved: " + str(spaceSaved(data)) + " bytes"
            break
        else:
            print "no such option. Retry: "

if __name__ == '__main__':
    main()
I'm sure you will recognise the code of the relinkDuplicateFiles() function, but beyond that there is little resemblance.
Test
I have tested the code on a test library on Ubuntu 13.04 with Python 2.7.4. The test was performed as follows: before running the Python script, I ran this in bash:
ls -liR
This lets me see the number of hard links right after the permissions (the 2 in the example below):
3541475 -rw-r--r-- 2 bjorn bjorn 64209 Jun 26 17:20 05hardlink.jpg
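The same link count is also available from Python through os.stat, in case you want to verify the result programmatically (a small sketch; the filename is just the example above):
import os

def link_count(filepath):
    # st_nlink is the same number that ls -l prints right after the permissions
    return os.stat(filepath).st_nlink

print(link_count('05hardlink.jpg'))  # prints 2 for the file above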
Bash before:
bjorn@EEEbox:~/ownCloud/Test$ ls -liR
.:
total 44
3541027 drwxr-xr-x 4 bjorn bjorn 4096 Jun 26 13:50 2001
3541474 drwxr-xr-x 2 bjorn bjorn 4096 Jun 26 17:25 2001b
3542165 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:35 data(after).csv
3542163 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:34 data(before).csv
3542168 -rw-rw-r-- 1 bjorn bjorn 8036 Jun 26 17:52 data.csv
3542164 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:27 data (org).csv
3542166 -rwxrw-r-- 1 bjorn bjorn 571 Jun 26 16:57 findhardlinks.sh
./2001:
total 944
3541401 -rw-r--r-- 1 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541320 -rw-r--r-- 1 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541261 -rw-r--r-- 1 bjorn bjorn 64209 Apr 23 18:10 05.jpg
3541234 -rw-r--r-- 1 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541454 -rw-r--r-- 1 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541694 -rw-r--r-- 1 bjorn bjorn 78251 Apr 23 18:10 08.jpg
3541393 -rw-r--r-- 1 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541737 -rw-r--r-- 1 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541790 -rw-r--r-- 1 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541086 -rw-r--r-- 1 bjorn bjorn 74453 Apr 23 18:11 12.jpg
3541028 drwxr-xr-x 3 bjorn bjorn 4096 Jun 26 17:26 2001
./2001/2001:
total 1216
3541920 -rw-r--r-- 1 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541854 -rw-r--r-- 1 bjorn bjorn 95391 Apr 23 18:10 01.jpg
3541415 -rw-r--r-- 1 bjorn bjorn 68238 Apr 23 18:11 02.jpg
3541196 -rw-r--r-- 1 bjorn bjorn 74282 Apr 23 18:11 03.jpg
3541834 -rw-r--r-- 1 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink4.jpg
3541871 -rw-r--r-- 1 bjorn bjorn 64209 Apr 23 18:10 05.jpg
3541461 -rw-r--r-- 1 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541560 -rw-r--r-- 1 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541670 -rw-r--r-- 1 bjorn bjorn 78251 Apr 23 18:11 08.jpg
3541441 -rw-r--r-- 1 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541863 -rw-r--r-- 1 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541836 -rw-r--r-- 1 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541841 -rw-r--r-- 1 bjorn bjorn 74453 Apr 23 18:10 12.jpg
./2001b:
total 312
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04hardlink.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541961 -rw-r--r-- 1 bjorn bjorn 1220 Jun 26 14:02 04.lnk
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink2.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink3.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink.jpg
3542167 lrwxrwxrwx 1 bjorn bjorn 14 Jun 26 17:16 04softlink.jpg -> ./2001b/04.jpg
3541475 -rw-r--r-- 2 bjorn bjorn 64209 Jun 26 17:20 05hardlink.jpg
3541475 -rw-r--r-- 2 bjorn bjorn 64209 Jun 26 17:20 05.jpg
So after running the script you can run the same bash command again:
ls -liR
and get...
Bash after:
bjorn@EEEbox:~/ownCloud/Test$ ls -liR
.:
total 44
3541027 drwxr-xr-x 4 bjorn bjorn 4096 Jun 26 18:04 2001
3541474 drwxr-xr-x 2 bjorn bjorn 4096 Jun 26 18:04 2001b
3542165 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:35 data(after).csv
3542163 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:34 data(before).csv
3542168 -rw-rw-r-- 1 bjorn bjorn 8036 Jun 26 17:52 data.csv
3542164 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:27 data (org).csv
3542166 -rwxrw-r-- 1 bjorn bjorn 571 Jun 26 16:57 findhardlinks.sh
./2001:
total 944
3541401 -rw-r--r-- 2 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05.jpg
3541234 -rw-r--r-- 2 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541454 -rw-r--r-- 2 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541694 -rw-r--r-- 2 bjorn bjorn 78251 Apr 23 18:10 08.jpg
3541393 -rw-r--r-- 2 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541737 -rw-r--r-- 2 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541790 -rw-r--r-- 2 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541086 -rw-r--r-- 2 bjorn bjorn 74453 Apr 23 18:11 12.jpg
3541028 drwxr-xr-x 3 bjorn bjorn 4096 Jun 26 18:04 2001
./2001/2001:
total 1216
3541401 -rw-r--r-- 2 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541854 -rw-r--r-- 1 bjorn bjorn 95391 Apr 23 18:10 01.jpg
3541415 -rw-r--r-- 1 bjorn bjorn 68238 Apr 23 18:11 02.jpg
3541196 -rw-r--r-- 1 bjorn bjorn 74282 Apr 23 18:11 03.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink4.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05.jpg
3541234 -rw-r--r-- 2 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541454 -rw-r--r-- 2 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541694 -rw-r--r-- 2 bjorn bjorn 78251 Apr 23 18:10 08.jpg
3541393 -rw-r--r-- 2 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541737 -rw-r--r-- 2 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541790 -rw-r--r-- 2 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541086 -rw-r--r-- 2 bjorn bjorn 74453 Apr 23 18:11 12.jpg
./2001b:
total 312
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04hardlink.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541961 -rw-r--r-- 1 bjorn bjorn 1220 Jun 26 14:02 04.lnk
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink2.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink3.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink.jpg
3542167 lrwxrwxrwx 1 bjorn bjorn 14 Jun 26 17:16 04softlink.jpg -> ./2001b/04.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05hardlink.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05.jpg
Which is what you wanted, right?
Python 3.4+ as a command-line tool
Edit 2015年09月06日:
As some time has passed, I chose to add another, slightly more generic answer to the problem: a Linux command-line tool for users who are not just looking for images, but for duplicate files in general.
If run as
:~$ python3.4 /path/to/directory/root/that/needs/cleanup
the code will:
- Find every file in the tree
- Keep all (unique) files
- Hard-link all duplicate files to the first found unique file.
- Delete duplicate copies.
Hard links have the advantage that the filesystem keeps track of them and only deletes the underlying file once no more hard links point at it. The only risk the user needs to keep in mind is that changing the file changes it for every linked name, so make a copy under a new name before editing any files.
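For example, before retouching a photo that has been deduplicated, copy it under a new name; the copy gets its own inode, so the other hard-linked names keep the original bytes (a sketch with made-up paths):
import shutil

# the copy is a brand new file, so editing it does not touch
# the hard-linked duplicates of the original
shutil.copy2('/volume1/photos/summer.jpg', '/volume1/photos/summer_edit.jpg')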
The core operations are:
def clean_up(root_path, dryrun=False, verbose=False):
    seen_files = {}
    for root, dirs, files in walk(root_path):
        for fname in files:
            fpath = path.join(root, fname)
            link = path.islink(fpath)
            if not link:
                s256 = sha256sum(fpath)
                if s256 not in seen_files:
                    seen_files[s256] = fpath        # we've found a new file!
                else:
                    old_pointer = fpath             # there's a new name for a known file.
                    new_pointer = seen_files[s256]  # save the space by hard-linking, but keep the name.
The full command-line tool looks like this:
import sys
import hashlib
from os import walk, remove, link, path

def clean_up(root_path, dryrun=False, verbose=False):
    stats = {'space saved': 0, 'files found': 0, 'space used': 0, 'dryrun': dryrun, 'verbose': verbose}
    seen_files = {}
    for root, dirs, files in walk(root_path):
        for fname in files:
            stats['files found'] += 1
            fpath = path.join(root, fname)
            link = path.islink(fpath)
            size = path.getsize(fpath)
            if not link:
                s256 = sha256sum(fpath)
                if s256 not in seen_files:
                    seen_files[s256] = fpath        # we've found a new file!
                    stats['space used'] += size
                else:
                    old_pointer = fpath             # there's a new name for a known file.
                    new_pointer = seen_files[s256]  # save the space by hard-linking, but keep the name.
                    stats['space saved'] += size
                    if not dryrun:
                        symlink(old_pointer, new_pointer)
                        if verbose:
                            print("relinked {} to {}".format(old_pointer, new_pointer))
            if verbose:
                if not link:
                    ftype = "file"
                else:
                    ftype = "link"
                print(ftype, fpath, size, s256 if not link else "")
    if verbose:
        for k, v in sorted(stats.items()):
            print("{}: {}".format(k, v))

def symlink(old, new):
    # despite the name, this replaces the duplicate (old) with a hard link to the kept file (new)
    remove(old)
    link(new, old)

def sha256sum(target, blocksize=2*1024*1024):
    with open(target, 'rb') as f:
        buf = [1]
        shasum = hashlib.sha256()
        while len(buf) > 0:
            buf = f.read(blocksize)
            shasum.update(buf)
    return str(shasum.hexdigest())

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: python3.4 {} <path> [--dryrun][--verbose]".format(sys.argv[0]))
        sys.exit(1)
    if not path.exists(sys.argv[1]) or path.isfile(sys.argv[1]):
        print("Can't find the supplied path: {}".format(sys.argv[1]))
        sys.exit(1)
    root_path = sys.argv[1]
    dryrun, verbose = False, False
    if "--dryrun" in sys.argv:
        dryrun = True
    if "--verbose" in sys.argv:
        verbose = True
    clean_up(root_path, dryrun, verbose)
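A sensible first run on the NAS would be a dry run with verbose output (cleanup.py is only a placeholder name for wherever you saved the script):
:~$ python3.4 cleanup.py /volume1/backup_2tb_wd --dryrun --verbose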