I am writing a module that will synchronize destination to be the same as source, but will only overwrite files in destination that differ from source as to not cause excessive writes, hopefully prolonging the life of the backup. I will probably add additional functionality on top of this later.
I am primarily looking for critique regarding the style and structure of this module. The things I don't like or am not sure about:
When should or shouldn't I make a class? Is this overkill for a class? A backup is in my head an event-like object which has some state (a source and destination), but then it really only has one thing it needs to do, synchronize the destination to match source. This is probably the main question here. I was taught that OOP is the greatest and should always use it, but it just seems wrong here. I also noticed most of the modules in the Python Standard Library just have various functions and don't really break things out into classes like I was expecting (i.e. Java).
There is one method sync and then a couple utility methods that are meant to be static (not attributes of an instance). Should these be moved out into a separate "utility" submodule for the project or can they stay?
I don't like how the sync method is very similar to the sync_helper method and there is probably a way to clean that up, but I will leave that as an exercise for myself. However, is this generally how things are done when structuring recursive function calls (i.e. set things up and then call recursively when things are "aligned").
Should I have just left the code from main.py in a main function in backup.py? Would that be cleaner than splitting it out into multiple files? The
if __name__ == '__main__':
doesn't really make sense here, since this file will only ever be executed as the main file.
main.py:
#!/usr/bin/python3
"""Script to make efficient backups."""
import argparse
import os
import sys
from backup import Backup
def main():
"""Main entry point for the script."""
# Get arguments
parser = argparse.ArgumentParser()
parser.add_argument('source', help='source location to backup from')
parser.add_argument('destination', help='destination location to backup to')
parser.add_argument('-d', '--debug', help='enable debug mode', action='store_true')
args = parser.parse_args()
if not os.path.exists(args.source):
print('Error: Source \'{}\' does not exist.'.format(args.source), file=sys.stderr)
return 1
if not os.path.exists(args.destination):
print('Error: Destination \'{}\' does not exist.'.format(args.destination), file=sys.stderr)
if args.debug:
print('Backing up from \'{}\' to \'{}\''.format(args.source, args.destination))
backup = Backup(args.source, args.destination)
backup.sync()
if __name__ == '__main__':
sys.exit(main())
backup.py:
"""Class that represents a backup event."""
import hashlib
import os
import shutil
class Backup:
def __init__(self, source, destination):
self.source = source
self.destination = destination
def sync(self):
"""Synchronizes root of source and destination paths."""
sroot = os.path.normpath(self.source)
droot = os.path.normpath(self.destination) + '/' + os.path.basename(sroot)
if os.path.isdir(sroot) and os.path.isdir(droot):
Backup.sync_helper(sroot, droot)
elif os.path.isfile(sroot) and os.path.isfile(droot):
if not Backup.compare(sroot, droot):
Backup.copy(sroot, droot)
else:
Backup.copy(sroot, droot)
def sync_helper(source, destination):
"""Synchronizes source and destination."""
slist = os.listdir(source)
dlist = os.listdir(destination)
for s in slist:
scurr = source + '/' + s
dcurr = destination + '/' + s
if os.path.isdir(scurr) and os.path.isdir(dcurr):
Backup.sync_helper(scurr, dcurr)
elif os.path.isfile(scurr) and os.path.isfile(dcurr):
if not Backup.compare(scurr, dcurr):
Backup.copy(scurr, dcurr)
else:
Backup.copy(scurr, dcurr)
for d in dlist:
if d not in slist:
Backup.remove(destination + '/' + d)
def copy(source, destination):
"""Copies source file, directory, or symlink to destination"""
if os.path.isdir(source):
shutil.copytree(source, destination, symlinks=True)
else:
shutil.copy2(source, destination)
def remove(path):
"""Removes file, directory, or symlink located at path"""
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.unlink(path)
def compare(source, destination):
"""Compares the SHA512 hash of source and destination."""
blocksize = 65536
shasher = hashlib.sha512()
dhasher = hashlib.sha512()
while open(source, 'rb') as sfile:
buf = sfile.read(blocksize)
while len(buf) > 0:
shasher.update(buf)
buf = sfile.read(blocksize)
while open(destination, 'rb') as dfile:
buf = dfile.read(blocksize)
while len(buf) > 0:
dhasher.update(buf)
buf = dfile.read(blocksize)
if shasher.digest() == dhasher.digest():
return True
else:
return False
-
4\$\begingroup\$ Are you trying to provide your own implementation of rsync? \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年10月12日 10:59:16 +00:00Commented Oct 12, 2016 at 10:59
-
\$\begingroup\$ @MathiasEttinger I actually didn't know rsync did that to be honest, thank you for that. However, I am trying to practice my Python, so either way I would've done this. \$\endgroup\$ruengemoog– ruengemoog2016年10月13日 04:48:39 +00:00Commented Oct 13, 2016 at 4:48
1 Answer 1
First, like @MatthiasEttinger suggested in the comments, have you had a look at rsync?
$ rsync -aAXEv --delete /path/to/source /path/to/backup/folder
seems to do exactly what you want. The flags (used or implied here) mean:
-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive recurse into directories
-l, --links copy symlinks as symlinks
-p, --perms preserve permissions
-t, --times preserve modification times
-g, --group preserve group
-o, --owner preserve owner (super-user only)
-D same as --devices --specials
--devices preserve device files (super-user only)
--specials preserve special files
-A, --acls preserve ACLs (implies -p)
-X, --xattrs preserve extended attributes
-E, --executability preserve executability
-v, --verbose increase verbosity
--delete delete extraneous files from dest dirs
Whenever you have a class method in Python without the self
keyword, which is not decorated with @staticmethod
, you are probably not using classes correctly.
The way it is right now, most of these methods have absolutely no link to the class, except that they are helper functions. You pass them the things they need to work on. However, you have all this information already in the class members!
First some easy fixes:
@staticmethod
def remove(path):
"""Removes file, directory, or symlink located at path"""
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.unlink(path)
As mentioned above, this method should be a staticmethod.
@staticmethod
def sha512(file_name):
blocksize = 65536
hasher = hashlib.sha512()
with open(file_name, 'rb') as hash_file:
buf = hash_file.read(blocksize)
while len(buf):
hasher.update(buf)
buf = hash_file.read(blocksize)
return hasher.digest()
def compare(self, source, destination):
"""Compares the SHA512 hash of source and destination."""
return self.sha512(source) == self.sha512(destination)
For the compare
function, I would define a helper function that computes the hash. Note that while open(...) as x:
is a SyntaxError
, what you want is with open(...) as x:
. You can also directly return the result of a comparison, no need for
cond = a == b
if cond:
return True
else:
return False
Now, I would add two things to __init__
:
Already get the normpath of source and destination, so we can later use
self.source
andself.destination
(You never use these variables again).Create the target backup directory if it does not exist
class Backup:
def __init__(self, source, destination):
self.source = os.path.normpath(source)
self.destination = os.path.join(os.path.normpath(
destination), os.path.basename(source))
if not os.path.isdir(self.destination):
print "Creating destination {} because it does not exist yet.".format(self.destination)
os.makedirs(self.destination)
Finally, I would completely change Backup.sync
. I would use os.walk
to walk through the source and process it step by step. For every directory level I would first create all directories missing in that directory. Then I would copy all files in that directory, skipping identical files (compared by the hash). Lastly, I would remove all files and folders which are not present anymore in the source.
def sync(self):
"""Synchronizes source to destination."""
for dir_path, dirnames, filenames in os.walk(self.source):
# make directory structure
self.sync_directories(dir_path, dirnames)
# compare and copy files
self.sync_files(dir_path, filenames)
# check if we have to remove files or folders
self.sync_deleted(dir_path, dirnames, filenames)
The helper functions are defined thusly:
def _dest_path(self, dir_path, name):
return os.path.join(self.destination, os.path.relpath(dir_path, self.source), name)
def sync_directories(self, dir_path, dirnames):
for dir_name in dirnames:
dest_dir_name = self._dest_path(dir_path, dir_name)
if not os.path.exists(dest_dir_name):
print "mkdir", dest_dir_name
os.mkdir(dest_dir_name)
def sync_files(self, dir_path, filenames):
for file_name in filenames:
source = os.path.join(dir_path, file_name)
destination = self._dest_path(dir_path, file_name)
if os.path.isfile(destination):
if self.compare(source, destination):
# print "skip", destination
continue
print "copy", destination
shutil.copy2(source, destination)
def sync_deleted(self, dir_path, dirnames, filenames):
source_directories = set(dirnames)
source_files = set(filenames)
for _, dest_directories, dest_filenames in os.walk(self._dest_path(dir_path, "")):
dest_directories = set(dest_directories)
dest_filenames = set(dest_filenames)
break
diff_directories = dest_directories - source_directories
diff_files = dest_filenames - source_files
for to_delete in diff_directories | diff_files:
to_delete = self._dest_path(dir_path, to_delete)
print "delete", to_delete
self.remove(to_delete)
Here the helper function _dest_path(self, dir_path, name)
just takes a source path and outputs the corresponding destination path (because dir_path
in every iteration of os.walk
contains an absolute path).
Regarding splitting the main method off, I don't think this is necessary. Just keep it in the same file and guard its execution with a if __name__ == "__main__":
guard, like you already did. This way it is all in one place. This code is not complex enough to split it into different files. But if you did that you should probably make it a proper module.
Remark: This code might operate slightly different than yours (not exactly sure). If you call
b = Backup("/home/<user_name>/Documents", "/tmp/backup")
b.sync()
afterwards your backup is in /tmp/backup/Documents/
.
Final code:
import hashlib
import os
import shutil
class Backup:
def __init__(self, source, destination):
self.source = os.path.normpath(source)
self.destination = os.path.join(os.path.normpath(
destination), os.path.basename(source))
if not os.path.isdir(self.destination):
print "Creating destination {} because it does not exist yet.".format(self.destination)
os.makedirs(self.destination)
def sync(self):
"""Synchronizes source to destination."""
for dir_path, dirnames, filenames in os.walk(self.source):
self.sync_directories(dir_path, dirnames)
self.sync_files(dir_path, filenames)
self.sync_deleted(dir_path, dirnames, filenames)
def _dest_path(self, dir_path, name):
return os.path.join(self.destination, os.path.relpath(dir_path, self.source), name)
def sync_directories(self, dir_path, dirnames):
"""Create all directories in dirnames not already present in destination"""
for dir_name in dirnames:
dest_dir_name = self._dest_path(dir_path, dir_name)
if not os.path.exists(dest_dir_name):
print "mkdir", dest_dir_name
os.mkdir(dest_dir_name)
def sync_files(self, dir_path, filenames):
"""Copy all files in filenames to destination
Skips already existing files with the same hash"""
for file_name in filenames:
source = os.path.join(dir_path, file_name)
destination = self._dest_path(dir_path, file_name)
if os.path.isfile(destination):
if self.compare(source, destination):
# print "skip", destination
continue
print "copy", destination
shutil.copy2(source, destination)
def sync_deleted(self, dir_path, dirnames, filenames):
"""Delete all files and folders from destination no longer present at source"""
source_directories = set(dirnames)
source_files = set(filenames)
for _, dest_directories, dest_filenames in os.walk(self._dest_path(dir_path, "")):
dest_directories = set(dest_directories)
dest_filenames = set(dest_filenames)
break
diff_directories = dest_directories - source_directories
diff_files = dest_filenames - source_files
for to_delete in diff_directories | diff_files:
to_delete = self._dest_path(dir_path, to_delete)
print "delete", to_delete
self.remove(to_delete)
@staticmethod
def remove(path):
"""Removes file, directory, or symlink located at path"""
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.unlink(path)
@staticmethod
def sha512(file_name):
blocksize = 65536
hasher = hashlib.sha512()
with open(file_name, 'rb') as hash_file:
buf = hash_file.read(blocksize)
while len(buf):
hasher.update(buf)
buf = hash_file.read(blocksize)
return hasher.digest()
def compare(self, source, destination):
"""Compares the SHA512 hash of source and destination."""
return self.sha512(source) == self.sha512(destination)
-
\$\begingroup\$ Thank you for the thorough response, I really appreciate it. That does look a lot cleaner than what I had (makes more sense as a class now too). However, one small point, I don't think this handles the case when the source is a single file, but that's a minor fix. I think you covered most of my questions, but what exactly did you mean by making it a proper module? \$\endgroup\$ruengemoog– ruengemoog2016年10月13日 04:47:04 +00:00Commented Oct 13, 2016 at 4:47
-
\$\begingroup\$ @ruengemoog Yes, this does not handle trying to back up a signle file. Should be easy to add an
if/else
clause tosync
, though. By a proper module I meant is actually a package. This is not something you need to worry about with this code, IMO. \$\endgroup\$Graipher– Graipher2016年10月13日 08:55:25 +00:00Commented Oct 13, 2016 at 8:55