4
\$\begingroup\$

I am writing a module that will synchronize destination to be the same as source, but will only overwrite files in destination that differ from source as to not cause excessive writes, hopefully prolonging the life of the backup. I will probably add additional functionality on top of this later.

I am primarily looking for critique regarding the style and structure of this module. The things I don't like or am not sure about:

  1. When should or shouldn't I make a class? Is this overkill for a class? A backup is in my head an event-like object which has some state (a source and destination), but then it really only has one thing it needs to do, synchronize the destination to match source. This is probably the main question here. I was taught that OOP is the greatest and should always use it, but it just seems wrong here. I also noticed most of the modules in the Python Standard Library just have various functions and don't really break things out into classes like I was expecting (i.e. Java).

  2. There is one method sync and then a couple utility methods that are meant to be static (not attributes of an instance). Should these be moved out into a separate "utility" submodule for the project or can they stay?

  3. I don't like how the sync method is very similar to the sync_helper method and there is probably a way to clean that up, but I will leave that as an exercise for myself. However, is this generally how things are done when structuring recursive function calls (i.e. set things up and then call recursively when things are "aligned").

  4. Should I have just left the code from main.py in a main function in backup.py? Would that be cleaner than splitting it out into multiple files? The if __name__ == '__main__': doesn't really make sense here, since this file will only ever be executed as the main file.

main.py:

#!/usr/bin/python3
"""Script to make efficient backups."""
import argparse
import os
import sys
from backup import Backup
def main():
 """Main entry point for the script."""
 # Get arguments
 parser = argparse.ArgumentParser()
 parser.add_argument('source', help='source location to backup from')
 parser.add_argument('destination', help='destination location to backup to')
 parser.add_argument('-d', '--debug', help='enable debug mode', action='store_true')
 args = parser.parse_args()
 if not os.path.exists(args.source):
 print('Error: Source \'{}\' does not exist.'.format(args.source), file=sys.stderr)
 return 1
 if not os.path.exists(args.destination):
 print('Error: Destination \'{}\' does not exist.'.format(args.destination), file=sys.stderr)
 if args.debug:
 print('Backing up from \'{}\' to \'{}\''.format(args.source, args.destination))
 backup = Backup(args.source, args.destination)
 backup.sync()
if __name__ == '__main__':
 sys.exit(main())

backup.py:

"""Class that represents a backup event."""
import hashlib
import os
import shutil
class Backup:
 def __init__(self, source, destination):
 self.source = source
 self.destination = destination
 def sync(self):
 """Synchronizes root of source and destination paths."""
 sroot = os.path.normpath(self.source)
 droot = os.path.normpath(self.destination) + '/' + os.path.basename(sroot)
 if os.path.isdir(sroot) and os.path.isdir(droot):
 Backup.sync_helper(sroot, droot)
 elif os.path.isfile(sroot) and os.path.isfile(droot):
 if not Backup.compare(sroot, droot):
 Backup.copy(sroot, droot)
 else:
 Backup.copy(sroot, droot)
 def sync_helper(source, destination):
 """Synchronizes source and destination."""
 slist = os.listdir(source)
 dlist = os.listdir(destination)
 for s in slist:
 scurr = source + '/' + s
 dcurr = destination + '/' + s
 if os.path.isdir(scurr) and os.path.isdir(dcurr):
 Backup.sync_helper(scurr, dcurr)
 elif os.path.isfile(scurr) and os.path.isfile(dcurr):
 if not Backup.compare(scurr, dcurr):
 Backup.copy(scurr, dcurr)
 else:
 Backup.copy(scurr, dcurr)
 for d in dlist:
 if d not in slist:
 Backup.remove(destination + '/' + d)
 def copy(source, destination):
 """Copies source file, directory, or symlink to destination"""
 if os.path.isdir(source):
 shutil.copytree(source, destination, symlinks=True)
 else:
 shutil.copy2(source, destination)
 def remove(path):
 """Removes file, directory, or symlink located at path"""
 if os.path.isdir(path):
 shutil.rmtree(path)
 else:
 os.unlink(path)
 def compare(source, destination):
 """Compares the SHA512 hash of source and destination."""
 blocksize = 65536
 shasher = hashlib.sha512()
 dhasher = hashlib.sha512()
 while open(source, 'rb') as sfile:
 buf = sfile.read(blocksize)
 while len(buf) > 0:
 shasher.update(buf)
 buf = sfile.read(blocksize)
 while open(destination, 'rb') as dfile:
 buf = dfile.read(blocksize)
 while len(buf) > 0:
 dhasher.update(buf)
 buf = dfile.read(blocksize)
 if shasher.digest() == dhasher.digest():
 return True
 else:
 return False
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Oct 12, 2016 at 4:05
\$\endgroup\$
2
  • 4
    \$\begingroup\$ Are you trying to provide your own implementation of rsync? \$\endgroup\$ Commented Oct 12, 2016 at 10:59
  • \$\begingroup\$ @MathiasEttinger I actually didn't know rsync did that to be honest, thank you for that. However, I am trying to practice my Python, so either way I would've done this. \$\endgroup\$ Commented Oct 13, 2016 at 4:48

1 Answer 1

4
\$\begingroup\$

First, like @MatthiasEttinger suggested in the comments, have you had a look at rsync?

$ rsync -aAXEv --delete /path/to/source /path/to/backup/folder

seems to do exactly what you want. The flags (used or implied here) mean:

-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive recurse into directories
-l, --links copy symlinks as symlinks
-p, --perms preserve permissions
-t, --times preserve modification times
-g, --group preserve group
-o, --owner preserve owner (super-user only)
-D same as --devices --specials
 --devices preserve device files (super-user only)
 --specials preserve special files
-A, --acls preserve ACLs (implies -p)
-X, --xattrs preserve extended attributes
-E, --executability preserve executability
-v, --verbose increase verbosity
 --delete delete extraneous files from dest dirs

Whenever you have a class method in Python without the self keyword, which is not decorated with @staticmethod, you are probably not using classes correctly.

The way it is right now, most of these methods have absolutely no link to the class, except that they are helper functions. You pass them the things they need to work on. However, you have all this information already in the class members!

First some easy fixes:

@staticmethod
def remove(path):
 """Removes file, directory, or symlink located at path"""
 if os.path.isdir(path):
 shutil.rmtree(path)
 else:
 os.unlink(path)

As mentioned above, this method should be a staticmethod.

@staticmethod
def sha512(file_name):
 blocksize = 65536
 hasher = hashlib.sha512()
 with open(file_name, 'rb') as hash_file:
 buf = hash_file.read(blocksize)
 while len(buf):
 hasher.update(buf)
 buf = hash_file.read(blocksize)
 return hasher.digest()
def compare(self, source, destination):
 """Compares the SHA512 hash of source and destination."""
 return self.sha512(source) == self.sha512(destination)

For the compare function, I would define a helper function that computes the hash. Note that while open(...) as x: is a SyntaxError, what you want is with open(...) as x:. You can also directly return the result of a comparison, no need for

cond = a == b
if cond:
 return True
else:
 return False

Now, I would add two things to __init__:

  1. Already get the normpath of source and destination, so we can later use self.source and self.destination (You never use these variables again).

  2. Create the target backup directory if it does not exist

class Backup:
 def __init__(self, source, destination):
 self.source = os.path.normpath(source)
 self.destination = os.path.join(os.path.normpath(
 destination), os.path.basename(source))
 if not os.path.isdir(self.destination):
 print "Creating destination {} because it does not exist yet.".format(self.destination)
 os.makedirs(self.destination)

Finally, I would completely change Backup.sync. I would use os.walk to walk through the source and process it step by step. For every directory level I would first create all directories missing in that directory. Then I would copy all files in that directory, skipping identical files (compared by the hash). Lastly, I would remove all files and folders which are not present anymore in the source.

def sync(self):
 """Synchronizes source to destination."""
 for dir_path, dirnames, filenames in os.walk(self.source):
 # make directory structure
 self.sync_directories(dir_path, dirnames)
 # compare and copy files
 self.sync_files(dir_path, filenames)
 # check if we have to remove files or folders
 self.sync_deleted(dir_path, dirnames, filenames)

The helper functions are defined thusly:

def _dest_path(self, dir_path, name):
 return os.path.join(self.destination, os.path.relpath(dir_path, self.source), name)
def sync_directories(self, dir_path, dirnames):
 for dir_name in dirnames:
 dest_dir_name = self._dest_path(dir_path, dir_name)
 if not os.path.exists(dest_dir_name):
 print "mkdir", dest_dir_name
 os.mkdir(dest_dir_name)
def sync_files(self, dir_path, filenames):
 for file_name in filenames:
 source = os.path.join(dir_path, file_name)
 destination = self._dest_path(dir_path, file_name)
 if os.path.isfile(destination):
 if self.compare(source, destination):
 # print "skip", destination
 continue
 print "copy", destination
 shutil.copy2(source, destination)
def sync_deleted(self, dir_path, dirnames, filenames):
 source_directories = set(dirnames)
 source_files = set(filenames)
 for _, dest_directories, dest_filenames in os.walk(self._dest_path(dir_path, "")):
 dest_directories = set(dest_directories)
 dest_filenames = set(dest_filenames)
 break
 diff_directories = dest_directories - source_directories
 diff_files = dest_filenames - source_files
 for to_delete in diff_directories | diff_files:
 to_delete = self._dest_path(dir_path, to_delete)
 print "delete", to_delete
 self.remove(to_delete)

Here the helper function _dest_path(self, dir_path, name) just takes a source path and outputs the corresponding destination path (because dir_path in every iteration of os.walkcontains an absolute path).

Regarding splitting the main method off, I don't think this is necessary. Just keep it in the same file and guard its execution with a if __name__ == "__main__": guard, like you already did. This way it is all in one place. This code is not complex enough to split it into different files. But if you did that you should probably make it a proper module.

Remark: This code might operate slightly different than yours (not exactly sure). If you call

b = Backup("/home/<user_name>/Documents", "/tmp/backup")
b.sync()

afterwards your backup is in /tmp/backup/Documents/.

Final code:

import hashlib
import os
import shutil
class Backup:
 def __init__(self, source, destination):
 self.source = os.path.normpath(source)
 self.destination = os.path.join(os.path.normpath(
 destination), os.path.basename(source))
 if not os.path.isdir(self.destination):
 print "Creating destination {} because it does not exist yet.".format(self.destination)
 os.makedirs(self.destination)
 def sync(self):
 """Synchronizes source to destination."""
 for dir_path, dirnames, filenames in os.walk(self.source):
 self.sync_directories(dir_path, dirnames)
 self.sync_files(dir_path, filenames)
 self.sync_deleted(dir_path, dirnames, filenames)
 def _dest_path(self, dir_path, name):
 return os.path.join(self.destination, os.path.relpath(dir_path, self.source), name)
 def sync_directories(self, dir_path, dirnames):
 """Create all directories in dirnames not already present in destination"""
 for dir_name in dirnames:
 dest_dir_name = self._dest_path(dir_path, dir_name)
 if not os.path.exists(dest_dir_name):
 print "mkdir", dest_dir_name
 os.mkdir(dest_dir_name)
 def sync_files(self, dir_path, filenames):
 """Copy all files in filenames to destination
 Skips already existing files with the same hash"""
 for file_name in filenames:
 source = os.path.join(dir_path, file_name)
 destination = self._dest_path(dir_path, file_name)
 if os.path.isfile(destination):
 if self.compare(source, destination):
 # print "skip", destination
 continue
 print "copy", destination
 shutil.copy2(source, destination)
 def sync_deleted(self, dir_path, dirnames, filenames):
 """Delete all files and folders from destination no longer present at source"""
 source_directories = set(dirnames)
 source_files = set(filenames)
 for _, dest_directories, dest_filenames in os.walk(self._dest_path(dir_path, "")):
 dest_directories = set(dest_directories)
 dest_filenames = set(dest_filenames)
 break
 diff_directories = dest_directories - source_directories
 diff_files = dest_filenames - source_files
 for to_delete in diff_directories | diff_files:
 to_delete = self._dest_path(dir_path, to_delete)
 print "delete", to_delete
 self.remove(to_delete)
 @staticmethod
 def remove(path):
 """Removes file, directory, or symlink located at path"""
 if os.path.isdir(path):
 shutil.rmtree(path)
 else:
 os.unlink(path)
 @staticmethod
 def sha512(file_name):
 blocksize = 65536
 hasher = hashlib.sha512()
 with open(file_name, 'rb') as hash_file:
 buf = hash_file.read(blocksize)
 while len(buf):
 hasher.update(buf)
 buf = hash_file.read(blocksize)
 return hasher.digest()
 def compare(self, source, destination):
 """Compares the SHA512 hash of source and destination."""
 return self.sha512(source) == self.sha512(destination)
answered Oct 12, 2016 at 10:28
\$\endgroup\$
2
  • \$\begingroup\$ Thank you for the thorough response, I really appreciate it. That does look a lot cleaner than what I had (makes more sense as a class now too). However, one small point, I don't think this handles the case when the source is a single file, but that's a minor fix. I think you covered most of my questions, but what exactly did you mean by making it a proper module? \$\endgroup\$ Commented Oct 13, 2016 at 4:47
  • \$\begingroup\$ @ruengemoog Yes, this does not handle trying to back up a signle file. Should be easy to add an if/else clause to sync, though. By a proper module I meant is actually a package. This is not something you need to worry about with this code, IMO. \$\endgroup\$ Commented Oct 13, 2016 at 8:55

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.