I'm an intern and have an ETL project I'm doing in python.
Part of it is to make sure after I move files I delete empty directories.
I wrote this code to clean up directories that are empty and in testing it works for me but I want to make sure it won't end up deleting folders full of files.
def folderCleanup(p_rootPath):
for r, dirs, files in os.walk(p_rootPath):
for folder in dirs:
if len(os.listdir(os.path.join(r, folder))) == 0: #Checks if the folders have anything in them
shutil.rmtree(os.path.join(r, folder))
elif hasFile(os.path.join(r, folder)): #Checks if a subfolder will have a file in it
pass
else:
shutil.rmtree(os.path.join(r, folder),ignore_errors=True)
#end if
#end for
#end for
#end def
def hasFile(p_path): #Returns true if the given directory or any of its children directories have a file in them
for r, d, f in os.walk(p_path):
if len(f) > 0:
if len(f) == 1 and f[0] == 'Thumbs.db':
continue
else:
return True
#end if
#end if
#end for
return False
#end def
I am just looking for some feedback and whether there are any potential problems with this code.
2 Answers 2
I didnt check the correctness of code, but...
Small tips on comments:
Don't write
#end if
. You have identation. Why do you need it?Comments for functions should look like
def hasFile(p_path):
"""Returns true if the given directory or any of its children directories have a file in them"""
for r, d, f in os.walk(p_path):
https://saddlebackcss.github.io/tutorials/python/basic/2016/01/14/python-basics-4
Methods should be called like folder_cleanup
and has_file
. See https://visualgit.readthedocs.io/en/latest/pages/naming_convention.html
Do not name variables like this for r, d, f in os.walk(p_path):
. Make names more meaningfull
Try to reduce nesting, may be using early return
I support all comments in xander27's answer, especially RE indentation vs #end
. With regards to naming, I'll add one additional comment - instead of giving a name to a variable you never use, just name it _
Your hasFile
implementation could be much simpler; what you really want is any
return any(len(file_list) > 0 and not (len(file_list) == 1 and file_list[0] == "Thumbs.db") for _, _, file_list in os.walk(p_path))
That's a little verbose, so I'll split it up into two functions.
def file_list_has_file(file_list):
return len(file_list) > 0 and not (
len(file_list) == 1 and file_list[0] == "Thumbs.db"
)
def has_file(p_path):
return any(
file_list_has_file(file_list)
for _, _, file_list in os.walk(p_path)
)
You can improve performance/correctness by modifying dirs
; from the documentation (emphasis mine):
When
topdown
isTrue
, the caller can modify thedirnames
list in-place, andwalk()
will only recurse into the subdirectories whose names remain indirnames
; this can be used to prune the search, impose a specific order of visiting, or even to informwalk()
about directories the caller creates or renames before it resumeswalk()
again.
Once you shutil.rmtree
a directory, you should take it out of the list so that os.walk
doesn't bother continuing to recurse. This is fine if you keep your current approach, but I think I have a better option in my next section.
When you look even closer at folderCleanup
and hasFile
, you realize that they're basically the same function - one checks if there are any files in the directory, while the other checks if there are any folders in the directory and if so recurses (effectively, not literally). This is actually really weird when you realize that os.walk
already traverses all subdirectories; by doing it again in hasFiles
you're really just confusing things. Ultimately, a directory is empty if two things are true:
- It has no files (except
"Thumbs.db"
) and - It has no non-empty directories
We already have all of the information to determine whether or not this is true, and we can do so like this:
- Get the directories, bottom-up instead of top-down
- If the directory meets the requirements, keep track of it
- Once you've categorized all of them, clean house.
You don't necessarily have to get the full list of them before you clean things if you use a generator, like so:
def directory_has_files(files):
return len(files) > 0 and not (
len(files) == 1 and files[0] == "Thumbs.db"
)
def directory_has_nonempty_subdirectories(
path, subdirectories, cache
):
return any(
cache[os.path.join(path, subdirectory)]
for subdirectory in subdirectories
)
def find_empty_directories(root_path):
cache = defaultdict(lambda _: False)
for root, subdirectories, files in os.walk(
rootpath, topdown=False
):
has_files = directory_has_files(files)
if not subdirectories:
cache[root] = has_files
else:
cache[
root
] = directory_has_nonempty_subdirectories(
root, subdirectories, cache
)
if not cache[root]:
yield root
def remove_empty_directories(root_path):
for empty_directory in root_path:
shutil.rmtree(empty_directory, ignore_errors=True)
You may notice that I've used defaultdict
to make it a bit easier to find out if something is known to be empty. Otherwise, I'm taking advantage of the bottom-up approach to avoid having to calculate something repeatedly on the way down.
If you don't want to issue as many shutil.rmtree
commands, you could make it a bit more clever, and only report the highest-level empty directory. For example, you could do this at the end of find_empty_directories
to not report until you find a non-empty, and then do the children that are empty.
if cache[root]:
for subdirectory in subdirectories:
subpath = os.path.join(root, subdirectory)
if cache[subpath]:
yield subpath
Explore related questions
See similar questions with these tags.