Program to delete directories that do not contain a file

Question 1

I'm an intern and have an ETL project I'm doing in python.

Part of it is to make sure after I move files I delete empty directories.

I wrote this code to clean up directories that are empty and in testing it works for me but I want to make sure it won't end up deleting folders full of files.

def folderCleanup(p_rootPath): 
 for r, dirs, files in os.walk(p_rootPath):
 for folder in dirs:
 if len(os.listdir(os.path.join(r, folder))) == 0: #Checks if the folders have anything in them
 shutil.rmtree(os.path.join(r, folder))
 elif hasFile(os.path.join(r, folder)): #Checks if a subfolder will have a file in it
 pass
 else: 
 shutil.rmtree(os.path.join(r, folder),ignore_errors=True)
 #end if
 #end for
 #end for
#end def
def hasFile(p_path): #Returns true if the given directory or any of its children directories have a file in them
 for r, d, f in os.walk(p_path):
 if len(f) > 0:
 if len(f) == 1 and f[0] == 'Thumbs.db':
 continue
 else:
 return True
 #end if
 #end if
 #end for
 return False
#end def

I am just looking for some feedback and whether there are any potential problems with this code.

Question 2

I didnt check the correctness of code, but...

Small tips on comments:

Don't write #end if. You have identation. Why do you need it?
Comments for functions should look like

def hasFile(p_path):
 """Returns true if the given directory or any of its children directories have a file in them"""
 for r, d, f in os.walk(p_path):

https://saddlebackcss.github.io/tutorials/python/basic/2016/01/14/python-basics-4

Methods should be called like folder_cleanupand has_file. See https://visualgit.readthedocs.io/en/latest/pages/naming_convention.html

Do not name variables like this for r, d, f in os.walk(p_path):. Make names more meaningfull

Try to reduce nesting, may be using early return

Question 3

I support all comments in xander27's answer, especially RE indentation vs #end. With regards to naming, I'll add one additional comment - instead of giving a name to a variable you never use, just name it _

Your hasFile implementation could be much simpler; what you really want is any

return any(len(file_list) > 0 and not (len(file_list) == 1 and file_list[0] == "Thumbs.db") for _, _, file_list in os.walk(p_path))

That's a little verbose, so I'll split it up into two functions.

def file_list_has_file(file_list):
 return len(file_list) > 0 and not (
 len(file_list) == 1 and file_list[0] == "Thumbs.db"
 )
def has_file(p_path):
 return any(
 file_list_has_file(file_list)
 for _, _, file_list in os.walk(p_path)
 )

You can improve performance/correctness by modifying dirs; from the documentation (emphasis mine):

When topdown is True, the caller can modify the dirnames list in-place, and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.

Once you shutil.rmtree a directory, you should take it out of the list so that os.walk doesn't bother continuing to recurse. This is fine if you keep your current approach, but I think I have a better option in my next section.

When you look even closer at folderCleanup and hasFile, you realize that they're basically the same function - one checks if there are any files in the directory, while the other checks if there are any folders in the directory and if so recurses (effectively, not literally). This is actually really weird when you realize that os.walk already traverses all subdirectories; by doing it again in hasFiles you're really just confusing things. Ultimately, a directory is empty if two things are true:

It has no files (except "Thumbs.db") and
It has no non-empty directories

We already have all of the information to determine whether or not this is true, and we can do so like this:

Get the directories, bottom-up instead of top-down
If the directory meets the requirements, keep track of it
Once you've categorized all of them, clean house.

You don't necessarily have to get the full list of them before you clean things if you use a generator, like so:

def directory_has_files(files):
 return len(files) > 0 and not (
 len(files) == 1 and files[0] == "Thumbs.db"
 )
def directory_has_nonempty_subdirectories(
 path, subdirectories, cache
):
 return any(
 cache[os.path.join(path, subdirectory)]
 for subdirectory in subdirectories
 )
def find_empty_directories(root_path):
 cache = defaultdict(lambda _: False)
 for root, subdirectories, files in os.walk(
 rootpath, topdown=False
 ):
 has_files = directory_has_files(files)
 if not subdirectories:
 cache[root] = has_files
 else:
 cache[
 root
 ] = directory_has_nonempty_subdirectories(
 root, subdirectories, cache
 )
 if not cache[root]:
 yield root
def remove_empty_directories(root_path):
 for empty_directory in root_path:
 shutil.rmtree(empty_directory, ignore_errors=True)

You may notice that I've used defaultdict to make it a bit easier to find out if something is known to be empty. Otherwise, I'm taking advantage of the bottom-up approach to avoid having to calculate something repeatedly on the way down.

If you don't want to issue as many shutil.rmtree commands, you could make it a bit more clever, and only report the highest-level empty directory. For example, you could do this at the end of find_empty_directories to not report until you find a non-empty, and then do the children that are empty.

if cache[root]:
 for subdirectory in subdirectories:
 subpath = os.path.join(root, subdirectory) 
 if cache[subpath]:
 yield subpath

xander27 xander27 1561 bronze badge · Answer 1 · 2019-07-16 07:22:55Z

I didnt check the correctness of code, but...

Small tips on comments:

Don't write #end if. You have identation. Why do you need it?
Comments for functions should look like

def hasFile(p_path):
 """Returns true if the given directory or any of its children directories have a file in them"""
 for r, d, f in os.walk(p_path):

https://saddlebackcss.github.io/tutorials/python/basic/2016/01/14/python-basics-4

Methods should be called like folder_cleanupand has_file. See https://visualgit.readthedocs.io/en/latest/pages/naming_convention.html

Do not name variables like this for r, d, f in os.walk(p_path):. Make names more meaningfull

Try to reduce nesting, may be using early return

Dan Oberlam Dan Oberlam 8,0492 gold badges33 silver badges74 bronze badges · Answer 2 · 2019-09-06 15:51:57Z

I support all comments in xander27's answer, especially RE indentation vs #end. With regards to naming, I'll add one additional comment - instead of giving a name to a variable you never use, just name it _

Your hasFile implementation could be much simpler; what you really want is any

return any(len(file_list) > 0 and not (len(file_list) == 1 and file_list[0] == "Thumbs.db") for _, _, file_list in os.walk(p_path))

That's a little verbose, so I'll split it up into two functions.

def file_list_has_file(file_list):
 return len(file_list) > 0 and not (
 len(file_list) == 1 and file_list[0] == "Thumbs.db"
 )
def has_file(p_path):
 return any(
 file_list_has_file(file_list)
 for _, _, file_list in os.walk(p_path)
 )

You can improve performance/correctness by modifying dirs; from the documentation (emphasis mine):

When topdown is True, the caller can modify the dirnames list in-place, and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.

Once you shutil.rmtree a directory, you should take it out of the list so that os.walk doesn't bother continuing to recurse. This is fine if you keep your current approach, but I think I have a better option in my next section.

When you look even closer at folderCleanup and hasFile, you realize that they're basically the same function - one checks if there are any files in the directory, while the other checks if there are any folders in the directory and if so recurses (effectively, not literally). This is actually really weird when you realize that os.walk already traverses all subdirectories; by doing it again in hasFiles you're really just confusing things. Ultimately, a directory is empty if two things are true:

It has no files (except "Thumbs.db") and
It has no non-empty directories

We already have all of the information to determine whether or not this is true, and we can do so like this:

Get the directories, bottom-up instead of top-down
If the directory meets the requirements, keep track of it
Once you've categorized all of them, clean house.

You don't necessarily have to get the full list of them before you clean things if you use a generator, like so:

def directory_has_files(files):
 return len(files) > 0 and not (
 len(files) == 1 and files[0] == "Thumbs.db"
 )
def directory_has_nonempty_subdirectories(
 path, subdirectories, cache
):
 return any(
 cache[os.path.join(path, subdirectory)]
 for subdirectory in subdirectories
 )
def find_empty_directories(root_path):
 cache = defaultdict(lambda _: False)
 for root, subdirectories, files in os.walk(
 rootpath, topdown=False
 ):
 has_files = directory_has_files(files)
 if not subdirectories:
 cache[root] = has_files
 else:
 cache[
 root
 ] = directory_has_nonempty_subdirectories(
 root, subdirectories, cache
 )
 if not cache[root]:
 yield root
def remove_empty_directories(root_path):
 for empty_directory in root_path:
 shutil.rmtree(empty_directory, ignore_errors=True)

You may notice that I've used defaultdict to make it a bit easier to find out if something is known to be empty. Otherwise, I'm taking advantage of the bottom-up approach to avoid having to calculate something repeatedly on the way down.

If you don't want to issue as many shutil.rmtree commands, you could make it a bit more clever, and only report the highest-level empty directory. For example, you could do this at the end of find_empty_directories to not report until you find a non-empty, and then do the children that are empty.

if cache[root]:
 for subdirectory in subdirectories:
 subpath = os.path.join(root, subdirectory) 
 if cache[subpath]:
 yield subpath

Stack Exchange Network

Program to delete directories that do not contain a file

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Program to delete directories that do not contain a file

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions