Automate the Boring Stuff Chapter 10 - Selective Copy by file extension

Question 1

Write a program that walks through a folder tree and searches for files with a certain file extension (such as .pdf or .jpg). Copy these files from whatever location they are in to a new folder.

My solution:

import os, shutil
from pathlib import Path
def copy_extension(basedir, newdir, extension):
 for foldername, subfolders, filenames in os.walk(basedir):
 for filename in filenames:
 if not filename.endswith(extension):
 continue
 file_path = Path(foldername) / Path(filename)
 destination = newdir / filename
 if destination.exists():
 new_destination = copy_increment(destination)
 shutil.copy(file_path, new_destination)
 else: 
 shutil.copy(file_path, destination)
def copy_increment(destination):
 marker = 0
 stem, extension = os.path.splitext(destination)
 while destination.exists():
 marker += 1
 destination = Path(f"{stem} ({marker}){extension}")
 return destination
def main():
 while True:
 user_search = input("Please enter a folder to search: ")
 basedir = Path(user_search)
 if not basedir.is_dir():
 print("This path does not exist.")
 continue
 else:
 user_destination = input("Please enter a the director of the new folder: ")
 newdir = Path(user_destination)
 extension = input("Extension of files to copy: ")
 copy_extension(basedir, newdir, extension)
if __name__ == '__main__':
 main()

Question 2

Besides what has already been pointed out in the answers, I would replace the complexity of dealing with user input by using an existing CLI library like fire or click. Personally, I like fire more, as it is simpler. But click allows for more customisation

Question 3

Old style -vs- new style

You are mixing old-style os. operations and the newer pathlib.Path functions. Pick one ... specifically the newer version.

For instance, os.walk(), and subsequent filtering (such as by extension) can easily be replaced with just Path.rglob():

 source = Path(source_path)
 for src in source.rglob("*.txt"):
 # process the `src` file, which will be a "something.txt" file

Similarly, if destination was a Path:

 stem, extension = os.path.splitext(destination)
...
 destination = Path(f"{stem} ({marker}){extension}")

using the newer pathlib becomes a one-line statement:

 destination = destination.with_stem(f"{stem} ({marker})")

Complexity

As it stands, your script could suffer an \$O(N^2)\$ time complexity slowdown.

Consider dozens of directories, each with dozens of sub-directories, each with a dozens of sub-sub-directories, each with a README.txt file.

You script finds ...

the first README.txt file, and copies it.
the second README.txt file, notes it already exists, so writes README (1).txt.
the third README.txt file, notes it already exists, notes README (1).txt exists, so writes README (2).txt
the fourthREADME.txt file, notes it already exists, notes README (1).txt exists, notes README (2).txt exists, so writes README (3).txt

With N README.txt files, .exists() is called

for README.txt N times,
for README (1).txt N-1 times,
for README (2).txt N-2 times,
for README (3).txt N-3 times,
for README (4).txt N-4 times,
...

... for a total of \$N (N - 1) / 2\$ calls.

If you maintained a counter for each source filename, you wouldn't need to start at 0 and check for existence of each variant of the filename. You could simply start at the correct marker value.

Reworked code:

import shutil
from pathlib import Path
from collections import Counter
def selective_copy(destination: Path, source: Path, pattern: str) -> None:
 occurrences = Counter()
 for src in source.rglob(pattern):
 count = occurrences[src.name]
 stem = src.stem
 dst = destination / src.name
 if count:
 dst = dst.with_stem(f"{stem} ({count})")
 while dst.exists():
 count += 1
 dst = dst.with_stem(f"{stem} ({count})")
 shutil.copy(src, dst)
 occurrences[src.name] = count + 1
if __name__ == '__main__':
 selective_copy(Path("to"), Path("from"), "*.txt")

Of course, replace the main code with your querying the user for the source, destination and pattern (or extension, to which you'd add the "*." prefix).

Question 4

Can you think of a reason that the author chose to mix both styles in his chapter? Specifically, in his copying files and folders example, he uses both.

Question 5

Inertia is usually the reason. Also known as "technical debt".

Question 6

@ack Corrected. Thanks.

Question 7

This is a straightforward and easy-to-understand solution. What follows are readability improvements.

If you are not going to use a part of a tuple, the standard way to indicate this is to use the variable name _:

 for foldername, _, filenames in os.walk(basedir):

If the first line in a loop is a filtering step, it can be incorporated into the loop definition. Instead of

 for filename in filenames:
 if not filename.endswith(extension):
 continue
 do_stuff()

write

 for filename in [fn for fn in filenames if fn.endswith(extension)]:
 do_stuff()

The part in the parentheses is a list comprehension. It is a concise way of expressing lists. A more advanced alternative that needs less repetition of the variable fn uses the built-in function filter

 for filename in filter(lambda fn: fn.endswith(extension), filenames):
 do_stuff()

Your copy_increment() function returns the correct file name when the file name does not already exist. The check for destination.exists() in copy_extension() is not necessary since that check happens within copy_increment(). Instead of

 destination = newdir / filename
 if destination.exists():
 new_destination = copy_increment(destination)
 shutil.copy(file_path, new_destination)
 else: 
 shutil.copy(file_path, destination)

you could write

 destination = copy_increment(newdir / filename)
 shutil.copy(file_path, destination)

After this line in main()

 newdir = Path(user_destination)

shouldn't this folder be created if it doesn't already exist?

Question 8

I'm not sure about the advice of incorporating the list comprehension into the loop definition. With the list comprehension, you are actually looping twice, once to build the list, and again to iterate over it

Question 9

@C.Nivs: At the very least, it should be generator comprehension (using parentheses), but even then the longer line may hinder readability, especially if there are multiple filters.

AJNeufeld AJNeufeld 35.2k5 gold badges41 silver badges103 bronze badges · Accepted Answer · 2022-10-13 08:50:07Z

Old style -vs- new style

You are mixing old-style os. operations and the newer pathlib.Path functions. Pick one ... specifically the newer version.

For instance, os.walk(), and subsequent filtering (such as by extension) can easily be replaced with just Path.rglob():

 source = Path(source_path)
 for src in source.rglob("*.txt"):
 # process the `src` file, which will be a "something.txt" file

Similarly, if destination was a Path:

 stem, extension = os.path.splitext(destination)
...
 destination = Path(f"{stem} ({marker}){extension}")

using the newer pathlib becomes a one-line statement:

 destination = destination.with_stem(f"{stem} ({marker})")

Complexity

As it stands, your script could suffer an \$O(N^2)\$ time complexity slowdown.

Consider dozens of directories, each with dozens of sub-directories, each with a dozens of sub-sub-directories, each with a README.txt file.

You script finds ...

the first README.txt file, and copies it.
the second README.txt file, notes it already exists, so writes README (1).txt.
the third README.txt file, notes it already exists, notes README (1).txt exists, so writes README (2).txt
the fourthREADME.txt file, notes it already exists, notes README (1).txt exists, notes README (2).txt exists, so writes README (3).txt

With N README.txt files, .exists() is called

for README.txt N times,
for README (1).txt N-1 times,
for README (2).txt N-2 times,
for README (3).txt N-3 times,
for README (4).txt N-4 times,
...

... for a total of \$N (N - 1) / 2\$ calls.

If you maintained a counter for each source filename, you wouldn't need to start at 0 and check for existence of each variant of the filename. You could simply start at the correct marker value.

Reworked code:

import shutil
from pathlib import Path
from collections import Counter
def selective_copy(destination: Path, source: Path, pattern: str) -> None:
 occurrences = Counter()
 for src in source.rglob(pattern):
 count = occurrences[src.name]
 stem = src.stem
 dst = destination / src.name
 if count:
 dst = dst.with_stem(f"{stem} ({count})")
 while dst.exists():
 count += 1
 dst = dst.with_stem(f"{stem} ({count})")
 shutil.copy(src, dst)
 occurrences[src.name] = count + 1
if __name__ == '__main__':
 selective_copy(Path("to"), Path("from"), "*.txt")

Of course, replace the main code with your querying the user for the source, destination and pattern (or extension, to which you'd add the "*." prefix).

Can you think of a reason that the author chose to mix both styles in his chapter? Specifically, in his copying files and folders example, he uses both.
Inertia is usually the reason. Also known as "technical debt".

Stack Exchange Network

Automate the Boring Stuff Chapter 10 - Selective Copy by file extension

2 Answers 2

Old style -vs- new style

Complexity

Reworked code:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Automate the Boring Stuff Chapter 10 - Selective Copy by file extension

2 Answers 2

Old style -vs- new style

Complexity

Reworked code:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions