The project outline:
Write a program that walks through a folder tree and searches for files with a certain file extension (such as .pdf or .jpg). Copy these files from whatever location they are in to a new folder.
My solution:
import os, shutil
from pathlib import Path
def copy_extension(basedir, newdir, extension):
for foldername, subfolders, filenames in os.walk(basedir):
for filename in filenames:
if not filename.endswith(extension):
continue
file_path = Path(foldername) / Path(filename)
destination = newdir / filename
if destination.exists():
new_destination = copy_increment(destination)
shutil.copy(file_path, new_destination)
else:
shutil.copy(file_path, destination)
def copy_increment(destination):
marker = 0
stem, extension = os.path.splitext(destination)
while destination.exists():
marker += 1
destination = Path(f"{stem} ({marker}){extension}")
return destination
def main():
while True:
user_search = input("Please enter a folder to search: ")
basedir = Path(user_search)
if not basedir.is_dir():
print("This path does not exist.")
continue
else:
user_destination = input("Please enter a the director of the new folder: ")
newdir = Path(user_destination)
extension = input("Extension of files to copy: ")
copy_extension(basedir, newdir, extension)
if __name__ == '__main__':
main()
2 Answers 2
Old style -vs- new style
You are mixing old-style os.
operations and the newer pathlib.Path
functions. Pick one ... specifically the newer version.
For instance, os.walk()
, and subsequent filtering (such as by extension) can easily be replaced with just Path.rglob()
:
source = Path(source_path)
for src in source.rglob("*.txt"):
# process the `src` file, which will be a "something.txt" file
Similarly, if destination
was a Path
:
stem, extension = os.path.splitext(destination)
...
destination = Path(f"{stem} ({marker}){extension}")
using the newer pathlib
becomes a one-line statement:
destination = destination.with_stem(f"{stem} ({marker})")
Complexity
As it stands, your script could suffer an \$O(N^2)\$ time complexity slowdown.
Consider dozens of directories, each with dozens of sub-directories, each with a dozens of sub-sub-directories, each with a README.txt
file.
You script finds ...
- the first
README.txt
file, and copies it. - the second
README.txt
file, notes it already exists, so writesREADME (1).txt
. - the third
README.txt
file, notes it already exists, notesREADME (1).txt
exists, so writesREADME (2)
.txt - the fourth
README.txt
file, notes it already exists, notesREADME (1).txt
exists, notesREADME (2).txt
exists, so writesREADME (3)
.txt
With N README.txt
files, .exists()
is called
- for
README.txt
N times, - for
README (1).txt
N-1 times, - for
README (2).txt
N-2 times, - for
README (3).txt
N-3 times, - for
README (4).txt
N-4 times, - ...
... for a total of \$N (N - 1) / 2\$ calls.
If you maintained a counter for each source filename, you wouldn't need to start at 0 and check for existence of each variant of the filename. You could simply start at the correct marker
value.
Reworked code:
import shutil
from pathlib import Path
from collections import Counter
def selective_copy(destination: Path, source: Path, pattern: str) -> None:
occurrences = Counter()
for src in source.rglob(pattern):
count = occurrences[src.name]
stem = src.stem
dst = destination / src.name
if count:
dst = dst.with_stem(f"{stem} ({count})")
while dst.exists():
count += 1
dst = dst.with_stem(f"{stem} ({count})")
shutil.copy(src, dst)
occurrences[src.name] = count + 1
if __name__ == '__main__':
selective_copy(Path("to"), Path("from"), "*.txt")
Of course, replace the main code with your querying the user for the source, destination and pattern (or extension, to which you'd add the "*." prefix).
-
\$\begingroup\$ Can you think of a reason that the author chose to mix both styles in his chapter? Specifically, in his copying files and folders example, he uses both. \$\endgroup\$Javana– Javana2022年10月13日 11:47:48 +00:00Commented Oct 13, 2022 at 11:47
-
1\$\begingroup\$ Inertia is usually the reason. Also known as "technical debt". \$\endgroup\$AJNeufeld– AJNeufeld2022年10月13日 15:40:52 +00:00Commented Oct 13, 2022 at 15:40
-
\$\begingroup\$ @ack Corrected. Thanks. \$\endgroup\$AJNeufeld– AJNeufeld2022年10月14日 18:21:33 +00:00Commented Oct 14, 2022 at 18:21
This is a straightforward and easy-to-understand solution. What follows are readability improvements.
If you are not going to use a part of a tuple, the standard way to indicate this is to use the variable name _
:
for foldername, _, filenames in os.walk(basedir):
If the first line in a loop is a filtering step, it can be incorporated into the loop definition. Instead of
for filename in filenames:
if not filename.endswith(extension):
continue
do_stuff()
write
for filename in [fn for fn in filenames if fn.endswith(extension)]:
do_stuff()
The part in the parentheses is a list comprehension. It is a concise way of expressing lists. A more advanced alternative that needs less repetition of the variable fn
uses the built-in function filter
for filename in filter(lambda fn: fn.endswith(extension), filenames):
do_stuff()
Your copy_increment()
function returns the correct file name when the file name does not already exist. The check for destination.exists()
in copy_extension()
is not necessary since that check happens within copy_increment()
. Instead of
destination = newdir / filename
if destination.exists():
new_destination = copy_increment(destination)
shutil.copy(file_path, new_destination)
else:
shutil.copy(file_path, destination)
you could write
destination = copy_increment(newdir / filename)
shutil.copy(file_path, destination)
After this line in main()
newdir = Path(user_destination)
shouldn't this folder be created if it doesn't already exist?
-
2\$\begingroup\$ I'm not sure about the advice of incorporating the list comprehension into the loop definition. With the list comprehension, you are actually looping twice, once to build the list, and again to iterate over it \$\endgroup\$C.Nivs– C.Nivs2022年10月13日 13:49:54 +00:00Commented Oct 13, 2022 at 13:49
-
\$\begingroup\$ @C.Nivs: At the very least, it should be generator comprehension (using parentheses), but even then the longer line may hinder readability, especially if there are multiple filters. \$\endgroup\$Matthieu M.– Matthieu M.2022年10月13日 16:33:52 +00:00Commented Oct 13, 2022 at 16:33
fire
more, as it is simpler. Butclick
allows for more customisation \$\endgroup\$