I have the following code snippet:
for index, row in df.iterrows():
if row['MSI Status'] == 'MSS':
print(row['Patient ID'])
for dirpath, dirnames, filenames in os.walk(path):
for dirname in dirnames:
joined_path = os.path.join(path, dirname)
if row['Patient ID'] in dirname:
shutil.copytree(joined_path, os.path.join('/SeaExp/mona/MSS_Status/MSS', dirname))
if row['MSI Status'] == 'MSI-H':
print(row['Patient ID'])
for dirpath, dirnames, filenames in os.walk(path):
for dirname in dirnames:
joined_path = os.path.join(path, dirname)
if row['Patient ID'] in dirname:
shutil.copytree(joined_path, os.path.join('/SeaExp/mona/MSS_Status/MSI-H', dirname))
I only have 100 rows in my df but I have many directories in my os.walk. I understand my code is not written well or not efficient and is slow and I hope to get some feedback.
-
1\$\begingroup\$ What does the file structure look like? \$\endgroup\$Sara J– Sara J2021年10月13日 19:03:40 +00:00Commented Oct 13, 2021 at 19:03
2 Answers 2
First off, you seem to have a typo:
os.path.join(path, dirname)
should almost certainly beos.path.join(dirpath, dirname)
. The only way I could see this not causing problems is if all the directories you want to copy are immediate children ofpath
- and in that case you don't wantos.walk
at all, but should instead useos.scandir
. But going forward I'll assume thatos.walk
is in fact needed hereSince the target directory's name matches the
MSI Status
field, and those two loops are otherwise identical, it'd be easy to merge those two loops into a single one, ending with something like:destination = os.path.join("/SeaExp/mona/MSS_Status", row["MSI Status"], dirname) shutil.copytree(joined_path, destination)
If those are the only two statuses that are possible, you'd be able to get rid of the
if
s entirely. If not, you can still simplify it to something akin toif row['MSI Status'] in {'MSI-H', 'MSS'}:
Back to
os.walk
issues, once you've found a directory to copy, will you ever need to copy any subdirectory of that directory? For example, for a patient with IDsomebody
, would you want/need to copy both./somebody/
and./somebody/appointment_history/somebody_2020年12月13日
independently? If not, you should modify thedirnames
list while iterating to avoid descending into the directories you've already copied - which could perhaps look like this:for dirpath, dirnames, filenames in os.walk(path): remaining = [] for dirname in dirnames: if row['Patient ID'] in dirname: source = os.path.join(path, dirname) destination = os.path.join('/SeaExp/mona/MSS_Status', row['MSI Status'], dirname) else: remaining.append(dirname) dirnames[:] = remaining # Note the use of [:] to update in-place
Finally, pandas'
iterrows
is almost certainly faster thanos.walk
, so if we're going to do one of those multiple times, it might be best to let that be theiterrows
. You might save time by turning your code inside-out like so:for dirpath, dirnames, filenames in os.walk(path): remaining = [] for index, row in df.iterrows(): for dirname in dirnames: if row['Patient ID'] in dirname: source = os.path.join(dirpath, dirname) destination = os.path.join( '/SeaExp/mona/MSS_Status', row['MSI Status'], dirname ) shutil.copytree(source, destination) else: remaining.append(dirname) dirnames[:] = remaining
-
1\$\begingroup\$ If one were to combine our answers, your final bullet point would have the benefit of never walking down trees we've copied from ever. Gaining an even larger benefit than either of our answers would have standing alone. Nice answer. \$\endgroup\$2021年10月14日 11:06:16 +00:00Commented Oct 14, 2021 at 11:06
You should make a function to contain the
os.walk(path)
. As both walks are almost identical, the difference is the destination directory.You're not walking correctly. If I have the file structure:
──. └─test └─test_nested └─test_file
Then
os.walk
would walk down the tree, anddirnames
would only contain the directory's name (test
,test_nested
). Ok, so lets sayrow['Patient ID']
istest_nested
then we're into theif
.Now we're copying from
./test_nested
to/SeaExp/mona/MSS_Status/MSS/test_nested
. However the path totest_nested
is actually./test/test_nested
.You need to either:
- Include
dirpath
in the source and/or the destination path.os.path.join(src, dirpath, dirname)
. - Or assuming your code works correctly because
if row['Patient ID'] in dirname:
is never true in a subdirectory you need to break after the first iteration of the tree.
- Include
-
When topdown is
True
, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
Python'sos.walk
docsAs such you should
dirnames.remove(dirname)
to prevent walking the path and duplicating copies. Or just wasting time.
def copy_tree__nested(src, dest):
for dirpath, dirnames, filenames in os.walk(src):
for dirname in dirnames[:]:
if row['Patient ID'] in dirname:
dirnames.remove(dirname)
shutil.copytree(os.path.join(src, dirpath, dirname), os.path.join(dest, dirpath, dirname))
def copy_tree__tld(src, dest):
for dirpath, dirnames, filenames in os.walk(src):
for dirname in dirnames:
if row['Patient ID'] in dirname:
shutil.copytree(os.path.join(src, dirname), os.path.join(dest, dirname))
break
for index, row in df.iterrows():
if row['MSI Status'] == 'MSS':
print(row['Patient ID'])
copy_tree(path, '/SeaExp/mona/MSS_Status/MSS')
if row['MSI Status'] == 'MSI-H':
print(row['Patient ID'])
copy_tree(path, '/SeaExp/mona/MSS_Status/MSI-H')