1
\$\begingroup\$

I have the following code snippet:

for index, row in df.iterrows():
 if row['MSI Status'] == 'MSS':
 print(row['Patient ID'])
 
 for dirpath, dirnames, filenames in os.walk(path):
 for dirname in dirnames:
 joined_path = os.path.join(path, dirname)
 if row['Patient ID'] in dirname:
 shutil.copytree(joined_path, os.path.join('/SeaExp/mona/MSS_Status/MSS', dirname))
 if row['MSI Status'] == 'MSI-H':
 print(row['Patient ID'])
 
 for dirpath, dirnames, filenames in os.walk(path):
 for dirname in dirnames:
 joined_path = os.path.join(path, dirname)
 if row['Patient ID'] in dirname:
 shutil.copytree(joined_path, os.path.join('/SeaExp/mona/MSS_Status/MSI-H', dirname))

I only have 100 rows in my df but I have many directories in my os.walk. I understand my code is not written well or not efficient and is slow and I hope to get some feedback.

BCdotWEB
11.4k2 gold badges28 silver badges45 bronze badges
asked Oct 13, 2021 at 17:55
\$\endgroup\$
1
  • 1
    \$\begingroup\$ What does the file structure look like? \$\endgroup\$ Commented Oct 13, 2021 at 19:03

2 Answers 2

5
\$\begingroup\$
  • First off, you seem to have a typo: os.path.join(path, dirname) should almost certainly be os.path.join(dirpath, dirname). The only way I could see this not causing problems is if all the directories you want to copy are immediate children of path - and in that case you don't want os.walk at all, but should instead use os.scandir. But going forward I'll assume that os.walk is in fact needed here

  • Since the target directory's name matches the MSI Status field, and those two loops are otherwise identical, it'd be easy to merge those two loops into a single one, ending with something like:

    destination = os.path.join("/SeaExp/mona/MSS_Status", row["MSI Status"], dirname)
    shutil.copytree(joined_path, destination)
    
  • If those are the only two statuses that are possible, you'd be able to get rid of the ifs entirely. If not, you can still simplify it to something akin to if row['MSI Status'] in {'MSI-H', 'MSS'}:

  • Back to os.walk issues, once you've found a directory to copy, will you ever need to copy any subdirectory of that directory? For example, for a patient with ID somebody, would you want/need to copy both ./somebody/ and ./somebody/appointment_history/somebody_2020年12月13日 independently? If not, you should modify the dirnames list while iterating to avoid descending into the directories you've already copied - which could perhaps look like this:

    for dirpath, dirnames, filenames in os.walk(path):
     remaining = []
     for dirname in dirnames:
     if row['Patient ID'] in dirname:
     source = os.path.join(path, dirname)
     destination = os.path.join('/SeaExp/mona/MSS_Status', row['MSI Status'], dirname)
     else:
     remaining.append(dirname)
     dirnames[:] = remaining # Note the use of [:] to update in-place
    
  • Finally, pandas' iterrows is almost certainly faster than os.walk, so if we're going to do one of those multiple times, it might be best to let that be the iterrows. You might save time by turning your code inside-out like so:

    for dirpath, dirnames, filenames in os.walk(path):
     remaining = []
     for index, row in df.iterrows():
     for dirname in dirnames:
     if row['Patient ID'] in dirname:
     source = os.path.join(dirpath, dirname)
     destination = os.path.join(
     '/SeaExp/mona/MSS_Status',
     row['MSI Status'],
     dirname
     )
     shutil.copytree(source, destination)
     else:
     remaining.append(dirname)
     dirnames[:] = remaining
    
tdy
2,2661 gold badge10 silver badges21 bronze badges
answered Oct 13, 2021 at 20:30
\$\endgroup\$
1
  • 1
    \$\begingroup\$ If one were to combine our answers, your final bullet point would have the benefit of never walking down trees we've copied from ever. Gaining an even larger benefit than either of our answers would have standing alone. Nice answer. \$\endgroup\$ Commented Oct 14, 2021 at 11:06
3
\$\begingroup\$
  • You should make a function to contain the os.walk(path). As both walks are almost identical, the difference is the destination directory.

  • You're not walking correctly. If I have the file structure:

    ──.
     └─test
     └─test_nested
     └─test_file
    

    Then os.walk would walk down the tree, and dirnames would only contain the directory's name (test, test_nested). Ok, so lets say row['Patient ID'] is test_nested then we're into the if.

    Now we're copying from ./test_nested to /SeaExp/mona/MSS_Status/MSS/test_nested. However the path to test_nested is actually ./test/test_nested.

    You need to either:

    1. Include dirpath in the source and/or the destination path. os.path.join(src, dirpath, dirname).
    2. Or assuming your code works correctly because if row['Patient ID'] in dirname: is never true in a subdirectory you need to break after the first iteration of the tree.
  • When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
    Python's os.walk docs

    As such you should dirnames.remove(dirname) to prevent walking the path and duplicating copies. Or just wasting time.

def copy_tree__nested(src, dest):
 for dirpath, dirnames, filenames in os.walk(src):
 for dirname in dirnames[:]:
 if row['Patient ID'] in dirname:
 dirnames.remove(dirname)
 shutil.copytree(os.path.join(src, dirpath, dirname), os.path.join(dest, dirpath, dirname))
def copy_tree__tld(src, dest):
 for dirpath, dirnames, filenames in os.walk(src):
 for dirname in dirnames:
 if row['Patient ID'] in dirname:
 shutil.copytree(os.path.join(src, dirname), os.path.join(dest, dirname))
 break
for index, row in df.iterrows():
 if row['MSI Status'] == 'MSS':
 print(row['Patient ID'])
 copy_tree(path, '/SeaExp/mona/MSS_Status/MSS')
 if row['MSI Status'] == 'MSI-H':
 print(row['Patient ID'])
 copy_tree(path, '/SeaExp/mona/MSS_Status/MSI-H')
answered Oct 13, 2021 at 19:36
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.