Combining iterrows of pandas with os.walk given I have a lot of directories

Question 1

I have the following code snippet:

for index, row in df.iterrows():
 if row['MSI Status'] == 'MSS':
 print(row['Patient ID'])
 
 for dirpath, dirnames, filenames in os.walk(path):
 for dirname in dirnames:
 joined_path = os.path.join(path, dirname)
 if row['Patient ID'] in dirname:
 shutil.copytree(joined_path, os.path.join('/SeaExp/mona/MSS_Status/MSS', dirname))
 if row['MSI Status'] == 'MSI-H':
 print(row['Patient ID'])
 
 for dirpath, dirnames, filenames in os.walk(path):
 for dirname in dirnames:
 joined_path = os.path.join(path, dirname)
 if row['Patient ID'] in dirname:
 shutil.copytree(joined_path, os.path.join('/SeaExp/mona/MSS_Status/MSI-H', dirname))

I only have 100 rows in my df but I have many directories in my os.walk. I understand my code is not written well or not efficient and is slow and I hope to get some feedback.

Question 2

What does the file structure look like?

Question 3

First off, you seem to have a typo: os.path.join(path, dirname) should almost certainly be os.path.join(dirpath, dirname). The only way I could see this not causing problems is if all the directories you want to copy are immediate children of path - and in that case you don't want os.walk at all, but should instead use os.scandir. But going forward I'll assume that os.walk is in fact needed here
Since the target directory's name matches the MSI Status field, and those two loops are otherwise identical, it'd be easy to merge those two loops into a single one, ending with something like:
```
destination = os.path.join("/SeaExp/mona/MSS_Status", row["MSI Status"], dirname)
shutil.copytree(joined_path, destination)
```
If those are the only two statuses that are possible, you'd be able to get rid of the ifs entirely. If not, you can still simplify it to something akin to if row['MSI Status'] in {'MSI-H', 'MSS'}:
Back to os.walk issues, once you've found a directory to copy, will you ever need to copy any subdirectory of that directory? For example, for a patient with ID somebody, would you want/need to copy both ./somebody/ and ./somebody/appointment_history/somebody_2020年12月13日 independently? If not, you should modify the dirnames list while iterating to avoid descending into the directories you've already copied - which could perhaps look like this:
```
for dirpath, dirnames, filenames in os.walk(path):
 remaining = []
 for dirname in dirnames:
 if row['Patient ID'] in dirname:
 source = os.path.join(path, dirname)
 destination = os.path.join('/SeaExp/mona/MSS_Status', row['MSI Status'], dirname)
 else:
 remaining.append(dirname)
 dirnames[:] = remaining # Note the use of [:] to update in-place
```

Finally, pandas' iterrows is almost certainly faster than os.walk, so if we're going to do one of those multiple times, it might be best to let that be the iterrows. You might save time by turning your code inside-out like so:

for dirpath, dirnames, filenames in os.walk(path):
 remaining = []
 for index, row in df.iterrows():
 for dirname in dirnames:
 if row['Patient ID'] in dirname:
 source = os.path.join(dirpath, dirname)
 destination = os.path.join(
 '/SeaExp/mona/MSS_Status',
 row['MSI Status'],
 dirname
 )
 shutil.copytree(source, destination)
 else:
 remaining.append(dirname)
 dirnames[:] = remaining

Question 4

If one were to combine our answers, your final bullet point would have the benefit of never walking down trees we've copied from ever. Gaining an even larger benefit than either of our answers would have standing alone. Nice answer.

Question 5

You should make a function to contain the os.walk(path). As both walks are almost identical, the difference is the destination directory.
You're not walking correctly. If I have the file structure:
```
──.
 └─test
 └─test_nested
 └─test_file
```
Then os.walk would walk down the tree, and dirnames would only contain the directory's name (test, test_nested). Ok, so lets say row['Patient ID'] is test_nested then we're into the if.

Now we're copying from ./test_nested to /SeaExp/mona/MSS_Status/MSS/test_nested. However the path to test_nested is actually ./test/test_nested.

You need to either:
1. Include dirpath in the source and/or the destination path. os.path.join(src, dirpath, dirname).
2. Or assuming your code works correctly because if row['Patient ID'] in dirname: is never true in a subdirectory you need to break after the first iteration of the tree.
When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
Python's os.walk docs

As such you should dirnames.remove(dirname) to prevent walking the path and duplicating copies. Or just wasting time.

def copy_tree__nested(src, dest):
 for dirpath, dirnames, filenames in os.walk(src):
 for dirname in dirnames[:]:
 if row['Patient ID'] in dirname:
 dirnames.remove(dirname)
 shutil.copytree(os.path.join(src, dirpath, dirname), os.path.join(dest, dirpath, dirname))
def copy_tree__tld(src, dest):
 for dirpath, dirnames, filenames in os.walk(src):
 for dirname in dirnames:
 if row['Patient ID'] in dirname:
 shutil.copytree(os.path.join(src, dirname), os.path.join(dest, dirname))
 break
for index, row in df.iterrows():
 if row['MSI Status'] == 'MSS':
 print(row['Patient ID'])
 copy_tree(path, '/SeaExp/mona/MSS_Status/MSS')
 if row['MSI Status'] == 'MSI-H':
 print(row['Patient ID'])
 copy_tree(path, '/SeaExp/mona/MSS_Status/MSI-H')

Sara J Sara J 4,15612 silver badges37 bronze badges · Answer 1 · 2021-10-13 20:30:31Z

First off, you seem to have a typo: os.path.join(path, dirname) should almost certainly be os.path.join(dirpath, dirname). The only way I could see this not causing problems is if all the directories you want to copy are immediate children of path - and in that case you don't want os.walk at all, but should instead use os.scandir. But going forward I'll assume that os.walk is in fact needed here
Since the target directory's name matches the MSI Status field, and those two loops are otherwise identical, it'd be easy to merge those two loops into a single one, ending with something like:
```
destination = os.path.join("/SeaExp/mona/MSS_Status", row["MSI Status"], dirname)
shutil.copytree(joined_path, destination)
```
If those are the only two statuses that are possible, you'd be able to get rid of the ifs entirely. If not, you can still simplify it to something akin to if row['MSI Status'] in {'MSI-H', 'MSS'}:
Back to os.walk issues, once you've found a directory to copy, will you ever need to copy any subdirectory of that directory? For example, for a patient with ID somebody, would you want/need to copy both ./somebody/ and ./somebody/appointment_history/somebody_2020年12月13日 independently? If not, you should modify the dirnames list while iterating to avoid descending into the directories you've already copied - which could perhaps look like this:
```
for dirpath, dirnames, filenames in os.walk(path):
 remaining = []
 for dirname in dirnames:
 if row['Patient ID'] in dirname:
 source = os.path.join(path, dirname)
 destination = os.path.join('/SeaExp/mona/MSS_Status', row['MSI Status'], dirname)
 else:
 remaining.append(dirname)
 dirnames[:] = remaining # Note the use of [:] to update in-place
```

Finally, pandas' iterrows is almost certainly faster than os.walk, so if we're going to do one of those multiple times, it might be best to let that be the iterrows. You might save time by turning your code inside-out like so:

for dirpath, dirnames, filenames in os.walk(path):
 remaining = []
 for index, row in df.iterrows():
 for dirname in dirnames:
 if row['Patient ID'] in dirname:
 source = os.path.join(dirpath, dirname)
 destination = os.path.join(
 '/SeaExp/mona/MSS_Status',
 row['MSI Status'],
 dirname
 )
 shutil.copytree(source, destination)
 else:
 remaining.append(dirname)
 dirnames[:] = remaining

If one were to combine our answers, your final bullet point would have the benefit of never walking down trees we've copied from ever. Gaining an even larger benefit than either of our answers would have standing alone. Nice answer.

Peilonrayz ♦Peilonrayz 44.4k7 gold badges80 silver badges157 bronze badges · Answer 2 · 2021-10-13 19:36:30Z

You should make a function to contain the os.walk(path). As both walks are almost identical, the difference is the destination directory.
You're not walking correctly. If I have the file structure:
```
──.
 └─test
 └─test_nested
 └─test_file
```
Then os.walk would walk down the tree, and dirnames would only contain the directory's name (test, test_nested). Ok, so lets say row['Patient ID'] is test_nested then we're into the if.

Now we're copying from ./test_nested to /SeaExp/mona/MSS_Status/MSS/test_nested. However the path to test_nested is actually ./test/test_nested.

You need to either:
1. Include dirpath in the source and/or the destination path. os.path.join(src, dirpath, dirname).
2. Or assuming your code works correctly because if row['Patient ID'] in dirname: is never true in a subdirectory you need to break after the first iteration of the tree.
When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
Python's os.walk docs

As such you should dirnames.remove(dirname) to prevent walking the path and duplicating copies. Or just wasting time.

def copy_tree__nested(src, dest):
 for dirpath, dirnames, filenames in os.walk(src):
 for dirname in dirnames[:]:
 if row['Patient ID'] in dirname:
 dirnames.remove(dirname)
 shutil.copytree(os.path.join(src, dirpath, dirname), os.path.join(dest, dirpath, dirname))
def copy_tree__tld(src, dest):
 for dirpath, dirnames, filenames in os.walk(src):
 for dirname in dirnames:
 if row['Patient ID'] in dirname:
 shutil.copytree(os.path.join(src, dirname), os.path.join(dest, dirname))
 break
for index, row in df.iterrows():
 if row['MSI Status'] == 'MSS':
 print(row['Patient ID'])
 copy_tree(path, '/SeaExp/mona/MSS_Status/MSS')
 if row['MSI Status'] == 'MSI-H':
 print(row['Patient ID'])
 copy_tree(path, '/SeaExp/mona/MSS_Status/MSI-H')

Stack Exchange Network

Combining iterrows of pandas with os.walk given I have a lot of directories

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Combining iterrows of pandas with os.walk given I have a lot of directories

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions