
This program I have written automates some of the more manual work in getting the results from my experiments.

The idea is that I have folders and subdirectories in my main directory. Some of these subdirectories contain the data I need to process. Inside each data subdirectory there are another five subdirectories with the files. To process these files, I must run a program called processResults.py.

For example, if I have two data folders, then each of these will have 5 subdirectories, for a total of 10 subdirectories, each of which must have processResults.py copied into it and then run.

Running processResults.py produces a new directory called results that has a summary csv file inside it called Summary_Total.csv.

All Summary_Total.csv files in a given data folder must be concatenated into a new csv file for processing. So, in our example, there would be two end-product csv files, each containing the five Summary_Total.csv files (one per subdirectory) of its data folder.

The process described above is what I have automated through a program called generate_results.py.

Before I provide a manual demonstration and an automated one, I want to say that I am looking to:

1. Reduce the number of lines of my code

2. Make the code more efficient or optimized

3. Improve documentation

After the demonstrations, the code for processResults.py and generate_results.py, the program I would like to improve, is provided.

(base) ip-75-193:Test trevormartin$ ls
N_Test UN_Test processResults.py generate_results.py
(base) ip-75-193:Test trevormartin$ cd N_Test
(base) ip-75-193:N_Test trevormartin$ ls
x_1 x_3 x_5
x_2 x_4
(base) ip-75-193:N_Test trevormartin$ cd ..; cd UN_Test; ls; cd ..
xx_1 xx_3 xx_5
xx_2 xx_4
(base) ip-75-193:Test trevormartin$

The subdirectories x_1 ... x_5 and xx_1 ... xx_5 are currently empty.

For testing, here is what you should do:

(base) ip-75-193:Home trevormartin$ mkdir Test
(base) ip-75-193:Home trevormartin$ cd Test
(base) ip-75-193:Test trevormartin$ emacs processResults.py
 (copy my code for processResults.py)
(base) ip-75-193:Test trevormartin$ emacs generate_results.py
 (copy my code for generate_results.py)
(base) ip-75-193:Test trevormartin$ mkdir N_Test
(base) ip-75-193:Test trevormartin$ mkdir UN_Test
(base) ip-75-193:Test trevormartin$ cd N_Test
(base) ip-75-193:N_Test trevormartin$ mkdir x_1 x_2 x_3 x_4 x_5
(base) ip-75-193:N_Test trevormartin$ cd ..
(base) ip-75-193:Test trevormartin$ cd UN_Test
(base) ip-75-193:UN_Test trevormartin$ mkdir xx_1 xx_2 xx_3 xx_4 xx_5
(base) ip-75-193:UN_Test trevormartin$ cd ..

You are now good to go.

Now I will show the process that used to be done manually.

(base) ip-75-193:Test trevormartin$ cp processResults.py ./N_Test/x_1
(base) ip-75-193:Test trevormartin$ cp processResults.py ./N_Test/x_2
.
.
.
(base) ip-75-193:Test trevormartin$ cp processResults.py ./N_Test/x_5
(base) ip-75-193:Test trevormartin$ cp processResults.py ./UN_Test/xx_1
(base) ip-75-193:Test trevormartin$ cp processResults.py ./UN_Test/xx_2
.
.
.
(base) ip-75-193:Test trevormartin$ cp processResults.py ./UN_Test/xx_5
(base) ip-75-193:Test trevormartin$ cd ./N_Test/x_1; python3 processResults.py; cd ..; cd ..
.
.
.
(base) ip-75-193:Test trevormartin$ cd ./N_Test/x_5; python3 processResults.py; cd ..; cd ..
(base) ip-75-193:Test trevormartin$ cd ./UN_Test/xx_1; python3 processResults.py; cd ..; cd ..
.
.
.
(base) ip-75-193:Test trevormartin$ cd ./UN_Test/xx_5; python3 processResults.py; cd ..; cd ..
(base) ip-75-193:Test trevormartin$ cd ./N_Test
(base) ip-75-193:N_Test trevormartin$ ls 
x_1 x_2 x_4
x_3 x_5
(base) ip-75-193:N_Test trevormartin$ cd x_1; ls
processResults.py results
(base) ip-75-193:x_1 trevormartin$ cd results; ls
Summary_Total.csv
(base) ip-75-193:results trevormartin$ cat Summary_Total.csv
A,B,C,D,E,F,G
1,1,1,1,1,1,1
2,2,2,2,2,2,2
3,3,3,3,3,3,3

I would then combine the Summary_Total.csv files from all subdirectories into one large csv file for each data folder; a rough sketch of that manual step is below.
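For a single data folder, that manual combining step amounts to something like the following pandas snippet. This is only a sketch: the hard-coded x_1 ... x_5 paths assume the N_Test layout shown above.

import pandas

# read the five per-subdirectory summaries of N_Test and stack them into one csv
parts = [pandas.read_csv('N_Test/x_%d/results/Summary_Total.csv' % i)
         for i in range(1, 6)]
pandas.concat(parts).to_csv('N_Test.csv', index=False)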

Here is the demonstration of the automated one. There are no files in the subdirectories x_1,...,x_5 and xx_1,...,xx_5 at this point.

(base) ip-75-193:Test trevormartin$ ls
N_Test UN_Test processResults.py generate_results.py
(base) ip-75-193:Test trevormartin$ cd N_Test; ls; cd ..
x_1 x_3 x_5
x_2 x_4
(base) ip-75-193:Test trevormartin$ python3 generate_results.py
(base) ip-75-193:Test trevormartin$ ls 
N_Test UN_Test processResults.py
N_Test.csv UN_Test.csv generate_results.py
(base) ip-75-193:Test trevormartin$ cat N_Test.csv
A,B,C,D,E,F,G
1,1,1,1,1,1,1
2,2,2,2,2,2,2
3,3,3,3,3,3,3
1,1,1,1,1,1,1
2,2,2,2,2,2,2
3,3,3,3,3,3,3
1,1,1,1,1,1,1
2,2,2,2,2,2,2
3,3,3,3,3,3,3
1,1,1,1,1,1,1
2,2,2,2,2,2,2
3,3,3,3,3,3,3
1,1,1,1,1,1,1
2,2,2,2,2,2,2
3,3,3,3,3,3,3

Below is generate_results.py, the program I would like to improve, followed by processResults.py, which generate_results.py runs in each subdirectory.

generate_results.py

'''
Automates running processResults.py and then concatenating the
csv files into a large csv file
'''
import csv
import pandas
import os
from subprocess import Popen, call
from functools import reduce
cwd = os.getcwd()
dirs = os.listdir(path='.')
# all data dirs begin with N or U, no other dirs begin with these letters 
needed_dirs = [cwd+'/'+dir for dir in dirs if dir[0] == 'N' or dir[0] == 'U']
for dir in needed_dirs:
    # this only matters if num data folders > 1
    os.chdir(cwd)
    # copy processResults.py into data folder (i.e. N_Test or UN_Test)
    process1 = Popen(['cp','processResults.py',dir])
    process1.communicate()
    # change directory to inside data folder (i.e. ./N_Test or ./UN_Test)
    os.chdir(dir)
    cwd2 = os.getcwd()
    # get all sub dirs in this directory
    sub_dirs = next(os.walk('.'))[1]
    # needed_subs will look like ['cwd/N_Test/x_1'...'cwd/N_Test/x_5']
    needed_subs = [cwd2+'/'+sub_dir for sub_dir in sub_dirs]
    summaries_csvs = []
    for sub_dir in needed_subs:
        # put processResults.py into each sub_directory
        call(["cp","processResults.py",sub_dir])
        os.chdir(sub_dir)
        # run processResults.py
        call(["python3","processResults.py"])
        # change the directory back
        os.chdir(dir)
    for sub_dir in needed_subs:
        # go into each 'results' (generated by processResults.py) and collect all summary csvs
        os.chdir(sub_dir+'/results')
        summary = pandas.read_csv('Summary_Total.csv', delimiter=',')
        summaries_csvs.append(summary)
    # concatenate all csvs as a pandas dataframe
    all_dfs = pandas.concat(summaries_csvs).reset_index(drop=True)
    # convert the summaries for a given data folder into a large csv (i.e N_Test.csv or UN_Test.csv)
    all_dfs.to_csv(path_or_buf=dir+'.csv',
                   columns=all_dfs.columns,
                   index=False,
                   sep=',',
                   encoding='utf-8')

processResults.py

''' 
Artificial process.py that illustrates proof of concept. 
This artificial process.py creates a subdirectory called 'results' 
in the current directory and generates a Summary_Total.csv file 
that contains data. 
'''
import csv
import os
data = [['A','B','C','D','E','F','G'],
        [1,1,1,1,1,1,1],
        [2,2,2,2,2,2,2],
        [3,3,3,3,3,3,3]]
if not os.path.exists('results'):
    os.mkdir('results')
with open('results/'+'Summary_Total.csv', "w") as file:
    file_writer = csv.writer(file)
    for row in data:
        file_writer.writerow(row)
asked Nov 23, 2019 at 21:22
Comment (Nov 24, 2019 at 8:39): 1) With "if I have two data files, then each of these will have 5 subdirectories", I guess you meant "two data folders" (N_Test, UN_Test). 2) The processResults.py script has data as a common hardcoded list - but where does the real csv data for each sub-folder come from, and how does it differ?

1 Answer


Copying processResults.py into the subdirectories is unnecessary and clutters up both the code and your disk. Change processResults.py so that it takes the path of the directory to process as an argument. Better still, turn processResults.py into a function and combine the two scripts into one much simpler one. Without the overhead of Popen() or call(), this should also run faster.
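As a rough sketch of that first option (the command-line handling here is my own assumption, not taken from the original scripts), processResults.py could take the target directory as an argument, so that generate_results.py could run it with call(["python3", "processResults.py", sub_dir]) without copying anything:

'''
Sketch: processResults.py that accepts the directory to process as an argument
'''
import csv
import sys
from pathlib import Path

# the directory to process; defaults to the current directory
target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path('.')
results = target / 'results'
results.mkdir(exist_ok=True)

data = [['A','B','C','D','E','F','G'],
        [1,1,1,1,1,1,1],
        [2,2,2,2,2,2,2],
        [3,3,3,3,3,3,3]]

with (results / 'Summary_Total.csv').open('w', newline='') as f:
    csv.writer(f).writerows(data)

The combined single-script version is simpler still: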

''' 
Automates data processing and collection
'''
import csv
from pathlib import Path

def process(directory='.'):
    '''
    Artificial process() function that generates and returns
    dummy Summary_Total data.
    '''
    header = ['A','B','C','D','E','F','G']
    directory = Path(directory).resolve()
    name = directory.parts[-1]
    data = [[name] + [i]*(len(header) - 1) for i in range(1, 4)]
    return [header] + data

def write_csv(path, rows):
    '''
    Boilerplate for writing a list of lists to a csv file
    '''
    with path.open('w', newline='') as f:
        csv.writer(f).writerows(rows)

def generate_results(pattern='*'):
    '''
    Automates processing each data directory under folders matching the glob-style pattern
    and creating individual and collective summary CSV's
    '''
    for folder in Path('.').glob(pattern):
        summaries_csvs = []
        # if the sub_dirs need to be done in order, use sorted(folder.iterdir())
        for sub_dir in folder.iterdir():
            if sub_dir.is_dir():
                sub_dir = sub_dir.resolve()
                summary = process(sub_dir)
                # for the big csv, only include the header from the first summary data
                if not summaries_csvs:
                    summaries_csvs.extend(summary)
                else:
                    summaries_csvs.extend(summary[1:])
                result_dir = sub_dir / 'results'
                result_dir.mkdir(exist_ok=True)
                # this is the individual csv in each data sub_dir
                write_csv(result_dir / 'Summary_Total.csv', summary)
        # this is the collective summary csv in each data folder
        write_csv(folder.with_suffix('.csv'), summaries_csvs)

# process data folders that match the glob pattern
generate_results('Test/[UN]*Test')
answered Dec 6, 2019 at 1:59
