The code below computes a t-statistic within a large dataframe (rnadf) based on masked values from another dataframe (cnvdf_mask). Previously, I was putting my results into a dictionary and turning it into a pandas dataframe. Those dataframes became too large for my memory, so I had to switch to writing the results to a file line by line, as on the last line below.
This code takes a little less than a week to produce one complete file. I have many dataframes to run it on, i.e. many pairs of the DFs rnadf and cnvdf. I have 138 such pairs pickled in a directory, so I was making my script "parallel" by running each pair's t-test calculation at once in its own screen session, for a total of 138 screen sessions. This was a clunky way to do it. When result files reach around 2.6 GB, they fail with an OS Error 30.
I've since moved to a new server with about 1 TB of memory, so memory is no longer a practical constraint. I'm open to two kinds of improvement: 1) make each t-test calculation faster, and 2) parallelize this process to run on many different dataframes at once without using separate screen sessions.
from scipy import stats
import pandas as pd
import numpy as np
import itertools
# sample dataframes
rnadf = pd.DataFrame(np.random.randint(0,100,size=(100, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
cnvdf = pd.DataFrame(np.random.randint(0,1000,size=(100, 25)), columns=list('BCDEFGHIJKLMNOPQRSTUVWXY5'))
cnvdf_mask = cnvdf <= 500
inter = list(set(cnvdf_mask.columns).intersection(rnadf.columns))
cnvdf_mask, rnadf = cnvdf_mask[inter], rnadf[inter]
with open('out_tab.txt', 'w') as f:
    for pr in itertools.product(rnadf.index, cnvdf_mask.index):
        rnaPos = np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna())
        rnaNeg = np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna())
        t, p = stats.ttest_ind(rnaPos, rnaNeg)
        f.write('{}\t{}\t{}\n'.format(t, p, pr))  # changed pr to a tuple of str indices joined by '&'
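That comment refers to a change along these lines (illustrative; the exact formatting in the real script may differ):

f.write('{}\t{}\t{}\n'.format(t, p, '&'.join(map(str, pr))))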
1 Answer
- make each calculated ttest faster

cProfile indicates that ttest() dominates the running time. (Thank you for a reprex, BTW!) I don't see any mathematical tricks that would let us do fewer tests because some of them are identical. Adjusting the permutations= parameter produced no useful effects. So tl;dr: "no".
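For reference, a minimal sketch of how such a profile can be taken on the reprex above; the 200-pair slice is an arbitrary choice, just enough to see where the time goes (cProfile.Profile as a context manager needs Python 3.8+):

import cProfile
import pstats

with cProfile.Profile() as prof:
    # profile a small slice of the pairwise loop from the question
    for pr in itertools.islice(itertools.product(rnadf.index, cnvdf_mask.index), 200):
        rna_pos = np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna())
        rna_neg = np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna())
        stats.ttest_ind(rna_pos, rna_neg)

pstats.Stats(prof).sort_stats('cumulative').print_stats(10)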
- parallelize this process to run on many different dataframes at once

That's straightforward, given that memory is now no object. Your code already has a nice looping structure thanks to itertools.product(). The standard solution uses the multiprocessing module. We'll need a slightly different loop structure.
from multiprocessing import Pool

def ttest(rna_pos, rna_neg, pr):
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)

work = [(np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna()),
         np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna()),
         str(pr),
         )
        for pr in itertools.product(rnadf.index, cnvdf_mask.index)]

with open('out_tab.txt', 'w') as f:
    with Pool() as pool:
        for result in pool.starmap(ttest, work):
            f.write(result)
This will try to burn all cores. Each worker has its own interpreter and its own GIL. It's worth noting that the cost to serialize / deserialize the args and results should be much less than the cost of running ttest(). Delaying big imports until you're down in the child process can also be helpful, as sketched below.
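A rough sketch of that idea using a Pool initializer, reusing the work list built above (the init_worker name is illustrative; initializer= is a standard Pool argument, and the payoff is mainly for imports the parent itself doesn't need):

def init_worker():
    # runs once per child process; keeps the heavy import out of the parent
    global stats
    from scipy import stats

def ttest(rna_pos, rna_neg, pr):
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)

with open('out_tab.txt', 'w') as f:
    with Pool(initializer=init_worker) as pool:
        for result in pool.starmap(ttest, work):
            f.write(result)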
If you want to use the CPUs of several hosts, then Dask or Vaex are happy to help.
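As a rough sketch of the Dask direction using dask.bag (the npartitions value is arbitrary; spreading work across several hosts would additionally need a dask.distributed cluster and Client, not shown here, and with the default local scheduler this behaves much like the Pool version):

import dask.bag as db

def run_pair(pr):
    # same per-pair work as the loop above, returned as one output line
    rna_pos = np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna())
    rna_neg = np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna())
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)

pairs = db.from_sequence(list(itertools.product(rnadf.index, cnvdf_mask.index)),
                         npartitions=64)
results = pairs.map(run_pair).compute()

with open('out_tab.txt', 'w') as f:
    f.writelines(results)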
Comment (Toby Speight, Jan 8, 2023): I'll note that a minimal example (your link) is a Stack Overflow thing - for review, code must be complete rather than minimal, and should be real rather than an example. But yes, runnable code makes for better reviews.
Comment: OS Error 30 gives it away: OSError: [Errno 30] Read-only file system. This shouldn't have any relation to the size of the file, except that you're on a server. Your server likely puts a limit on the max size of the /tmp folder per user. Possibly you're running over the limit of your /tmp folder and you didn't have write permissions on whatever comes after that.