The code below computes a t-statistic within a large dataframe (rnadf) based on masked values from another dataframe (cnvdf_mask). Previously, I was putting my results into a dictionary and turning it into a pandas dataframe. Those dataframes became too large for my memory, so I had to switch to writing the results to a file line by line, as on the last line below.
This code takes a little less than a week to produce one complete file. I have many dataframes to run it on, i.e. many pairs of the DFs rnadf and cnvdf. I have 138 such pairs pickled in a directory, so I was making my script "parallel" by running each pair's t-test calculation at once in its own screen session, for a total of 138 screen sessions. This was a clunky way to do it. When result files reach around 2.6 GB, they fail with an OS Error 30.
I've since moved to a new server with about 1 TB of memory, so memory is no longer a practical constraint. I'm open to two kinds of improvement: 1) make each t-test calculation faster, and 2) parallelize this process to run on many different dataframes at once without using separate screen sessions.
from scipy import stats
import pandas as pd
import numpy as np
import itertools
# sample dataframes
rnadf = pd.DataFrame(np.random.randint(0,100,size=(100, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
cnvdf = pd.DataFrame(np.random.randint(0,1000,size=(100, 25)), columns=list('BCDEFGHIJKLMNOPQRSTUVWXY5'))
cnvdf_mask = cnvdf <= 500
inter = list(set(cnvdf_mask.columns).intersection(rnadf.columns))
cnvdf_mask, rnadf = cnvdf_mask[inter], rnadf[inter]
with open('out_tab.txt', 'w') as f:
    for pr in itertools.product(rnadf.index, cnvdf_mask.index):
        rnaPos = np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna())
        rnaNeg = np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna())
        t, p = stats.ttest_ind(rnaPos, rnaNeg)
        f.write('{}\t{}\t{}\n'.format(t, p, pr))  # changed pr to a tuple of str indices joined by '&'
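That comment refers to a change along these lines (illustrative; the exact formatting in the real script may differ):

f.write('{}\t{}\t{}\n'.format(t, p, '&'.join(map(str, pr))))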
1 Answer
- make each calculated ttest faster

cProfile indicates that ttest() dominates the running time. (Thank you for a reprex, BTW!) I don't see any mathematical tricks that would let us do fewer tests because some of them are identical. Adjusting the permutations= parameter produced no useful effects. So tl;dr: "no".
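For reference, a minimal sketch of how such a profile can be taken on the reprex above; the 200-pair slice is an arbitrary choice, just enough to see where the time goes (cProfile.Profile as a context manager needs Python 3.8+):

import cProfile
import pstats

with cProfile.Profile() as prof:
    # profile a small slice of the pairwise loop from the question
    for pr in itertools.islice(itertools.product(rnadf.index, cnvdf_mask.index), 200):
        rna_pos = np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna())
        rna_neg = np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna())
        stats.ttest_ind(rna_pos, rna_neg)

pstats.Stats(prof).sort_stats('cumulative').print_stats(10)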
- parallelize this process to run on many different dataframes at once

That's straightforward, given that memory is now no object. Your code already has a nice looping structure thanks to itertools.product(). The standard solution uses the multiprocessing module. We'll need a slightly different loop structure.
from multiprocessing import Pool

def ttest(rna_pos, rna_neg, pr):
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)

work = [(np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna()),
         np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna()),
         str(pr),
         )
        for pr in itertools.product(rnadf.index, cnvdf_mask.index)]

with open('out_tab.txt', 'w') as f:
    with Pool() as pool:
        for result in pool.starmap(ttest, work):
            f.write(result)
This will try to burn all cores. Each worker has its own interpreter and its own GIL. It's worth noting that the cost to serialize / deserialize the args and results should be much less than the cost of running ttest(). Delaying big imports until you're down in the child process can also be helpful, as sketched below.
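A rough sketch of that idea using a Pool initializer, reusing the work list built above (the init_worker name is illustrative; initializer= is a standard Pool argument, and the payoff is mainly for imports the parent itself doesn't need):

def init_worker():
    # runs once per child process; keeps the heavy import out of the parent
    global stats
    from scipy import stats

def ttest(rna_pos, rna_neg, pr):
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)

with open('out_tab.txt', 'w') as f:
    with Pool(initializer=init_worker) as pool:
        for result in pool.starmap(ttest, work):
            f.write(result)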
If you want to use the CPUs of several hosts, then Dask or Vaex are happy to help.
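As a rough sketch of the Dask direction using dask.bag (the npartitions value is arbitrary; spreading work across several hosts would additionally need a dask.distributed cluster and Client, not shown here, and with the default local scheduler this behaves much like the Pool version):

import dask.bag as db

def run_pair(pr):
    # same per-pair work as the loop above, returned as one output line
    rna_pos = np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna())
    rna_neg = np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna())
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)

pairs = db.from_sequence(list(itertools.product(rnadf.index, cnvdf_mask.index)),
                         npartitions=64)
results = pairs.map(run_pair).compute()

with open('out_tab.txt', 'w') as f:
    f.writelines(results)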
Comment (Toby Speight, Jan 8, 2023): I'll note that a minimal example (your link) is a Stack Overflow thing - for review, code must be complete rather than minimal, and should be real rather than an example. But yes, runnable code makes for better reviews.
Comment: OS Error 30 gives it away: OSError: [Errno 30] Read-only file system. This shouldn't have any relation to the size of the file, except that you're on a server. Your server likely puts a limit on the max size of the /tmp folder per user. Possibly you're running over the limit of your /tmp folder and you didn't have write permissions on whatever comes after that.