This program takes a huge data set as input, processes it, performs the calculations, and writes the output to an array. Most of the calculations are quite simple, such as summation. The input file has about 100 million rows and 3 columns:
- the first column is the gene name (about 100 million in total)
- the second column is one value for that gene
- the third column is another value for that gene
The problem I face is a long runtime. How can I reduce it?
I need to write all the new values I calculate (from GenePair to RM_p_val, with a header) to the new file.
import math
import sys
sys.path.append('/tools/lib/python2.7/site-packages')
import numpy as np
import scipy.stats

fi = open('1.txt')
fo = open('2.txt', 'w')

for line in fi:  # iterating the file object streams one line at a time (xreadlines() is deprecated)
    tmp = line.split('\t')
    GenePair = tmp[0].strip()
    PCC_A = float(tmp[1].strip())
    PCC_B = float(tmp[2].strip())

    # Fisher z-transform of both correlation values
    ZVAL_A = 0.5 * math.log((1 + PCC_A) / (1 - PCC_A))
    ZVAL_B = 0.5 * math.log((1 + PCC_B) / (1 - PCC_B))
    ABS_ZVAL_A = abs(ZVAL_A)
    ABS_ZVAL_B = abs(ZVAL_B)

    Var_A = 1.0 / (21 - 3)  # SAMPLESIZE - 3
    Var_B = 1.0 / (18 - 3)  # SAMPLESIZE - 3
    WT_A = 1 / Var_A
    WT_B = 1 / Var_B
    ZVAL_A_X_WT_A = ZVAL_A * WT_A
    ZVAL_B_X_WT_B = ZVAL_B * WT_B
    SumofWT = WT_A + WT_B
    SumofZVAL_X_WT = ZVAL_A_X_WT_A + ZVAL_B_X_WT_B

    # FIXED MODEL
    meanES = SumofZVAL_X_WT / SumofWT
    Var = 1.0 / SumofWT
    SE = math.sqrt(Var)
    LL = meanES - (1.96 * SE)
    UL = meanES + (1.96 * SE)  # upper limit uses +, not -
    z_score = meanES / SE
    p_val = scipy.stats.norm.sf(z_score)

    # CAL
    ES_POWER_X_WT_A = pow(ZVAL_A, 2) * WT_A
    ES_POWER_X_WT_B = pow(ZVAL_B, 2) * WT_B
    WT_POWER_A = pow(WT_A, 2)
    WT_POWER_B = pow(WT_B, 2)
    SumofES_POWER_X_WT = ES_POWER_X_WT_A + ES_POWER_X_WT_B
    SumofWT_POWER = WT_POWER_A + WT_POWER_B

    # COMPUTE TAU
    tmp_A = ZVAL_A - meanES
    tmp_B = ZVAL_B - meanES
    temp = pow(SumofZVAL_X_WT, 2)
    Q = SumofES_POWER_X_WT - (temp / SumofWT)
    if PCC_A != 0 or PCC_B != 0:
        df = 0
    else:
        df = 1
    c = SumofWT - (pow(SumofWT, 2) / SumofWT)
    if c == 0:
        tau_square = 0
    else:
        tau_square = (Q - df) / c

    # RANDOM MODEL
    Var_total_A = Var_A + tau_square
    Var_total_B = Var_B + tau_square
    WT_total_A = 1.0 / Var_total_A
    WT_total_B = 1.0 / Var_total_B
    ZVAL_X_WT_total_A = ZVAL_A * WT_total_A
    ZVAL_X_WT_total_B = ZVAL_B * WT_total_B
    Sumoftotal_WT = WT_total_A + WT_total_B
    Sumoftotal_ZVAL_X_WT = ZVAL_X_WT_total_A + ZVAL_X_WT_total_B

    RM_meanES = Sumoftotal_ZVAL_X_WT / Sumoftotal_WT
    RM_Var = 1.0 / Sumoftotal_WT
    RM_SE = math.sqrt(RM_Var)
    RM_LL = RM_meanES - (1.96 * RM_SE)
    RM_UL = RM_meanES + (1.96 * RM_SE)
    RM_z_score = RM_meanES / RM_Var
    RM_p_val = scipy.stats.norm.sf(RM_z_score)
The first step in solving a problem like this is to find out what the actual bottleneck is. In your case the two most likely candidates are disk IO and CPU power.
Disk IO
- Make sure the data sits on local drives and not on a network share or USB stick or similar.
- Make sure your disks are reasonably fast (SSDs might help).
- Put your input file and your output file on two different hard drives.
- Using mmap might gain you some speed in reading and/or writing the data. At least in this case it seemed to make a difference; a rough sketch follows below.
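For example, a minimal sketch of reading the input through mmap instead of plain buffered reads, assuming the tab-separated 1.txt from the question. Whether this actually helps depends on your OS, disk, and access pattern, so it is only worth trying after measuring:

import mmap

with open('1.txt', 'rb') as f:
    # Let the OS page the file in; readline() still yields one row at a time.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for raw in iter(mm.readline, b''):
        fields = raw.rstrip(b'\n').split(b'\t')
        # ... same per-line calculation as in the question ...
    mm.close()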
CPU
- There are several calculations you perform for each line in the input file, but those numbers seem to be static (unless the script you posted is incomplete). Var_A = float(1) / float(21-3) #SAMPLESIZE - 3 looks like a constant to me which can be calculated once. The same goes for Var_B.
- From these two, a bunch of other variables are calculated which would also be static: WT_A, WT_B, SumofWT, Var, SE, WT_POWER_A, WT_POWER_B, SumofWT_POWER, c.
- This might not make a huge impact, but it still seems redundant; the sketch after this list hoists those values out of the loop.
- As each line seems to be calculated independently, this code would be a prime example for parallelization. I don't have much experience with Python and parallel programming, but ProcessPoolExecutor seems like a good start. Basically, use as many processes as you have cores and split your input into chunks to be distributed to the processes. You'd have to collect the results (and presumably make sure you write them out in the correct order), so this will make your code a bit more complicated, but it has the potential to speed up your calculations by close to a factor of N (where N is the number of CPU cores), provided disk IO is not killing you. A rough sketch follows after this list.
- You could also split the input, distribute it to other computers in the office, get the processing done there, and collect the results.
You import numpy but don't take advantage of its vectorized operations. See What is NumPy? to get some ideas.
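Building on that comment, a minimal sketch of what a vectorized version could look like: read the columns with numpy and apply the Fisher transform and the normal survival function to whole arrays at once. The file name comes from the question, but loading everything in one go is an assumption; with 100 million rows you would more realistically process the file in large blocks (e.g. pandas.read_csv with chunksize):

import numpy as np
from scipy.stats import norm

# Sketch only: np.loadtxt on a 100-million-row file needs a lot of RAM,
# so in practice you would read and process the file in big slices.
genes = np.loadtxt('1.txt', delimiter='\t', usecols=(0,), dtype=str)
pcc = np.loadtxt('1.txt', delimiter='\t', usecols=(1, 2))

zval = np.arctanh(pcc)                           # same as 0.5 * log((1+r)/(1-r)), for every row at once
var_a, var_b = 1.0 / (21 - 3), 1.0 / (18 - 3)    # SAMPLESIZE - 3, as in the question
wt = np.array([1.0 / var_a, 1.0 / var_b])

mean_es = (zval * wt).sum(axis=1) / wt.sum()     # fixed-model mean effect size per gene pair
se = np.sqrt(1.0 / wt.sum())
p_val = norm.sf(mean_es / se)                    # vectorized survival function over all rows

The random-model columns follow the same pattern, and the results can then be written out in one go with np.savetxt instead of line-by-line string formatting.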