The first step in solving a problem like this is to find out what the actual bottleneck is. In your case the two most likely candidates are disk IO and CPU power.
Disk IO
- Make sure the data sits on local drives and not on a network share or USB stick or similar.
- Make sure your disks are reasonably fast (SSDs might help).
- Put your input file and your output file on two different hard drives.
- `mmap` might gain you some speed in reading and/or writing the data. At least in this case it seemed to make a difference.
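A minimal sketch of reading a file line by line through `mmap` (the file name and the `handle` function are placeholders for your actual input and processing):

```python
import mmap

def handle(line: bytes) -> None:
    # Placeholder for your actual per-line processing.
    pass

with open("input.txt", "rb") as f:
    # Map the whole file into memory; the OS pages it in on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for line in iter(mm.readline, b""):  # readline returns b"" at EOF
            handle(line)
```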
CPU
- There are several calculations you perform for each line of the input file, but the values involved seem to be static (unless the script you posted is incomplete). `Var_A = float(1) / float(21-3) # SAMPLESIZE - 3` looks like a constant to me which can be calculated once.
- The same goes for `Var_B`.
- From these two a bunch of other variables are calculated which would also be static: `WT_A`, `WT_B`, `SumofWT`, `Var`, `SE`, `WT_POWER_A`, `WT_POWER_B`, `SumofWT_POWER`, and `c`.
- This might not make a huge impact, but recomputing them for every line is still redundant; a sketch of hoisting them out of the loop follows below.
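A minimal sketch of that hoisting, assuming the script reads the input line by line (the derived variables are left as placeholders since their full formulas aren't in the snippet):

```python
SAMPLESIZE = 21  # assumption: the sample size behind the hard-coded 21

# Loop-invariant values, computed once before the loop.
Var_A = 1.0 / (SAMPLESIZE - 3)
Var_B = 1.0 / (SAMPLESIZE - 3)  # assumption: derived the same way as Var_A
# ... compute WT_A, WT_B, SumofWT, Var, SE, etc. here, once ...

with open("input.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        # Only the genuinely line-dependent work remains inside the loop.
        outfile.write(line)  # placeholder for the real per-line output
```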
- As each line seems to be calculated independently, this code would be a prime example for parallelization. I don't have much experience with Python and parallel programming, but `concurrent.futures.ProcessPoolExecutor` seems like a good start. Basically, use as many processes as you have cores and split your input into chunks to be distributed to the processes. You'd have to collect the results (and presumably make sure you write them out in the correct order), so this will make your code a bit more complicated, but it has the potential to speed up your calculations by close to a factor of N (where N is the number of CPU cores), provided disk IO is not killing you. A sketch follows after this list.
- You could also split the input, distribute it to other computers in the office, have the processing done there, and collect the results.