I'm new to coding and have written a script from scratch to compute an auto-correlation function. It's too slow, and I have to run it on a few thousand files over the next few days. I'm sure it's terribly written (you can see my brain ticking through it in very basic ways) and could be faster.
I know that Python has an inbuilt autocorrelation function, but I think it's restricted to the sense in which the word is used in physics. This is for checking residency times of a molecule in a biomolecule simulation, and I couldn't see how to apply the default.
import sys
import time

start = time.time()
fileName = sys.argv[1]

# 'times' rather than 'time' so the time module isn't shadowed
times = []
water = []
# make two 1d arrays for time and water
with open(fileName, "r") as input:
    for line in input:
        times.append(line.split()[0])
        water.append(line.split()[1])

# define a totaltimescale in picoseconds - the minus 1 is
# because of the zero index
# windowsize is 2 ns
totaltimescale = 50000 - 1
windowsize = 2000
# endtau should take into consideration the multiplier
# below, so 400 with a multiplier of 5 gives us tau values
# of up to about 2000 ps
endtau = 400
alltau = range(1, endtau)

for tau in alltau:
    # there is a tau multiplier (i.e. use every 5th tau
    # or every 5 picoseconds to speed up calculation)
    tau = 5*tau
    print tau
    # these three values just start counters
    counter = 0
    noncounter = 0
    output = 0
    # to set up a value for total number of windows we're interested in,
    # leave off 2 times the windowsize at the end arbitrarily so it doesn't
    # jump too far, may get errors eventually for this so may have to be wary
    numberofwindows = totaltimescale - 2*windowsize
    listofwindowstartingpoints = range(0, numberofwindows)
    # outer iterator which ensures the window (of size start - start + windowsize)
    # is moved down one unit each time, i.e. starts at the next picosecond
    for startingpoint in listofwindowstartingpoints:
        window = range(startingpoint, startingpoint + windowsize)
        for a in window:
            if a + tau < totaltimescale:
                if water[a] == water[a + tau]:
                    counter = float(counter + 1)
                else:
                    noncounter = noncounter + 1
                totaljumpsanalysed = float(counter + noncounter)
                output = counter/totaljumpsanalysed
            else:
                pass
    # counters should have kept values
    print counter
    print totaljumpsanalysed
    print output
    with open(fileName + '.autocorrelate', "a") as outfile:
        outfile.writelines('{0} {1}{2}'.format(tau, output, '\n'))

print 'It took', time.time()-start, 'seconds.'
It's running on files that look like this (the time column, the first one, goes up to 50000; the second column is the index of the molecule that goes into the spot):
1.0 24561 0.1255
2.0 24561 0.1267
3.0 -1 0.2265
4.0 24561 0.1095
5.0 24561 0.1263
6.0 -1 0.1598
7.0 -1 0.2024
8.0 -1 0.1798
9.0 -1 0.1823
10.0 24409 0.1291
11.0 24409 0.1288
12.0 -1 0.1537
13.0 -1 0.1853
14.0 -1 0.1625
15.0 24561 0.1300
16.0 24561 0.1342
17.0 24221 0.1294
18.0 24561 0.1165
19.0 24561 0.0611
20.0 24561 0.1259
21.0 6345 0.1284
22.0 -1 0.1421
23.0 -1 0.1435
24.0 -1 0.2599
25.0 -1 0.2659
26.0 -1 0.1961
27.0 24213 0.1398
28.0 24213 0.1325
29.0 -1 0.1809
30.0 -1 0.2044
The output typically looks something like this, where the first column is tau and the second is 'output', the probability that the molecule with the same index is still there a time tau later:
5 0.674404921846
10 0.59412323094
15 0.543056729494
20 0.503867421031
25 0.470009065414
30 0.442317789517
35 0.417826474489
40 0.398214015522
45 0.379222841801
50 0.360768038436
55 0.344779093024
60 0.330382117003
65 0.316864844888
70 0.306959596948
75 0.29386854062
- For a beginner this is incredibly well-written, clean code. I see very few points that can be trivially improved, besides the suggestions made in the answers below. – Konrad Rudolph, Aug 8, 2012 at 14:19
2 Answers
My first suggestion is to look into parallelization. I believe your problem is one that can be made embarrassingly parallel. Each iteration of your loop is independent of each other iteration, so you could run all of them simultaneously. With sufficiently beefy hardware you could get an increase in efficiency proportional to len(alltau) * len(window).
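One way to go about this (a minimal sketch only, not the poster's code): parallelise just the outer tau loop with multiprocessing. The function name compute_for_tau is hypothetical; water, totaltimescale, windowsize and endtau are assumed to be set up exactly as in the original script, and on Linux the worker processes inherit them when the pool forks.

from multiprocessing import Pool

def compute_for_tau(tau):
    # same counting logic as the original inner loops, for a single tau
    counter = 0
    totaljumpsanalysed = 0
    numberofwindows = totaltimescale - 2 * windowsize
    for startingpoint in xrange(numberofwindows):
        for a in xrange(startingpoint, startingpoint + windowsize):
            if a + tau < totaltimescale:
                if water[a] == water[a + tau]:
                    counter += 1
                totaljumpsanalysed += 1
    return tau, float(counter) / totaljumpsanalysed

if __name__ == '__main__':
    pool = Pool()                          # one worker per CPU core by default
    taus = [5 * t for t in range(1, endtau)]
    for tau, output in pool.map(compute_for_tau, taus):
        print tau, output

Each tau is completely independent, so Pool.map can hand them out to the workers in any order; the results come back in the original order, ready to be written to the output file as before.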
Also, you're doing some unnecessary work in your big loop.
- You're calculating output fifty thousand times, when you could just calculate it exactly once after the loop ends.
- You can keep track of totaljumpsanalysed by incrementing it once per loop. No noncounter needed.
- You don't necessarily need to convert counter to a float. Remember, Python supports integers of arbitrary size, so you don't need to worry about overflow.
- else: pass does nothing, so you can remove it.
totaljumpsanalysed = 0
for startingpoint in listofwindowstartingpoints:
    window = range(startingpoint, startingpoint + windowsize)
    for a in window:
        if a + tau < totaltimescale:
            if water[a] == water[a + tau]:
                counter += 1
            totaljumpsanalysed += 1
output = float(counter)/float(totaljumpsanalysed)
- Thanks very much for the help! Now I'm getting an overflow error though:
Traceback (most recent call last):
  File "autocorrelationbatchtesting_08_08_2012_codereview.py", line 54, in <module>
    if water[a] == water[a + tau]:
IndexError: list index out of range
It happens at 40000 now, I wrote a temp file to check, and I can't think why that would be given there is the if clause in there? – cc211, Aug 8, 2012 at 15:16
- How many lines are in your input file? If len(water) is less than totaltimescale, then your if guard won't prevent you from going beyond the end of the list. – Kevin, Aug 8, 2012 at 15:34
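If that turns out to be the case, one possible fix (a sketch only, not from the original thread) is to bound the check with the actual length of the list rather than the assumed totaltimescale:

n = len(water)                      # actual number of lines read from the file
for a in window:
    if a + tau < n:                 # can never index past the end of water
        if water[a] == water[a + tau]:
            counter += 1
        totaljumpsanalysed += 1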
If you are dead-set on writing your own AC function, there is one suggestion that is better than any of those below or those by Kevin in the other post: USE NUMPY. The code is begging for the speedups from numpy's indexing properties (see the sketch after the list below). If pure Python speed is a concern, there are a couple of simple things to note:
- Since it looks like Python < 3.0, change range -> xrange. The former creates the list, the latter creates an iterator.
- a + tau is computed multiple times every cycle of the inner loop.
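For the numpy route, a minimal sketch of the same residency comparison is below. The function name autocorrelate is illustrative, and the shifted-slice comparison averages every time origin once rather than reproducing the sliding-window bookkeeping of the original script, but it estimates the same probability and replaces the two inner Python loops with one vectorised operation per tau.

import numpy as np

def autocorrelate(water, taus):
    water = np.asarray(water)
    results = []
    for tau in taus:
        # element-wise comparison of the occupant index with itself shifted by tau
        same = water[:-tau] == water[tau:]
        results.append((tau, same.mean()))
    return results

Example usage, assuming the column of molecule indices is loaded with np.loadtxt:

data = np.loadtxt(fileName)
water = data[:, 1].astype(int)
for tau, p in autocorrelate(water, range(5, 2000, 5)):
    print tau, p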