The purpose of this code is to simulate epidemics across a population. There are 625 (pop) individuals at random locations. The epidemic parameters are the infectious period (inf_period), trans (the transmissibility of the disease, essentially its virulence), susc (the susceptibility of each individual to the disease), and eps (epsilon, the probability of an individual becoming infectious spontaneously rather than through contact with an infectious person). The argument 'reps' is the number of times to simulate one set of epidemic parameters, that is, one set of [susc, trans, inf_period, eps].
In this example, there are 24 possible combinations of parameter values, and we want 400 reps per combination, so 24*400 = 9600 runs. Those values cannot change. To make this code faster, how can the number of loops be reduced (I've heard those are slow)?
This has many loops and if statements, and running the full version will take roughly 2.5 days. How can it be made more efficient in terms of time? I know that may be vague, so if there's a way I can clarify, please let me know! I should also mention I have access to a GPU.
import numpy as np
from scipy import spatial
import json

def fun(susc, trans, inf_period, eps, reps, pop):
        epi_list = []
        count_list = []
        new_susc = []
        new_trans = []
        new_inf_period = []
        new_eps = []
        count = 0
        epi_file = "file1.json"
        count_file = "file2.json"
        with open(epi_file, 'w') as f, open(count_file, 'w') as h:
                for i in range(len(trans)):
                        for j in inf_period:
                                for k in eps:
                                        should_restart = True
                                        while should_restart:
                                                should_restart = False
                                                broken = False
                                                count_2 = 0
                                                for rep in reps:
                                                        failcount = 0
                                                        g1 = external_function_call(pop, susc, trans[i], j, k, full_mat)
                                                        while(len(g1.keys()) < 10 or np.max(g1.values()) < 10):
                                                                failcount += 1
                                                                if failcount > 50:
                                                                        trans[i] += 1
                                                                        broken = True
                                                                        break
                                                                g1 = external_function_call(pop, susc, trans[i], j, k, full_mat) #run again with new i, rep times
                                                        if not broken:
                                                                g2 = inf_per_count_time(g1)
                                                                count += 1
                                                                epi_list.append(g1) #if the first epi in the reps works, but the subsequent ones do not, still writes. Bad!
                                                                count_list.append(g2)
                                                                new_susc.append(susc)
                                                                new_trans.append(trans[i])
                                                                new_inf_period.append(j)
                                                                new_eps.append(k)
                                                        else: #start from rep
                                                                should_restart = True
                                                                if rep > 0: #if we've already written an epidemic using this set of parameters
                                                                        for i in range(rep-1, -1, -1):
                                                                                del epi_list[i]
                                                                                del count_list[i]
                                                                                del new_susc[i]
                                                                                del new_trans[i]
                                                                                del new_inf_period[i]
                                                                                del new_eps[i]
                                                                        count -= 1
                                                                break
                                                else:
                                                        break
        paras = np.array([np.asarray(new_susc), np.asarray(new_trans), np.asarray(new_inf_period), np.asarray(new_eps)]).T
        print 'number of parameter rows', paras[:,0].shape
        with open('parameters.txt', 'w') as newfile1:
                np.savetxt(newfile1, paras, fmt = ['%f', '%f', '%f', '%f'])
        print count

if __name__ == "__main__":
        pop = 625
        susc = 0.3
        trans = [1.5, 2.5, 3]
        inf_period = [2, 3]
        eps = [0, 0.01, 0.02, 0.05]
        reps = np.arange(400)
        fun(susc, trans, inf_period, eps, reps, pop)
2 Answers
78 characters of indentation at its deepest: this code is unreadable. We can't easily match the core of the code with the definition of the parameters.
To improve that, you can:
- use 4 spaces per indentation level instead of 8, as recommended by PEP 8;
- use `itertools.product` to iterate over all the combinations of parameters in one single loop instead of three;
- remove unused variable declarations, such as your `open`s;
- use the `break ... else` construct, which can be applied to any loop; this will save you the `broken` flag;
- use slice deletion rather than deleting items one by one in a `for` loop (plus it will be more efficient).
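As a quick sketch of that last point (illustrative list contents, not the actual epidemic data), slice deletion removes the whole range in one statement:

```python
# Deleting one element at a time, walking backwards so the
# remaining indices do not shift under us:
items = list(range(10))
for i in range(2, -1, -1):
    del items[i]

# The same effect with a single slice deletion:
other = list(range(10))
del other[0:3]

assert items == other == [3, 4, 5, 6, 7, 8, 9]
```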
This leads to the more readable version:
import numpy as np
from scipy import spatial
import json
import itertools

def fun(susc, trans, inf_period, eps, reps, pop):
    epi_list = []
    count_list = []
    new_susc = []
    new_trans = []
    new_inf_period = []
    new_eps = []
    count = 0
    for i, j, k in itertools.product(range(len(trans)), inf_period, eps):
        should_restart = True
        while should_restart:
            should_restart = False
            for rep in reps:
                failcount = 0
                g1 = external_function_call(pop, susc, trans[i], j, k, full_mat)
                while(len(g1.keys()) < 10 or np.max(g1.values()) < 10):
                    failcount += 1
                    if failcount > 50:
                        trans[i] += 1
                        break
                    g1 = external_function_call(pop, susc, trans[i], j, k, full_mat) #run again with new i, rep times
                else:
                    g2 = inf_per_count_time(g1)
                    count += 1
                    epi_list.append(g1) #if the first epi in the reps works, but the subsequent ones do not, still writes. Bad!
                    count_list.append(g2)
                    new_susc.append(susc)
                    new_trans.append(trans[i])
                    new_inf_period.append(j)
                    new_eps.append(k)
                    continue
                # Cleanup because we failed too many times
                should_restart = True # restart from rep
                deletion_range = slice(0, rep, 1)
                del epi_list[deletion_range]
                del count_list[deletion_range]
                del new_susc[deletion_range]
                del new_trans[deletion_range]
                del new_inf_period[deletion_range]
                del new_eps[deletion_range]
                if rep > 0: #if we've already written an epidemic using this set of parameters
                    count -= 1
                break
    paras = np.array([np.asarray(new_susc), np.asarray(new_trans), np.asarray(new_inf_period), np.asarray(new_eps)]).T
    print 'number of parameter rows', paras[:,0].shape
    with open('parameters.txt', 'w') as newfile1:
        np.savetxt(newfile1, paras, fmt = ['%f', '%f', '%f', '%f'])
    print count

if __name__ == "__main__":
    pop = 625
    susc = 0.3
    trans = [1.5, 2.5, 3]
    inf_period = [2, 3]
    eps = [0, 0.01, 0.02, 0.05]
    reps = np.arange(400)
    fun(susc, trans, inf_period, eps, reps, pop)
Now we can start thinking a bit about the code.
First off, you don't need to write the call to `external_function_call` twice, especially with the same set of parameters. It is more idiomatic to use `while True: <call>; if <condition>: break` rather than `<call>; while <condition>: <call>`. This also lets you handle the successful case within that `if`, rather than with your `broken` flag.
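A sketch of the two shapes, with a hypothetical `attempt()` and `acceptable()` standing in for the real call and the size checks:

```python
import random

def attempt():
    # stand-in for external_function_call
    return random.randint(0, 100)

def acceptable(result):
    # stand-in for len(g1) >= 10 and max(g1.values()) >= 10
    return result >= 50

# Original shape: the call is written twice.
result = attempt()
while not acceptable(result):
    result = attempt()

# More idiomatic shape: a single call site, and the success
# case can be handled right inside the if before breaking.
while True:
    result = attempt()
    if acceptable(result):
        break

assert acceptable(result)
```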
In this test, you can also take the `len` of `g1` directly; it's equivalent to using `len(g1.keys())`. And since `g1` seems to be a regular Python dictionary, there is no need to involve `numpy` there: Python already has a `max` builtin.
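A minimal check of both equivalences on a plain dict:

```python
g1 = {0: 12, 1: 25, 2: 7}

# len() on a dict counts the keys, so len(g1.keys()) is redundant:
assert len(g1) == len(g1.keys()) == 3

# The max builtin handles the values directly, no numpy required:
assert max(g1.values()) == 25
```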
The fail count could also be better handled with a `for` loop and a named constant:
import numpy as np
from scipy import spatial
import json
import itertools

MAX_FAILED_ATTEMPS = 50

def fun(susc, trans, inf_period, eps, reps, pop):
    epi_list = []
    count_list = []
    new_susc = []
    new_trans = []
    new_inf_period = []
    new_eps = []
    count = 0
    for i, j, k in itertools.product(range(len(trans)), inf_period, eps):
        should_restart = True
        while should_restart:
            should_restart = False
            for rep in reps:
                for _ in range(MAX_FAILED_ATTEMPS):
                    g1 = external_function_call(pop, susc, trans[i], j, k, full_mat)
                    if len(g1) >= 10 and max(g1.values()) >= 10:
                        g2 = inf_per_count_time(g1)
                        count += 1
                        epi_list.append(g1) #if the first epi in the reps works, but the subsequent ones do not, still writes. Bad!
                        count_list.append(g2)
                        new_susc.append(susc)
                        new_trans.append(trans[i])
                        new_inf_period.append(j)
                        new_eps.append(k)
                        break
                else:
                    trans[i] += 1
                    # Cleanup because we failed too many times
                    should_restart = True # restart from rep
                    deletion_range = slice(0, rep, 1)
                    del epi_list[deletion_range]
                    del count_list[deletion_range]
                    del new_susc[deletion_range]
                    del new_trans[deletion_range]
                    del new_inf_period[deletion_range]
                    del new_eps[deletion_range]
                    if rep > 0: #if we've already written an epidemic using this set of parameters
                        count -= 1
                    break
    paras = np.array([np.asarray(new_susc), np.asarray(new_trans), np.asarray(new_inf_period), np.asarray(new_eps)]).T
    print 'number of parameter rows', paras[:,0].shape
    with open('parameters.txt', 'w') as newfile1:
        np.savetxt(newfile1, paras, fmt = ['%f', '%f', '%f', '%f'])
    print count

if __name__ == "__main__":
    pop = 625
    susc = 0.3
    trans = [1.5, 2.5, 3]
    inf_period = [2, 3]
    eps = [0, 0.01, 0.02, 0.05]
    reps = np.arange(400)
    fun(susc, trans, inf_period, eps, reps, pop)
Now, looking at this rewrite and the comment that was associated with the second call to `external_function_call`, it seems unlikely that this loop is doing any good. No parameters are updated between the various calls, so if the call fails once, it will fail 50 times, slowing down the whole thing unnecessarily. Unless you meant to `trans[i] += 1` before each new call; or unless `external_function_call` relies on some form of randomness.
Another thing that bothers me in your code is how fragile the handling of the number of repetitions (`reps`) is. You seem to rely on it always starting at 0. But as it is written, I can pass any range, like `range(5000, 5800, 2)`, to get 400 repetitions, and that is not necessarily something that will start at 0.
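For instance (made-up bounds), here is a range of the right length whose items are useless as list indices, which is exactly what the cleanup code uses `rep` for:

```python
reps = range(5000, 5800, 2)

assert len(reps) == 400  # 400 repetitions as requested...
assert reps[0] == 5000   # ...but rep starts at 5000, not 0
```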
Most importantly, you could have had some combinations of parameters that ran for each `rep`, say the first 2, so you would already have 800 results in your arrays. But all of a sudden, the third set of parameters fails 50 times at `rep = 40`. So you are deleting elements 39 down to 0 in your arrays... Wait, what? Why? These are results of previous sets of parameters, they are deemed valid; why on earth should we delete them and keep the 40 last results that we know should be restarted from `rep = 0`?
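A small sketch of the two deletion directions, with made-up strings standing in for stored results:

```python
# Three results from previous parameter sets, then two from the
# current (failing) set:
results = ['old1', 'old2', 'old3', 'cur1', 'cur2']
rep = 2  # repetitions already stored for the current set

# What the del-from-index-0 cleanup does: drops the oldest,
# perfectly valid entries.
front = results[:]
del front[0:rep]
assert front == ['old3', 'cur1', 'cur2']

# What the cleanup should do: drop the current set's entries.
back = results[:]
del back[-rep:]
assert back == ['old1', 'old2', 'old3']
```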
In the same vein, I don't understand why you `count -= 1` if `rep` is over 0, instead of `count -= rep` every time.
import numpy as np
from scipy import spatial
import json
import itertools

MAX_FAILED_ATTEMPS = 50

def fun(susc, trans, inf_period, eps, repetitions, pop):
    epi_list = []
    count_list = []
    new_susc = []
    new_trans = []
    new_inf_period = []
    new_eps = []
    count = 0
    for i, j, k in itertools.product(range(len(trans)), inf_period, eps):
        while True:
            for rep in range(repetitions):
                for _ in range(MAX_FAILED_ATTEMPS):
                    g1 = external_function_call(pop, susc, trans[i], j, k, full_mat)
                    if len(g1) >= 10 and max(g1.values()) >= 10:
                        g2 = inf_per_count_time(g1)
                        count += 1
                        epi_list.append(g1) #if the first epi in the reps works, but the subsequent ones do not, still writes. Bad!
                        count_list.append(g2)
                        new_susc.append(susc)
                        new_trans.append(trans[i])
                        new_inf_period.append(j)
                        new_eps.append(k)
                        break
                else:
                    trans[i] += 1
                    # Cleanup because we failed too many times
                    # (guard needed: lst[-0:] would delete the whole list when rep == 0)
                    if rep > 0: #if we've already written an epidemic using this set of parameters
                        del epi_list[-rep:]
                        del count_list[-rep:]
                        del new_susc[-rep:]
                        del new_trans[-rep:]
                        del new_inf_period[-rep:]
                        del new_eps[-rep:]
                        count -= 1
                    break
            else:
                break # do not restart if we made it through the whole repetitions
    paras = np.array([np.asarray(new_susc), np.asarray(new_trans), np.asarray(new_inf_period), np.asarray(new_eps)]).T
    print 'number of parameter rows', paras[:,0].shape
    with open('parameters.txt', 'w') as newfile1:
        np.savetxt(newfile1, paras, fmt = ['%f', '%f', '%f', '%f'])
    print count

if __name__ == "__main__":
    pop = 625
    susc = 0.3
    trans = [1.5, 2.5, 3]
    inf_period = [2, 3]
    eps = [0, 0.01, 0.02, 0.05]
    fun(susc, trans, inf_period, eps, 400, pop)
And one last note: I am not entirely sure that modifying `trans[i]` in place is a good idea, as it affects not only this set of parameters but also every further combination using this particular value. Instead, I would only increment a local copy.
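A minimal illustration of the difference (toy loop bodies, not the real simulation):

```python
trans = [1.5, 2.5, 3]

# Mutating the list in place: the bumped value leaks into every
# later combination that would have reused 1.5.
for i in range(len(trans)):
    if trans[i] == 1.5:  # simulate a failing parameter set
        trans[i] += 1
assert trans == [2.5, 2.5, 3]

# Incrementing a local copy: the parameter grid stays intact.
trans = [1.5, 2.5, 3]
for transmissibility in trans:
    if transmissibility == 1.5:
        transmissibility += 1  # rebinds the local name only
assert trans == [1.5, 2.5, 3]
```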
Oh, and do something about these meaningless one-letter variable names:
import numpy as np
from scipy import spatial
import json
import itertools

MAX_FAILED_ATTEMPS = 50

def fun(susc, trans, inf_period, eps, repetitions, pop):
    epi_list = []
    count_list = []
    new_susc = []
    new_trans = []
    new_inf_period = []
    new_eps = []
    count = 0
    parameters_product = itertools.product(trans, inf_period, eps)
    for transmissibility, infectious_period, epsilon in parameters_product:
        while True:
            for rep in range(repetitions):
                for _ in range(MAX_FAILED_ATTEMPS):
                    g1 = external_function_call(
                        pop, susc, transmissibility,
                        infectious_period, epsilon, full_mat)
                    if len(g1) >= 10 and max(g1.values()) >= 10:
                        g2 = inf_per_count_time(g1)
                        count += 1
                        epi_list.append(g1)
                        count_list.append(g2)
                        new_susc.append(susc)
                        new_trans.append(transmissibility)
                        new_inf_period.append(infectious_period)
                        new_eps.append(epsilon)
                        break
                else:
                    transmissibility += 1
                    # Cleanup because we failed too many times
                    # (guard needed: lst[-0:] would delete the whole list when rep == 0)
                    if rep > 0:
                        # if we've already written an epidemic
                        # using this set of parameters
                        del epi_list[-rep:]
                        del count_list[-rep:]
                        del new_susc[-rep:]
                        del new_trans[-rep:]
                        del new_inf_period[-rep:]
                        del new_eps[-rep:]
                        count -= 1
                    break
            else:
                # do not restart if we made it through the whole repetitions
                break
    paras = np.array([
        np.asarray(new_susc),
        np.asarray(new_trans),
        np.asarray(new_inf_period),
        np.asarray(new_eps)
    ]).T
    print 'number of parameter rows', paras[:,0].shape
    with open('parameters.txt', 'w') as newfile1:
        np.savetxt(newfile1, paras, fmt = ['%f', '%f', '%f', '%f'])
    print count

if __name__ == "__main__":
    pop = 625
    susc = 0.3
    trans = [1.5, 2.5, 3]
    inf_period = [2, 3]
    eps = [0, 0.01, 0.02, 0.05]
    fun(susc, trans, inf_period, eps, 400, pop)
Comments:
- Very nice answer, +1 :) If you'd not thought of it, you can merge all the lists into one epidemic list. I personally would use a named tuple as the value. But it allows you to change `paras` to `[[e.susc, e.trans, e.inf_period, e.eps] for e in epidemics]` and change the print to `print '...', len(paras)`. (Peilonrayz, Dec 16, 2016 at 10:31)
- @Peilonrayz Yes, sure. I also thought of somehow using generators to avoid all these `del`s. But I wanted to center it around broad readability improvements and thoughts on possibly buggy behaviours. (301_Moved_Permanently, Dec 16, 2016 at 10:54)
- Great answer @MathiasEttinger! Thanks. The only issue is that I need to save `count_list` for each epidemic (the counts of infectious individuals for each time step). Right now, for some reason, `count_list` starts from `[]` for each set of reps. How can I fix that? (StatsSorceress, Dec 16, 2016 at 15:35)
- @StatsSorceress Did you check you're not hitting `MAX_FAILED_ATTEMPS` each time? (301_Moved_Permanently, Dec 16, 2016 at 17:17)
I already addressed readability and behaviour concerns in another answer, but I wanted to also address the performance aspect and point out some good practices.
The first thing to do when considering performance is to stop making assumptions and run the program through a profiler. Python makes this easy; all you need is to run your script from the command line using:
python -m cProfile name_of_the_script.py
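If you prefer to profile from inside Python rather than from the command line, the same information is available through `cProfile` and `pstats` (the `work` function here is a hypothetical stand-in for the simulation):

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the expensive simulation
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time to surface the expensive call chains first
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
assert 'function calls' in report
```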
So using the last version of my other answer and filling the blanks, i.e. adding code that will simulate very simple behaviour of the missing functions:
import random

full_mat = 'something'

def external_function_call(*args):
    return {key: random.randint(0, 20) for key in range(random.randint(0, 20))}

def inf_per_count_time(g):
    return min(g.values())
I get the following timings. Trimmed for the sake of readability:
number of parameter rows (9600,)
9600
815595 function calls (815474 primitive calls) in 0.770 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
<snip>
1 0.006 0.006 0.771 0.771 virus.py:1(<module>)
18388 0.100 0.000 0.504 0.000 virus.py:10(<dictcomp>)
9600 0.006 0.000 0.015 0.000 virus.py:13(inf_per_count_time)
1 0.044 0.044 0.714 0.714 virus.py:17(fun)
18388 0.022 0.000 0.578 0.000 virus.py:9(external_function_call)
<snip>
0.770 seconds... Far less than the announced 2.5 days of computation. What is also interesting to note is that the one call to `fun` takes 0.714 seconds and, within that call, the 18388 calls to `external_function_call` take 0.578 seconds.
This yields two conclusions: the extremely simplified version of `external_function_call` is called an average of 2 times per repetition; and it is taking nearly all the time of the computation, leaving little more than 100 ms for the loops and the writing to the file.
So, if you're looking for performance, you will have to:
- optimize `external_function_call` for speed;
- reduce the number of calls made to this function; this may mean using better input parameters.
Now, the fact that `fun` has almost no influence on the total running time of the program does not mean we should use inefficient constructs.
The first thing to consider is to use a single result list and store all 6 results at once. Doing that is pretty easy using `tuple`s, but Python provides a thin layer on top of them to make access to each element easier: `collections.namedtuple`s.
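A quick sketch of that namedtuple (same fields as in the rewrite below) to show the named access it buys:

```python
from collections import namedtuple

EpidemyStatistics = namedtuple(
    'EpidemyStatistics',
    'data count susceptibility transmissibility infectious epsilon')

stats = EpidemyStatistics(
    data={0: 12}, count=3, susceptibility=0.3,
    transmissibility=1.5, infectious=2, epsilon=0.01)

# Readable attribute access...
assert stats.transmissibility == 1.5
# ...while still behaving like a plain tuple.
assert stats[3] == 1.5
```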
Using them, your code can become:
from collections import namedtuple
import itertools
import csv

MAX_FAILED_ATTEMPS = 50

EpidemyStatistics = namedtuple(
    'EpidemyStatistics',
    'data count susceptibility transmissibility infectious epsilon')

def fun(susc, trans, inf_period, eps, repetitions, pop):
    epidemies = []
    count = 0
    parameters_product = itertools.product(trans, inf_period, eps)
    for transmissibility, infectious_period, epsilon in parameters_product:
        while True:
            for rep in range(repetitions):
                for _ in range(MAX_FAILED_ATTEMPS):
                    g1 = external_function_call(
                        pop, susc, transmissibility,
                        infectious_period, epsilon, full_mat)
                    if len(g1) >= 10 and max(g1.values()) >= 10:
                        g2 = inf_per_count_time(g1)
                        count += 1
                        epidemies.append(EpidemyStatistics(
                            g1, g2, susc, transmissibility,
                            infectious_period, epsilon))
                        break
                else:
                    transmissibility += 1
                    # Cleanup because we failed too many times
                    # (guard needed: lst[-0:] would delete the whole list when rep == 0)
                    if rep > 0:
                        # if we've already written an epidemic
                        # using this set of parameters
                        del epidemies[-rep:]
                        count -= 1
                    break
            else:
                # do not restart if we made it through the whole repetitions
                break
    parameters = [
        (e.susceptibility, e.transmissibility, e.infectious, e.epsilon)
        for e in epidemies
    ]
    print 'number of parameter rows', len(parameters)
    with open('parameters.txt', 'w') as newfile1:
        writer = csv.writer(newfile1, delimiter=' ')
        writer.writerows(parameters)
    print count

if __name__ == "__main__":
    pop = 625
    susc = 0.3
    trans = [1.5, 2.5, 3]
    inf_period = [2, 3]
    eps = [0, 0.01, 0.02, 0.05]
    fun(susc, trans, inf_period, eps, 400, pop)
Another improvement would be to avoid the need for `del` altogether, i.e. avoid storing elements that you will remove afterwards. One way to do it is to use a helper function that temporarily stores the results for each repetition and returns them, unless one repetition fails too many times:
from collections import namedtuple
import itertools
import csv

MAX_FAILED_ATTEMPS = 50

EpidemyStatistics = namedtuple(
    'EpidemyStatistics',
    'data count susceptibility transmissibility infectious epsilon')

def perform_repetitions(
        amount, population, susceptibility,
        transmissibility, infectious_period, epsilon):
    repetitions = []
    for _ in range(amount):
        for _ in range(MAX_FAILED_ATTEMPS):
            g1 = external_function_call(
                population, susceptibility, transmissibility,
                infectious_period, epsilon, full_mat)
            if len(g1) >= 10 and max(g1.values()) >= 10:
                g2 = inf_per_count_time(g1)
                repetitions.append(EpidemyStatistics(
                    g1, g2, susceptibility, transmissibility,
                    infectious_period, epsilon))
                break
        else:
            return
    return repetitions

def fun(susc, trans, inf_period, eps, repetitions, pop):
    epidemies = []
    parameters_product = itertools.product(trans, inf_period, eps)
    for transmissibility, infectious_period, epsilon in parameters_product:
        while True:
            statistics = perform_repetitions(
                repetitions, pop, susc, transmissibility,
                infectious_period, epsilon)
            if statistics is not None:
                epidemies.extend(statistics)
                break
            # Failed attempt, try with worse conditions
            transmissibility += 1
    print 'number of parameter rows', len(epidemies)
    with open('parameters.txt', 'w') as newfile1:
        writer = csv.writer(newfile1, delimiter=' ')
        for parameters in epidemies:
            writer.writerow((
                parameters.susceptibility,
                parameters.transmissibility,
                parameters.infectious,
                parameters.epsilon
            ))

if __name__ == "__main__":
    pop = 625
    susc = 0.3
    trans = [1.5, 2.5, 3]
    inf_period = [2, 3]
    eps = [0, 0.01, 0.02, 0.05]
    fun(susc, trans, inf_period, eps, 400, pop)
With these changes, performance of the loops is around 20% better, with nearly the same amount of time spent in `external_function_call`:
number of parameter rows 9600
777541 function calls in 0.692 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
<snip>
1 0.006 0.006 0.692 0.692 virus.py:1(<module>)
18506 0.027 0.000 0.566 0.000 virus.py:11(external_function_call)
18506 0.095 0.000 0.488 0.000 virus.py:12(<dictcomp>)
9600 0.007 0.000 0.016 0.000 virus.py:15(inf_per_count_time)
24 0.041 0.002 0.651 0.027 virus.py:24(perform_repetitions)
1 0.010 0.010 0.680 0.680 virus.py:45(fun)
<snip>
The last thing I'd like to talk about is how frozen this script feels: it's not friendly to modify when we want to change the simulation parameters. Better to allow the user to provide them on the command line, with sensible defaults if need be. `argparse` can greatly help there.
I also have the feeling that the susceptibility of the individuals to the disease would be a good candidate for parameter tweaking, so I would provide it as a list (even if this list contains only one value by default) and integrate it into the `itertools.product`:
from collections import namedtuple
import itertools
import csv
import argparse

MAX_FAILED_ATTEMPS = 50

EpidemyStatistics = namedtuple(
    'EpidemyStatistics',
    'data count susceptibility transmissibility infectious epsilon')

def perform_repetitions(
        amount, population, susceptibility,
        transmissibility, infectious_period, epsilon):
    repetitions = []
    for _ in range(amount):
        for _ in range(MAX_FAILED_ATTEMPS):
            g1 = external_function_call(
                population, susceptibility, transmissibility,
                infectious_period, epsilon, full_mat)
            if len(g1) >= 10 and max(g1.values()) >= 10:
                g2 = inf_per_count_time(g1)
                repetitions.append(EpidemyStatistics(
                    g1, g2, susceptibility, transmissibility,
                    infectious_period, epsilon))
                break
        else:
            return
    return repetitions

def simulation(
        susceptibilities, transmissibilities, infectious_periods,
        epsilons, repetitions, population):
    epidemies = []
    parameters = itertools.product(
        susceptibilities, transmissibilities,
        infectious_periods, epsilons)
    for susceptibility, transmissibility, infectious_period, epsilon in parameters:
        while True:
            statistics = perform_repetitions(
                repetitions, population, susceptibility,
                transmissibility, infectious_period, epsilon)
            if statistics is not None:
                epidemies.extend(statistics)
                break
            # Failed attempt, try with worse conditions
            transmissibility += 1
    print 'number of parameter rows', len(epidemies)
    with open('parameters.txt', 'w') as newfile1:
        writer = csv.writer(newfile1, delimiter=' ')
        for parameters in epidemies:
            writer.writerow((
                parameters.susceptibility,
                parameters.transmissibility,
                parameters.infectious,
                parameters.epsilon
            ))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Some infos here')
    parser.add_argument('population', type=int)
    parser.add_argument('-s', '--susceptibility', type=float, nargs='+', default=[0.3])
    parser.add_argument('-t', '--transmissibility', type=float, nargs='+', default=[1.5, 2.5, 3])
    parser.add_argument('-i', '--infectious-period', type=int, nargs='+', default=[2, 3])
    parser.add_argument('-e', '--epsilon', type=float, nargs='+', default=[0, 0.01, 0.02, 0.05])
    parser.add_argument('-r', '--repetitions', type=int, default=400)
    args = parser.parse_args()
    simulation(
        args.susceptibility,
        args.transmissibility,
        args.infectious_period,
        args.epsilon,
        args.repetitions,
        args.population)
Usage being
python script.py 625
or
python script.py 625 -r 200 -t 1.25 1.5 1.75 2 2.25 -i 4 5 6
Comment on the question: Without knowing what `external_function_call` or `inf_per_count_time` are, it's a bit pointless for the performance aspect. You also `open` two files (`f` and `h`) that you seem to never use in your loops; is that on purpose?