I'm trying to write a short random walk simulator in which an agent has a 50% chance on each "round" of either moving one step toward x or one step back toward its initial position. If the agent is already at its initial position and has to move back, it stays there instead.
I'm getting a lot of variance in the results and running the function with a distance to x of 2000 steps is already taking a while.
I was thinking I'd run the simulator with x = 200 steps 100 times and take the average of that, but with 40,200 steps on average when x is 200 steps away, it's taking almost 10 seconds to do the 100 runs, which seems extremely slow.
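(As a quick sanity check on that figure: for a symmetric walk reflected at the origin, the expected number of rounds to first reach \$x\$ works out to \$x^2 + x\,ドル which for \$x = 200\$ gives \200ドル^2 + 200 = 40200\$.)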
Are there things I could optimize in my code to make it faster? Also, if you see glaring problems with the design, suboptimal memory usage, etc., please don't hesitate to comment on those as well, thank you.
PS: I use a random integer function over [0, 1] for my 50% probability.
import numpy as np
import random as rand

def randomWalk(x):
 position = 0
 steps = 0
 while (position != x):
 move = rand.randint(0,1)
 steps += 1
 if (move == 1):
 position += 1
 elif (move == 0) & (position != 0):
 position -= 1
 return steps

trials = [0] * 100
for i in range(100):
 trials[i] = randomWalk(200)

sum = 0.
for trial in trials:
 sum += trial

avg = 0.
avg = sum / len(trials)

print avg
2 Answers
Performance
I threw a profiling program at your code (the Spyder IDE built-in profiler). The results showed that most of the time is spent in rand.randint(0, 1). Instead of calling rand.randint(0, 1) and checking for 0 and 1, use rand.random(). This will give you a random number in the interval \$[0, 1)\,ドル and your condition then becomes \$< 0.5\$. On my machine this reduced the running time from about \10ドル\$-\12ドル\,s\$ to around \1ドル\$-\1ドル.5\,s\,ドル which is roughly an order of magnitude.
My theory on this is that Python builds rand.randint(low, high) on top of rand.random() (or at least on the same source of random values) and maps those random values to integers based on the given limits. This mapping probably introduces a small overhead, which is not big enough to be a real problem if you do not call the function too often. But in your program the function is called a lot (the profiler says more than 4 million times), so the overhead adds up and becomes noticeable.
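If you want to verify this on your own machine, a quick micro-benchmark could look like the sketch below (my own check, not part of the profiling run; the absolute numbers will differ per machine and Python version):

import timeit

# Time one million calls of each variant; randint() does extra argument
# handling on every call, which is where the overhead comes from.
n = 10**6
t_randint = timeit.timeit("rand.randint(0, 1)", setup="import random as rand", number=n)
t_random = timeit.timeit("rand.random()", setup="import random as rand", number=n)
print("randint: {:.2f}s random: {:.2f}s".format(t_randint, t_random))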
Code and style
Style
I don't know if there is a style requirement where you are writing this program, but in general, Python variable and function names should be all_lower_case (see the infamous PEP8).
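For your code that mostly means renaming the function (a rename-only sketch, behaviour unchanged):

# snake_case per PEP8; only the name differs
def random_walk(x): # instead of randomWalk(x)
 pass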
Documentation
It is always nice to have at least a little documentation for your functions, especially since Python is a dynamically typed language and others (or a future version of you) may find the code somewhere and want to reuse it. Again, there is a nice reference for Python (link).
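For example, a docstring along these lines (my wording, just to illustrate the idea) would already help a lot:

def randomWalk(x):
 """Return the number of rounds a walk reflected at 0 needs to reach x.

 Each round the walker steps +1 toward x with probability 0.5;
 otherwise it steps -1, unless it is already at the starting position.
 """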
Comparison operator
There is no need to check for move == 0 in the elif part, since you know the value can only be 0 or 1. Please use and, and not the bitwise AND &, when you concatenate booleans. See this SO question for an elaborate explanation of the differences.
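A short demonstration of why the distinction matters: & is a bitwise operator that binds tighter than comparisons and does not short-circuit, so dropping the parentheses silently changes the meaning (toy values chosen for illustration):

move, position = 0, 5
# With explicit parentheses both variants happen to agree here ...
print((move == 0) & (position != 0)) # True (bitwise AND of two bools, no short-circuit)
print((move == 0) and (position != 0)) # True (boolean AND, short-circuits)
# ... but & binds tighter than ==, so without parentheses this parses as
# move == (0 & position) != 0, a chained comparison, and prints False:
print(move == 0 & position != 0) # False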
List comprehension
There is no real need to preallocate space with trials = [0] * 100. You should use a list comprehension instead:
trials = [randomWalk(200) for _ in xrange(100)]
Result
I had a few moments of time, so I played a little with your code. I implemented the suggestions from my answer, added some documentation, introduced another function, added some variables, and used numpy to have a little bit of extra fun. The code I came up with follows.
import numpy as np
import random


def random_walk(x):
 """Simulate a single random walk"""
 position = 0
 steps = 0
 while (position != x):
 move = random.random()
 steps += 1
 if move < 0.5:
 position += 1
 elif position != 0:
 position -= 1
 return steps


def random_walk_simulator(n_walks=100, walk_distance=200):
 """Simulate a bunch of random walks"""
 trials = np.array([random_walk(walk_distance) for _ in range(n_walks)])
 print("Mean # of steps: {:.2f}".format(trials.mean()))
 print("STD # of steps: {:.2f}".format(trials.std()))


if __name__ == '__main__':
 random_walk_simulator(100, 200)
A note on numpy: I cannot say for sure whether there is a real performance difference between using numpy and a plain list for calculating the mean values etc. From my experience I know that numpy has some seriously optimised algorithms that allow fast and numerically safe(r) computations on large arrays. Additionally, numpy allows convenient handling of lists/arrays/matrices with more than one dimension. So in your case (without an in-depth analysis) it basically comes down to personal preference and maybe convenience.
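To give a feel for the convenience side, here is a tiny comparison (the sample values are made up for the example, not measured results):

import numpy as np

trials = [180, 240, 199, 305] # made-up trial lengths, just for illustration

# plain-Python mean
mean_py = sum(trials) / float(len(trials))

# numpy: mean and standard deviation as short, optimised calls
arr = np.array(trials)
print(mean_py, arr.mean(), arr.std())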
- Thank you for all the advice, this is very helpful. Could I ask you what the profiling program you used is? I didn't know such a thing existed. Do you have a theory as to why changing from randint to random saves that much time? Lastly, what is the difference between using 'and' and '&'? – jeremy radcliff, Mar 21, 2016 at 18:03
- Sorry, just one more thing: is using a numpy array faster than a list when you run computations on its elements (such as sum, mean, std, etc.)? – jeremy radcliff, Mar 21, 2016 at 18:08
- @jeremyradcliff: Please see my edits. – AlexV, Mar 21, 2016 at 18:40
- Thanks for the links and clarifications. I really learned a lot from your post, that was great! – jeremy radcliff, Mar 21, 2016 at 19:10
trials = [0] * 100
for i in range(100):
 trials[i] = randomWalk(200)
You don't need to create a list of zeroes in order to create a list of random walks. Just use a list comprehension:
trials = [randomWalk(200) for _ in xrange(100)] # In Python3, you should use range().
sum = 0.
for trial in trials:
 sum += trial

avg = 0.
avg = sum / len(trials)
You shouldn't use sum as a variable name because it conflicts with the built-in function. You know what? You can use that built-in function:
avg = sum(trials, 0.) / len(trials)
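If you ever move to Python 3.4 or newer, the standard-library statistics module is another option for this (a small sketch on my part, not something your current code uses):

from statistics import mean # Python 3.4+

avg = mean(trials) # arithmetic mean of the trial lengths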
print avg
I'm assuming that you are using Python 2 just because of that line (and because of the decimal points in your numbers). If you put parentheses around avg, your program would run in both versions. Since I used xrange() above, which no longer exists in Python 3, you can't now; but you could also use range() up there to make it compatible with both. range() takes up more memory than xrange() in Python 2, but with only a hundred numbers it shouldn't really make much difference anyway.
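Putting those pieces together, a version of the driver code that runs unchanged on both Python 2 and Python 3 might look like this (a sketch, reusing your randomWalk as-is):

trials = [randomWalk(200) for _ in range(100)] # range() exists in both versions
avg = sum(trials, 0.) / len(trials) # built-in sum with a float start value
print(avg) # with parentheses this is valid in Python 2 and Python 3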
- Why do you import numpy if you're not using it?