
I recently signed up for a scientific Python course, and it was shown in class how a NumPy array can perform better than a list in some situations. For a statistical simulation, I tried the two approaches and, surprisingly, the NumPy array takes a lot longer to finish. Could someone help me find my (possible) mistake?

My first thought was that something is probably wrong with the way the code is written, but I can't figure out what. The script calculates how many attempts, on average, someone needs to complete a collection of stickers:

Python list

I used a function and no external modules.

import random as rd
import statistics as st

def collectStickers(experiments, collectible):
    obtained = []
    attempts = 0
    while len(obtained) < collectible:
        new_sticker = rd.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1
    experiments.append(attempts)

experiments = []
collectible = 20
rep_experiment = 100000

for i in range(1, rep_experiment):
    collectStickers(experiments, collectible)

print(st.mean(experiments))

Results

The processing time seems OK for a simple experiment like this one, but 13.8 seconds is too much for more complex purposes.

72.06983069830699
[Finished in 13.8s]
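As a sanity check (my addition, not part of the original post): this is the classic coupon-collector problem, whose expected number of draws is n·H_n, where H_n is the n-th harmonic number. For n = 20 that is about 71.95, which matches the simulated mean:

```python
# Coupon-collector expectation: n * H_n draws to collect all n stickers.
n = 20
expected = n * sum(1 / k for k in range(1, n + 1))
print(expected)  # ~71.95, close to the simulated 72.07
```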

Numpy

I could not use a function, as the following warnings showed up when I followed the same logic as above:

  • RuntimeWarning: Mean of empty slice.

  • RuntimeWarning: invalid value encountered in double_scalars
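For what it's worth, those are the warnings NumPy emits when `np.mean` is called on an empty array; here is a minimal repro (my guess at the cause, not code from the post -- the exact wording of the second warning varies across NumPy versions):

```python
import warnings
import numpy as np

# np.mean on an empty array returns nan and raises RuntimeWarnings
# ("Mean of empty slice", plus an invalid-value warning from the 0/0 divide).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.mean(np.array([]))

print(result)            # nan
print(len(caught) >= 1)  # True -- at least the empty-slice warning fired
```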

So I just went for the naive way:

import random as rd
import numpy as np

experiments = np.array([])
rep_experiment = 100000

for i in range(1, rep_experiment):
    obtained = np.array([])
    attempts = 0
    while len(obtained) < 20:
        new_sticker = rd.randint(1, 20)
        if new_sticker not in obtained:
            obtained = np.append(obtained, new_sticker)
        attempts += 1
    experiments = np.append(experiments, attempts)

print(np.mean(experiments))

Results

Almost 4x slower!

Is the difference due to the use of a function?

72.03112031120311
[Finished in 54.2s]
asked Mar 16, 2021 at 15:36
  • @TedKleinBergman Over 100 000 iterations, that shouldn't matter very much at all. Commented Mar 16, 2021 at 15:45
  • @TedKleinBergman. That really has nothing to do with it. Commented Mar 16, 2021 at 15:47
  • Direct translations of list-oriented code to numpy are usually slower. numpy is best when using its whole-array methods. Iterating through an array one element at a time is slow, as is building an array iteratively (np.append is not a list-append clone). np.random is fast -- if you ask for 1000s of random numbers at once. Effective use of numpy requires looking at the problem in a different way, from the top down. Commented Mar 16, 2021 at 17:24
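The last comment's point about np.random can be illustrated with a quick sketch (my own, using the newer `np.random.default_rng` API rather than anything from the thread): one vectorized call produces all the draws at once, paying the per-call Python overhead a single time instead of 100,000 times.

```python
import numpy as np

# Draw one sticker (1..20, matching the post) for 100,000 experiments
# in a single vectorized call.
rng = np.random.default_rng(seed=42)
draws = rng.integers(1, 21, size=100_000)  # high bound is exclusive

print(draws.shape)  # (100000,)
print(draws.min(), draws.max())  # stays within 1..20
```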

2 Answers


To really take advantage of the power of numpy arrays, you need to program in a numpy way. For example, try to vectorize the experiment like this:

import numpy as np

def vectorized():
    rep_experiment = 100000
    collectible = 20
    # one flag per experiment: True once the collection is complete
    obtained = np.zeros(rep_experiment, dtype=bool)
    attempts = np.zeros(rep_experiment, dtype=int)
    # one row per experiment, one column per sticker
    targets = np.zeros((rep_experiment, collectible), dtype=bool)
    x = np.arange(0, rep_experiment, step=1, dtype=int)
    while False in targets:
        r = np.random.randint(0, collectible, size=rep_experiment)
        # add the new stickers to the collected targets
        targets[x, r] = True
        # if all collected, set obtained to True
        obtained[np.sum(targets, axis=1) == collectible] = True
        # increment attempts for the experiments not yet finished
        attempts[~obtained] += 1
    print(attempts.mean(), attempts.std())

Check the speed-up; it is around 50x for me.

answered Mar 16, 2021 at 16:16

np.append copies the array before appending to it.

Your program is probably spending most of its time doing those unnecessary copies in

experiments = np.append(experiments, attempts)
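A quick way to see the copy (a sketch of my own, not from the answer):

```python
import numpy as np

a = np.zeros(3)
b = np.append(a, 1.0)  # allocates a brand-new array; `a` is copied into it

print(b is a)  # False -- a new object every call
a[0] = 99.0
print(b[0])    # 0.0 -- b holds its own copy, untouched by later writes to a
```

Appending one element at a time therefore copies the whole array on every iteration, so the total work grows quadratically with the number of appends.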

EDIT

As expected, replacing the quadratic-esque np.append() with a predefined array makes the wrapper function approximately the same speed.

Replacing the list of obtained stickers with a set makes things a little faster.

However, the bottleneck is a slow random number generator. Running through cProfile reveals that 75% of the execution time is spent in randint().

See below the code for the results (on my machine).

import random
import statistics
import timeit

import numpy as np

collectible = 20
rep_experiment = 10000

def original_collect_stickers():
    obtained = []
    attempts = 0
    while len(obtained) < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1
    return attempts

def set_collect_stickers():
    obtained = set()
    attempts = 0
    n = 0
    while n < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts

def repeat_with_list(fn):
    experiments = []
    for i in range(rep_experiment):
        experiments.append(fn())
    return statistics.mean(experiments)

def repeat_with_numpy(fn):
    experiments = np.zeros(rep_experiment)
    for i in range(rep_experiment):
        experiments[i] = fn()
    return np.mean(experiments)

def time_fn(name, fn, n=3):
    time_taken = timeit.timeit(fn, number=n) / n
    result = fn()  # once more to get the result too
    print(f"{name:15}: {time_taken:.6f}, result {result}")

for wrapper in (repeat_with_list, repeat_with_numpy):
    for fn in (original_collect_stickers, set_collect_stickers):
        time_fn(f"{wrapper.__name__} {fn.__name__}", lambda: wrapper(fn))

The result is

repeat_with_list original_collect_stickers: 0.747183, result 71.7912
repeat_with_list set_collect_stickers: 0.688952, result 72.1002
repeat_with_numpy original_collect_stickers: 0.752644, result 72.0978
repeat_with_numpy set_collect_stickers: 0.685355, result 71.7515

EDIT 2

Using the fastrand library's pcg32bounded() generator, i.e. new_sticker = fastrand.pcg32bounded(collectible) makes things plenty fast:

repeat_with_list original_collect_stickers: 0.761186, result 72.0185
repeat_with_list set_collect_stickers: 0.690244, result 71.878
repeat_with_list set_collect_stickers_fastrand: 0.116410, result 71.9323
repeat_with_numpy original_collect_stickers: 0.759154, result 71.8604
repeat_with_numpy set_collect_stickers: 0.696563, result 71.5482
repeat_with_numpy set_collect_stickers_fastrand: 0.114212, result 71.6369
answered Mar 16, 2021 at 15:38
  • Thank you for such a complete answer, it was really helpful! Commented Mar 16, 2021 at 17:01
