
I recently signed up for a scientific Python course, and it was shown in class how a NumPy array can perform better than a list in some situations. For a statistical simulation, I tried the two approaches and, surprisingly, the NumPy array takes a lot longer to finish. Could someone help me find my (possible) mistake?

My first thought was that something is probably wrong with the way the code is written, but I can't figure out what. The script calculates how many attempts, on average, someone needs to complete a collection of stickers:

Python list

I used a function and no external modules.

import random as rd
import statistics as st

def collectStickers(experiments, collectible):
    obtained = []
    attempts = 0
    while len(obtained) < collectible:
        new_sticker = rd.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1
    experiments.append(attempts)

experiments = []
collectible = 20
rep_experiment = 100000

for i in range(1, rep_experiment):
    collectStickers(experiments, collectible)

print(st.mean(experiments))

Results

The processing time seems OK for a simple experiment like this one, but 13.8 seconds is too much for more complex purposes.

72.06983069830699
[Finished in 13.8s]
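As a sanity check (my addition, not part of the original post): this is the classic coupon-collector problem, whose expected number of draws is n·H_n, where H_n is the n-th harmonic number. For n = 20 that is about 71.95, which matches the simulated mean:

```python
# Coupon-collector expectation: n * H_n draws to collect all n stickers.
n = 20
expected = n * sum(1 / k for k in range(1, n + 1))
print(expected)  # ~71.95, close to the simulated 72.07
```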

Numpy

I could not use a function, as the following warnings showed up when I followed the same logic as above:

  • RuntimeWarning: Mean of empty slice.

  • RuntimeWarning: invalid value encountered in double_scalars
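For what it's worth, those are the warnings NumPy emits when `np.mean` is called on an empty array; here is a minimal repro (my guess at the cause, not code from the post -- the exact wording of the second warning varies across NumPy versions):

```python
import warnings
import numpy as np

# np.mean on an empty array returns nan and raises RuntimeWarnings
# ("Mean of empty slice", plus an invalid-value warning from the 0/0 divide).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.mean(np.array([]))

print(result)            # nan
print(len(caught) >= 1)  # True -- at least the empty-slice warning fired
```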

So I just went for the naive way:

import random as rd
import numpy as np

experiments = np.array([])
rep_experiment = 100000

for i in range(1, rep_experiment):
    obtained = np.array([])
    attempts = 0
    while len(obtained) < 20:
        new_sticker = rd.randint(1, 20)
        if new_sticker not in obtained:
            obtained = np.append(obtained, new_sticker)
        attempts += 1
    experiments = np.append(experiments, attempts)

print(np.mean(experiments))

Results

Almost 4x slower!

Is the difference due to the use of a function?

72.03112031120311
[Finished in 54.2s]
asked Mar 16, 2021 at 15:36
  • @TedKleinBergman Over 100 000 iterations, that shouldn't matter very much at all. Commented Mar 16, 2021 at 15:45
  • @TedKleinBergman. That really has nothing to do with it. Commented Mar 16, 2021 at 15:47
  • Direct translations of list-oriented code to numpy are usually slower. numpy is best when using its whole-array methods. Iterating through an array one element at a time is slow, as is building an array iteratively (np.append is not a list-append clone). np.random is fast -- if you ask for 1000s of random numbers at once. Effective use of numpy requires looking at the problem in a different way, from the top down. Commented Mar 16, 2021 at 17:24
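The last comment's point about np.random can be illustrated with a quick sketch (my own, using the newer `np.random.default_rng` API rather than anything from the thread): one vectorized call produces all the draws at once, paying the per-call Python overhead a single time instead of 100,000 times.

```python
import numpy as np

# Draw one sticker (1..20, matching the post) for 100,000 experiments
# in a single vectorized call.
rng = np.random.default_rng(seed=42)
draws = rng.integers(1, 21, size=100_000)  # high bound is exclusive

print(draws.shape)  # (100000,)
print(draws.min(), draws.max())  # stays within 1..20
```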

2 Answers


To really take advantage of the power of numpy arrays, you need to program in a numpy way. For example, try to vectorize the experiment like this:

import numpy as np

def vectorized():
    rep_experiment = 100000
    collectible = 20
    # one flag per experiment: True once the collection is complete
    obtained = np.zeros(rep_experiment, dtype=bool)
    attempts = np.zeros(rep_experiment, dtype=int)
    # one row per experiment, one column per sticker
    targets = np.zeros((rep_experiment, collectible), dtype=bool)
    x = np.arange(0, rep_experiment, step=1, dtype=int)
    while False in targets:
        r = np.random.randint(0, collectible, size=rep_experiment)
        # add the new stickers to the collected targets
        targets[x, r] = True
        # if all collected, set obtained to True
        obtained[np.sum(targets, axis=1) == collectible] = True
        # increment attempts for the experiments not yet finished
        attempts[~obtained] += 1
    print(attempts.mean(), attempts.std())

Check the speed-up; it is around 50x for me.

answered Mar 16, 2021 at 16:16

np.append copies the array before appending to it.

Your program is probably spending most of its time doing those unnecessary copies in

experiments = np.append(experiments, attempts)
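A quick way to see the copy (a sketch of my own, not from the answer):

```python
import numpy as np

a = np.zeros(3)
b = np.append(a, 1.0)  # allocates a brand-new array; `a` is copied into it

print(b is a)  # False -- a new object every call
a[0] = 99.0
print(b[0])    # 0.0 -- b holds its own copy, untouched by later writes to a
```

Appending one element at a time therefore copies the whole array on every iteration, so the total work grows quadratically with the number of appends.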

EDIT

As expected, replacing the quadratic-esque np.append() with a predefined array makes the wrapper function approximately the same speed.

Replacing the list of obtained stickers with a set makes things a little faster.

However, the bottleneck is a slow random number generator. Running through cProfile reveals that 75% of the execution time is spent in randint().

See below the code for the results (on my machine).

import random
import statistics
import timeit

import numpy as np

collectible = 20
rep_experiment = 10000

def original_collect_stickers():
    obtained = []
    attempts = 0
    while len(obtained) < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1
    return attempts

def set_collect_stickers():
    obtained = set()
    attempts = 0
    n = 0
    while n < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts

def repeat_with_list(fn):
    experiments = []
    for i in range(rep_experiment):
        experiments.append(fn())
    return statistics.mean(experiments)

def repeat_with_numpy(fn):
    experiments = np.zeros(rep_experiment)
    for i in range(rep_experiment):
        experiments[i] = fn()
    return np.mean(experiments)

def time_fn(name, fn, n=3):
    time_taken = timeit.timeit(fn, number=n) / n
    result = fn()  # once more to get the result too
    print(f"{name:15}: {time_taken:.6f}, result {result}")

for wrapper in (repeat_with_list, repeat_with_numpy):
    for fn in (original_collect_stickers, set_collect_stickers):
        time_fn(f"{wrapper.__name__} {fn.__name__}", lambda: wrapper(fn))

The result is

repeat_with_list original_collect_stickers: 0.747183, result 71.7912
repeat_with_list set_collect_stickers: 0.688952, result 72.1002
repeat_with_numpy original_collect_stickers: 0.752644, result 72.0978
repeat_with_numpy set_collect_stickers: 0.685355, result 71.7515

EDIT 2

Using the fastrand library's pcg32bounded() generator, i.e. new_sticker = fastrand.pcg32bounded(collectible) makes things plenty fast:

repeat_with_list original_collect_stickers: 0.761186, result 72.0185
repeat_with_list set_collect_stickers: 0.690244, result 71.878
repeat_with_list set_collect_stickers_fastrand: 0.116410, result 71.9323
repeat_with_numpy original_collect_stickers: 0.759154, result 71.8604
repeat_with_numpy set_collect_stickers: 0.696563, result 71.5482
repeat_with_numpy set_collect_stickers_fastrand: 0.114212, result 71.6369
answered Mar 16, 2021 at 15:38
  • Thank you for such a complete answer, it was really helpful! Commented Mar 16, 2021 at 17:01
