arr = np.array([Myclass(np.random.random(100)) for _ in range(10000)])
Is there a way to save time in this statement by creating a numpy array of objects directly (avoiding the list construction which is costly)?
I need to create and process a large number of objects of class Myclass, where each object contains several ints, several floats, and a list (or tuple) of floats. The point of using an array of objects is to take advantage of numpy's fast computation (e.g., column sums) on slices of the array of objects (and other stuff; each row of the array being sliced consists of one Myclass object plus other scalar fields). Other than using np.array (as above), is there any other time-saving strategy in this case? Thanks.
1 Answer
Numpy needs to know the length of the array in advance because it must allocate enough memory in a contiguous block.
You can start with an empty array of the appropriate type using np.empty(10_000, object). (Beware that for most data types empty arrays may contain garbage data; it's usually safer to start with np.zeros() unless you really need the performance. But dtype object does get properly initialized to Nones.)
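A quick check of that dtype-object special case (a minimal sketch, not from the original answer):

```python
import numpy as np

# For dtype object, np.empty pre-fills every cell with a None reference,
# unlike numeric dtypes, where the memory is left uninitialized.
obj_arr = np.empty(5, dtype=object)
print(obj_arr[0] is None)   # True

# A numeric empty array may contain arbitrary leftover values:
num_arr = np.empty(5)       # contents are whatever was in that memory block
```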
You can then apply any callable you like (like a class) over all the values using np.vectorize. It's faster to use numpy's included vectorized functions when you can instead of converting your own, since np.vectorize basically has to call the Python function for each element in a for loop. But sometimes you can't.
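For example, applying a class over an array with np.vectorize might look like this sketch (Myclass here is a hypothetical stand-in for the question's class):

```python
import numpy as np

# Hypothetical stand-in for the question's Myclass.
class Myclass:
    def __init__(self, data):
        self.data = data

# Wrap the constructor with np.vectorize; otypes=[object] keeps the
# result a dtype-object array. Internally this is still a Python-level
# loop, one call per element.
make = np.vectorize(lambda n: Myclass(np.random.random(n)), otypes=[object])

arr = make(np.full(10, 100))   # ten objects, each holding 100 samples
print(arr.shape, arr.dtype)    # (10,) object
```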
In the case of random numbers, you can create an array sample of any shape you like using np.random.rand(). It would still have to be converted to a new array of dtype object when you apply your class to it, though. I'm not sure if that's any faster than creating the samples in each __init__ (or whatever callable); you'd have to profile it.
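A profiling sketch of the two sampling strategies, using a hypothetical stand-in Myclass and timeit; the actual numbers will depend on your machine and class:

```python
import numpy as np
import timeit

class Myclass:  # hypothetical stand-in for the question's class
    def __init__(self, data):
        self.data = data

def per_object_sampling():
    # each constructor call draws its own 100 samples
    return np.array([Myclass(np.random.random(100)) for _ in range(1000)])

def bulk_sampling():
    # one bulk draw up front, then wrap each row; the wrapping loop
    # still runs in Python, so any gain is only in the sampling itself
    samples = np.random.rand(1000, 100)
    out = np.empty(1000, dtype=object)
    for i, row in enumerate(samples):
        out[i] = Myclass(row)
    return out

print(timeit.timeit(per_object_sampling, number=10))
print(timeit.timeit(bulk_sampling, number=10))
```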
3 Comments
object dtype is a special case for numpy.empty; the array is initialized full of references to None instead of remaining uninitialized. It would be a massive source of segfaults, bug reports, and security issues otherwise, as well as requiring NumPy to somehow track that an array cell does not contain a valid reference and should not be Py_DECREF'ed when a new reference is written into the cell.

np.frompyfunc is faster than np.vectorize. They're related, but one is more direct.
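To illustrate np.frompyfunc's signature: it takes the Python function plus the number of inputs and outputs, and its results are always dtype object (a minimal sketch, not from the original comment):

```python
import numpy as np

# np.frompyfunc(func, nin, nout): here 1 input, 1 output.
double = np.frompyfunc(lambda x: x * 2, 1, 1)
out = double(np.arange(5))
print(out, out.dtype)   # [0 2 4 6 8] object
```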
You can use np.frompyfunc to build and operate on arrays of custom class objects. It can be faster than iteration or list comprehensions, but the speed is nowhere near as fast as operations on a numeric-dtype array. Do some small-scale testing, including timings, before you invest too much effort into this project. Also look at using structured arrays to collect these attributes.

arr = np.frompyfunc(lambda x: Myclass(np.random.random(x)), 1, 1)(np.full(10000, 100))
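The structured-array suggestion above could look like this sketch; the field names ('a', 'b', 'vec') are made up for illustration, the idea being to store each object's ints, floats, and float vector as fields of one record instead of as a Python object:

```python
import numpy as np

# Hypothetical structured dtype: one int, one float, and a
# fixed-length float vector per record.
dt = np.dtype([('a', np.int64), ('b', np.float64), ('vec', np.float64, (100,))])
recs = np.zeros(10_000, dtype=dt)
recs['vec'] = np.random.rand(10_000, 100)

# Field access yields a plain numeric array, so reductions run at
# full numpy speed, e.g. the column sums mentioned in the question:
col_sums = recs['vec'].sum(axis=0)
print(col_sums.shape)   # (100,)
```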