arr = np.array([Myclass(np.random.random(100)) for _ in range(10000)])
Is there a way to save time in this statement by creating a numpy array of objects directly (avoiding the list construction which is costly)?
I need to create and process a large number of objects of class Myclass, where each object contains several ints, several floats, and a list (or tuple) of floats. The point of using an array of objects is to take advantage of numpy's fast computation (e.g., column sums) on slices of the array of objects (and other stuff; each row of the array being sliced consists of one Myclass object plus other scalar fields). Other than using np.array (as above), is there any other time-saving strategy in this case? Thanks.
1 Answer
Numpy needs to know the length of the array in advance because it must allocate enough memory in a contiguous block.
You can start with an empty array of the appropriate type using np.empty(10_000, object). (Beware that for most data types empty arrays may contain garbage data; it's usually safer to start with np.zeros() unless you really need the performance. But dtype object does get properly initialized to Nones.)
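A quick check of that dtype-object special case (a minimal sketch, not from the original answer):

```python
import numpy as np

# For dtype object, np.empty pre-fills every cell with a None reference,
# unlike numeric dtypes, where the memory is left uninitialized.
obj_arr = np.empty(5, dtype=object)
print(obj_arr[0] is None)   # True

# A numeric empty array may contain arbitrary leftover values:
num_arr = np.empty(5)       # contents are whatever was in that memory block
```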
You can then apply any callable you like (like a class) over all the values using np.vectorize. It's faster to use numpy's included vectorized functions when you can instead of converting your own, since np.vectorize basically has to call the Python function for each element in a for loop. But sometimes you can't.
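For example, applying a class over an array with np.vectorize might look like this sketch (Myclass here is a hypothetical stand-in for the question's class):

```python
import numpy as np

# Hypothetical stand-in for the question's Myclass.
class Myclass:
    def __init__(self, data):
        self.data = data

# Wrap the constructor with np.vectorize; otypes=[object] keeps the
# result a dtype-object array. Internally this is still a Python-level
# loop, one call per element.
make = np.vectorize(lambda n: Myclass(np.random.random(n)), otypes=[object])

arr = make(np.full(10, 100))   # ten objects, each holding 100 samples
print(arr.shape, arr.dtype)    # (10,) object
```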
In the case of random numbers, you can create an array sample of any shape you like using np.random.rand(). It would still have to be converted to a new array of dtype object when you apply your class to it, though. I'm not sure if that's any faster than creating the samples in each __init__ (or whatever callable); you'd have to profile it.
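A profiling sketch of the two sampling strategies, using a hypothetical stand-in Myclass and timeit; the actual numbers will depend on your machine and class:

```python
import numpy as np
import timeit

class Myclass:  # hypothetical stand-in for the question's class
    def __init__(self, data):
        self.data = data

def per_object_sampling():
    # each constructor call draws its own 100 samples
    return np.array([Myclass(np.random.random(100)) for _ in range(1000)])

def bulk_sampling():
    # one bulk draw up front, then wrap each row; the wrapping loop
    # still runs in Python, so any gain is only in the sampling itself
    samples = np.random.rand(1000, 100)
    out = np.empty(1000, dtype=object)
    for i, row in enumerate(samples):
        out[i] = Myclass(row)
    return out

print(timeit.timeit(per_object_sampling, number=10))
print(timeit.timeit(bulk_sampling, number=10))
```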
3 Comments
object dtype is a special case for numpy.empty; the array is initialized full of references to None instead of remaining uninitialized. It would be a massive source of segfaults, bug reports, and security issues otherwise, as well as requiring NumPy to somehow track that an array cell does not contain a valid reference and should not be Py_DECREF'ed when a new reference is written into the cell.

np.frompyfunc is faster than np.vectorize. They're related, but one is more direct.
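To illustrate np.frompyfunc's signature: it takes the Python function plus the number of inputs and outputs, and its results are always dtype object (a minimal sketch, not from the original comment):

```python
import numpy as np

# np.frompyfunc(func, nin, nout): here 1 input, 1 output.
double = np.frompyfunc(lambda x: x * 2, 1, 1)
out = double(np.arange(5))
print(out, out.dtype)   # [0 2 4 6 8] object
```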
You can use np.frompyfunc to build and operate on arrays of custom class objects. It can be faster than iteration or list comprehensions, but the speed is nowhere near as fast as operations on a numeric-dtype array. Do some small-scale testing, including timings, before you invest too much effort into this project. Also look at using structured arrays to collect these attributes.

arr = np.frompyfunc(lambda x: Myclass(np.random.random(x)), 1, 1)(np.full(10000, 100))
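The structured-array suggestion above could look like this sketch; the field names ('a', 'b', 'vec') are made up for illustration, the idea being to store each object's ints, floats, and float vector as fields of one record instead of as a Python object:

```python
import numpy as np

# Hypothetical structured dtype: one int, one float, and a
# fixed-length float vector per record.
dt = np.dtype([('a', np.int64), ('b', np.float64), ('vec', np.float64, (100,))])
recs = np.zeros(10_000, dtype=dt)
recs['vec'] = np.random.rand(10_000, 100)

# Field access yields a plain numeric array, so reductions run at
# full numpy speed, e.g. the column sums mentioned in the question:
col_sums = recs['vec'].sum(axis=0)
print(col_sums.shape)   # (100,)
```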