Groupby in NumPy

Question 1

To avoid the XY problem, here is an example of what I need. Given the sorted integer input array
[1 1 1 2 2 3], I would like to produce the following slices, together with their index:

0: slice(0,3)
1: slice(3,5)
2: slice(5,6)

(Both the index and the slice will be used to index/slice into some other arrays.)

I think this could be solved with the itertools.groupby(). Partly because I am learning NumPy, and partly because evidence suggests that groupby() may be inefficient, I would like to do this in NumPy. The code below seems to do what I want.

from __future__ import print_function
import numpy as np
if __name__=='__main__':
 arr = np.array([1, 1, 1, 2, 2, 3])
 print(arr)
 mask = np.ediff1d(arr, to_begin=1, to_end=1)
 indices = np.where(mask)[0]
 for i in np.arange(indices.size-1):
 print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))

Is there a cleaner / more efficient way of doing this?

Question 2

The function scipy.ndimage.find_objects returns exactly the slices that you are looking for:

>>> a = np.array([1, 1, 1, 2, 2, 3])
>>> scipy.ndimage.find_objects(a)
[(slice(0, 3, None),), (slice(3, 5, None),), (slice(5, 6, None),)]

It's hard to make a fair timing comparison without knowing exactly how you are going to use these slices, but a quick test shows that this is about twice as fast as your strategy of using np.ediff2d followed by np.where.

A couple of general points on your code:

You use numpy.where with a single argument. This is the same as numpy.nonzero and it would be clearer in this case to use the latter.
But having made that change, you could use numpy.flatnonzero and so avoid the [0].

It's pointless to create an array if you are only going to iterate over it. Better to use an iterator. So instead of:

for i in np.arange(indices.size-1):
 print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))

you can write something like:

for i, j in enumerate(zip(indices, indices[1:])):
 print('{}: {}'.format(i, slice(*j)))

In response to comment: If you find the zip idiom unclear then you could encapsulate it:

from future_builtins import zip # Python 2 only
def pairs(a):
 """Return an iterator over pairs of adjacent elements in a.
 >>> list(pairs([1, 2, 3, 4]))
 [(1, 2), (2, 3), (3, 4)]
 """
 return zip(a[:-1], a[1:])

and then the loop would become:

for i, j in enumerate(pairs(indices)):
 print('{}: {}'.format(i, slice(*j)))

which ought to be clear.

A fancier alternative is to use the (undocumented) numpy.lib.stride_tricks.as_strided to make an array of pairs of adjacent elements without allocating a new array:

from numpy.lib.stride_tricks import as_strided
def pairs(a):
 """Return array of pairs of adjacent elements in a.
 >>> pairs([1, 2, 3, 4])
 array([[1, 2],
 [2, 3],
 [3, 4]])
 """
 a = np.asarray(a)
 return as_strided(a, shape=(a.size - 1, 2), strides=a.strides * 2)

The documented way to make such an array is np.c_[indices[:-1], indices[1:]] but that allocates a new array.

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2014-08-23 17:29:47Z

The function scipy.ndimage.find_objects returns exactly the slices that you are looking for:

>>> a = np.array([1, 1, 1, 2, 2, 3])
>>> scipy.ndimage.find_objects(a)
[(slice(0, 3, None),), (slice(3, 5, None),), (slice(5, 6, None),)]

It's hard to make a fair timing comparison without knowing exactly how you are going to use these slices, but a quick test shows that this is about twice as fast as your strategy of using np.ediff2d followed by np.where.

A couple of general points on your code:

You use numpy.where with a single argument. This is the same as numpy.nonzero and it would be clearer in this case to use the latter.
But having made that change, you could use numpy.flatnonzero and so avoid the [0].

It's pointless to create an array if you are only going to iterate over it. Better to use an iterator. So instead of:

for i in np.arange(indices.size-1):
 print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))

you can write something like:

for i, j in enumerate(zip(indices, indices[1:])):
 print('{}: {}'.format(i, slice(*j)))

In response to comment: If you find the zip idiom unclear then you could encapsulate it:

from future_builtins import zip # Python 2 only
def pairs(a):
 """Return an iterator over pairs of adjacent elements in a.
 >>> list(pairs([1, 2, 3, 4]))
 [(1, 2), (2, 3), (3, 4)]
 """
 return zip(a[:-1], a[1:])

and then the loop would become:

for i, j in enumerate(pairs(indices)):
 print('{}: {}'.format(i, slice(*j)))

which ought to be clear.

A fancier alternative is to use the (undocumented) numpy.lib.stride_tricks.as_strided to make an array of pairs of adjacent elements without allocating a new array:

from numpy.lib.stride_tricks import as_strided
def pairs(a):
 """Return array of pairs of adjacent elements in a.
 >>> pairs([1, 2, 3, 4])
 array([[1, 2],
 [2, 3],
 [3, 4]])
 """
 a = np.asarray(a)
 return as_strided(a, shape=(a.size - 1, 2), strides=a.strides * 2)

The documented way to make such an array is np.c_[indices[:-1], indices[1:]] but that allocates a new array.

Stack Exchange Network

Groupby in NumPy

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Groupby in NumPy

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions