GroupBy: Group and Bin Data#

Often we want to bin or group data, produce statistics (mean, variance) on the groups, and then return a reduced data set. To do this, Xarray supports "group by" operations with the same API as pandas to implement the split-apply-combine strategy:

  • Split your data into multiple independent groups.

  • Apply some function to each group.

  • Combine your groups back into a single data object.
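The three steps above can be spelled out by hand and compared with a single `groupby` call. This is a minimal sketch on a small made-up array (the values and the name `manual` are illustrative, not part of the original example):

```python
import numpy as np
import xarray as xr

# Small illustrative array; "letters" labels each position along "x".
arr = xr.DataArray(
    [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [30.0, 40.0, 50.0], [3.0, 4.0, 5.0]],
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    name="foo",
)

labels = ["a", "b"]

# Split: one sub-array per label.
pieces = [arr.where(arr["letters"] == label, drop=True) for label in labels]

# Apply: reduce each piece independently.
reduced = [piece.mean("x") for piece in pieces]

# Combine: stack the per-group results along a new "letters" dimension.
manual = xr.concat(reduced, dim="letters").assign_coords(letters=labels)

# The same pipeline expressed with groupby.
grouped = arr.groupby("letters").mean("x").transpose("letters", "y")
```

Both `manual` and `grouped` hold one row of per-`y` means for each label.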

GroupBy operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although grouping over a multi-dimensional variable is also supported (see Multidimensional Grouping). Note that for one-dimensional data, it is usually faster to rely on pandas’ implementation of the same pipeline.

Tip

Install the flox package to substantially improve the performance of GroupBy operations, particularly with dask. flox extends Xarray’s built-in GroupBy capabilities by allowing grouping by multiple variables and lazy grouping by dask arrays. If flox is installed, Xarray uses it automatically by default.
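To check that a result does not depend on which code path is taken, the reduction can be run once with the default settings and once with flox disabled. This sketch assumes the `use_flox` option of `xr.set_options`; it runs whether or not flox is installed:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

# Default behaviour: flox is used when it is installed.
default_result = arr.groupby("letters").mean("x")

# Force Xarray's own implementation for comparison.
with xr.set_options(use_flox=False):
    fallback_result = arr.groupby("letters").mean("x")
```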

Split#

Let’s create a simple example dataset:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.rand(4, 3))},
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)
arr = ds["foo"]
ds
<xarray.Dataset> Size: 144B
Dimensions: (x: 4, y: 3)
Coordinates:
 * x (x) int64 32B 10 20 30 40
 letters (x) <U1 16B 'a' 'b' 'b' 'a'
Dimensions without coordinates: y
Data variables:
 foo (x, y) float64 96B 0.127 0.9667 0.2605 0.8972 ... 0.543 0.373 0.448

If we group by the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a GroupBy object:

ds.groupby("letters")
<DatasetGroupBy, grouped over 1 grouper(s), 2 groups in total:
 'letters': UniqueGrouper('letters'), 2/2 groups with labels 'a', 'b'>

This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:

ds.groupby("letters").groups
{'a': [0, 3], 'b': [1, 2]}
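The values in `groups` are integer positions along the grouped dimension, so they can be passed to `isel()` to pull out a group by hand. A short sketch on deterministic data (the names `manual_b` and `iterated` are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(12.0).reshape(4, 3))},
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

# Mapping from group label to positions along "x".
groups = ds.groupby("letters").groups

# Select the "b" group manually from its positions.
manual_b = ds.isel(x=groups["b"])

# Collect the groups produced by iteration for comparison.
iterated = {label: group for label, group in ds.groupby("letters")}
```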

You can also iterate over groups in (label, group) pairs:

list(ds.groupby("letters"))
[('a',
 <xarray.Dataset> Size: 72B
 Dimensions: (x: 2, y: 3)
 Coordinates:
 * x (x) int64 16B 10 40
 letters (x) <U1 8B 'a' 'a'
 Dimensions without coordinates: y
 Data variables:
 foo (x, y) float64 48B 0.127 0.9667 0.2605 0.543 0.373 0.448),
 ('b',
 <xarray.Dataset> Size: 72B
 Dimensions: (x: 2, y: 3)
 Coordinates:
 * x (x) int64 16B 20 30
 letters (x) <U1 8B 'b' 'b'
 Dimensions without coordinates: y
 Data variables:
 foo (x, y) float64 48B 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]

You can index out a particular group:

ds.groupby("letters")["b"]
<xarray.Dataset> Size: 72B
Dimensions: (x: 2, y: 3)
Coordinates:
 * x (x) int64 16B 20 30
 letters (x) <U1 8B 'b' 'b'
Dimensions without coordinates: y
Data variables:
 foo (x, y) float64 48B 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231

To group by multiple variables, see Grouping by multiple variables.

Binning#

Sometimes you don’t want to use all the unique values to determine the groups but instead want to "bin" the data into coarser groups. You could always create a customized coordinate, but Xarray facilitates this via the Dataset.groupby_bins() method.

x_bins = [0, 25, 50]
ds.groupby_bins("x", x_bins).groups
{Interval(0, 25, closed='right'): [0, 1],
 Interval(25, 50, closed='right'): [2, 3]}

The binning is implemented via pandas.cut(), whose documentation details how the bins are assigned. As seen in the example above, by default the bins are labeled with pandas Interval objects that precisely identify the bin limits, including which edge is closed. To override this behavior, you can specify the bin labels explicitly. Here we choose float labels which identify the bin centers:

x_bin_labels = [12.5, 37.5]
ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups
{np.float64(12.5): [0, 1], np.float64(37.5): [2, 3]}
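Because the assignment follows `pandas.cut()`, a value that sits exactly on a bin edge falls into the bin that is closed on that side. The sketch below uses made-up data to show the edge behaviour with `pandas.cut()` directly and then a binned reduction with default and explicit labels:

```python
import pandas as pd
import xarray as xr

da = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0], dims="x", coords={"x": [10, 20, 30, 40]}
)
bins = [0, 25, 50]

# With right-closed bins (the default), 25 belongs to (0, 25].
right_closed = pd.cut([25], bins).codes
# With left-closed bins, 25 belongs to [25, 50).
left_closed = pd.cut([25], bins, right=False).codes

# Binned mean; the result is indexed by a new "x_bins" coordinate.
binned_mean = da.groupby_bins("x", bins).mean()

# Same reduction with explicit bin-center labels.
labelled_mean = da.groupby_bins("x", bins, labels=[12.5, 37.5]).mean()
```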

Apply#

To apply a function to each group, you can use the flexible DatasetGroupBy.map() or DataArrayGroupBy.map() method. The resulting objects are automatically concatenated back together along the group axis:

def standardize(x):
    return (x - x.mean()) / x.std()

arr.groupby("letters").map(standardize)
<xarray.DataArray 'foo' (x: 4, y: 3)> Size: 96B
array([[-1.22977845, 1.93741005, -0.72624738],
 [ 1.41979574, -0.46019243, -0.60657867],
 [-0.19064205, 1.21397989, -1.3763625 ],
 [ 0.33941723, -0.30180645, -0.01899499]])
Coordinates:
 * x (x) int64 32B 10 20 30 40
 letters (x) <U1 16B 'a' 'b' 'b' 'a'
Dimensions without coordinates: y
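One way to confirm that `map()` applied the function group by group is to reduce the result per group again: after standardizing, each group should have mean 0 and standard deviation 1. A sketch with seeded random data (the seed and variable names are illustrative):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
arr = xr.DataArray(
    rng.random((4, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    name="foo",
)


def standardize(x):
    # x is the sub-array for one group.
    return (x - x.mean()) / x.std()


standardized = arr.groupby("letters").map(standardize)

# Per-group statistics of the standardized data.
group_means = standardized.groupby("letters").mean(...)
group_stds = standardized.groupby("letters").std(...)
```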

GroupBy objects also have a DatasetGroupBy.reduce() method and methods like DatasetGroupBy.mean() as shortcuts for applying an aggregation function (the DataArrayGroupBy equivalents behave the same way):

arr.groupby("letters").mean(dim="x")
<xarray.DataArray 'foo' (letters: 2, y: 3)> Size: 48B
array([[0.33499802, 0.66986503, 0.35423642],
 [0.6743065 , 0.6085024 , 0.22966194]])
Coordinates:
 * letters (letters) object 16B 'a' 'b'
Dimensions without coordinates: y

Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:

ds.groupby("x").std(...)
<xarray.Dataset> Size: 64B
Dimensions: (x: 4)
Coordinates:
 * x (x) int64 32B 10 20 30 40
Data variables:
 foo (x) float64 32B 0.3684 0.2554 0.2931 0.06957

Note

We use an ellipsis (...) here to indicate that we want to reduce over all other dimensions.
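The difference between naming a dimension and passing an ellipsis shows up in the dimensions of the result. A small sketch on made-up data:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

# Reduce only over "x" within each group: "y" survives.
over_x = arr.groupby("letters").mean("x")

# Reduce over every dimension within each group: only "letters" remains.
over_all = arr.groupby("letters").mean(...)
```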

First and last#

There are two special aggregation operations that are currently only found on GroupBy objects: first and last. These provide the first or last value for each group along the grouped dimension:

ds.groupby("letters").first(...)
<xarray.Dataset> Size: 64B
Dimensions: (letters: 2, y: 3)
Coordinates:
 * letters (letters) object 16B 'a' 'b'
Dimensions without coordinates: y
Data variables:
 foo (letters, y) float64 48B 0.127 0.9667 0.2605 0.8972 0.3767 0.3362

By default, they skip missing values (control this with skipna).
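The effect of `skipna` is easiest to see on a group whose first element is missing. In this sketch (illustrative data), group `a` holds `[nan, 2.0]` and group `b` holds `[1.0, nan]`:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    [np.nan, 1.0, 2.0, np.nan],
    dims="x",
    coords={"letters": ("x", list("abab"))},
)

# Default: missing values are skipped.
first_valid = arr.groupby("letters").first()
last_valid = arr.groupby("letters").last()

# skipna=False: the first element is returned even if it is NaN.
first_raw = arr.groupby("letters").first(skipna=False)
```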

Grouped arithmetic#

GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:

alt = arr.groupby("letters").mean(...)
alt
<xarray.DataArray 'foo' (letters: 2)> Size: 16B
array([0.45303315, 0.50415695])
Coordinates:
 * letters (letters) object 16B 'a' 'b'
ds.groupby("letters") - alt
<xarray.Dataset> Size: 144B
Dimensions: (x: 4, y: 3)
Coordinates:
 * x (x) int64 32B 10 20 30 40
 letters (x) <U1 16B 'a' 'b' 'b' 'a'
Dimensions without coordinates: y
Data variables:
 foo (x, y) float64 96B -0.3261 0.5137 -0.1926 ... -0.08002 -0.005036

This last line is roughly equivalent to the following:

results = []
for label, group in ds.groupby("letters"):
    results.append(group - alt.sel(letters=label))
xr.concat(results, dim="x")
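The word "roughly" matters: the explicit loop concatenates groups in label order, while grouped arithmetic keeps the original order along `x`. The sketch below (seeded random data, illustrative names) sorts the looped result by `x` before comparing the two:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(42)
arr = xr.DataArray(
    rng.random((4, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    name="foo",
)

group_mean = arr.groupby("letters").mean(...)

# Grouped arithmetic: subtract each group's mean from its members.
anomaly = (arr.groupby("letters") - group_mean).transpose("x", "y")

# Explicit loop over groups; drop=True removes the scalar "letters" coordinate.
pieces = [
    group - group_mean.sel(letters=label, drop=True)
    for label, group in arr.groupby("letters")
]
unsorted_x = xr.concat(pieces, dim="x")["x"].values.tolist()
looped = xr.concat(pieces, dim="x").sortby("x").transpose("x", "y")
```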

Multidimensional Grouping#

Many datasets have a multidimensional coordinate variable (e.g. longitude) which is different from the logical grid dimensions (e.g. nx, ny). Such variables are valid under the CF conventions. Xarray supports groupby operations over multidimensional coordinate variables:

da = xr.DataArray(
    [[0, 1], [2, 3]],
    coords={
        "lon": (["ny", "nx"], [[30, 40], [40, 50]]),
        "lat": (["ny", "nx"], [[10, 10], [20, 20]]),
    },
    dims=["ny", "nx"],
)
da
<xarray.DataArray (ny: 2, nx: 2)> Size: 32B
array([[0, 1],
 [2, 3]])
Coordinates:
 lon (ny, nx) int64 32B 30 40 40 50
 lat (ny, nx) int64 32B 10 10 20 20
Dimensions without coordinates: ny, nx
da.groupby("lon").sum(...)
<xarray.DataArray (lon: 3)> Size: 24B
array([0, 3, 3])
Coordinates:
 * lon (lon) int64 24B 30 40 50
da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)
<xarray.DataArray (ny: 2, nx: 2)> Size: 32B
array([[ 0. , -0.5],
 [ 0.5, 0. ]])
Coordinates:
 lon (ny, nx) int64 32B 30 40 40 50
 lat (ny, nx) int64 32B 10 10 20 20
Dimensions without coordinates: ny, nx
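Grouping by a two-dimensional coordinate collects every cell that shares a coordinate value, regardless of where it sits on the grid. As a cross-check, the same sums can be computed with plain NumPy boolean masks (the `expected` dictionary is an illustrative helper):

```python
import numpy as np
import xarray as xr

lon = np.array([[30, 40], [40, 50]])
da = xr.DataArray(
    [[0, 1], [2, 3]],
    dims=["ny", "nx"],
    coords={"lon": (["ny", "nx"], lon)},
)

# Sum all cells sharing the same longitude value.
grouped = da.groupby("lon").sum(...)

# NumPy equivalent: mask the flattened grid for each unique value.
expected = {int(v): int(da.values[lon == v].sum()) for v in np.unique(lon)}
```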

Because grouping by a multidimensional variable can generate a very large number of groups, coarse binning via Dataset.groupby_bins() may be desirable:

da.groupby_bins("lon", [0, 45, 50]).sum()
<xarray.DataArray (lon_bins: 2)> Size: 16B
array([3, 3])
Coordinates:
 * lon_bins (lon_bins) interval[int64, right] 32B (0, 45] (45, 50]

These methods group by lon values. It is also possible to group by each cell in a grid, regardless of value, by stacking multiple dimensions, applying your function, and then unstacking the result:

stacked = da.stack(gridcell=["ny", "nx"])
stacked.groupby("gridcell").sum(...).unstack("gridcell")
<xarray.DataArray (ny: 2, nx: 2)> Size: 32B
array([[0, 1],
 [2, 3]])
Coordinates:
 * ny (ny) int64 16B 0 1
 * nx (nx) int64 16B 0 1

Alternatively, you can group by both lat and lon at the same time.

Grouper Objects#

Both groupby_bins and resample are specializations of the core groupby operation, for binning and time resampling respectively. Many problems demand more complex GroupBy applications: for example, grouping by multiple variables with a combination of categorical grouping, binning, and resampling; further specializations such as spatial resampling; or more complex time grouping such as special handling of seasons or the ability to specify custom seasons. To handle these use cases and more, Xarray is evolving to provide an extension point using Grouper objects.

Tip

See the grouper design doc for more detail on the motivation and design ideas behind Grouper objects.

For now Xarray provides three specialized Grouper objects:

  1. groupers.UniqueGrouper for categorical grouping

  2. groupers.BinGrouper for binned grouping

  3. groupers.TimeResampler for resampling along a datetime coordinate

These provide functionality identical to the existing groupby, groupby_bins, and resample methods. That is,

ds.groupby("x")

is identical to

from xarray.groupers import UniqueGrouper
ds.groupby(x=UniqueGrouper())

Similarly,

ds.groupby_bins("x", bins=bins)

is identical to

from xarray.groupers import BinGrouper
ds.groupby(x=BinGrouper(bins))

and

ds.resample(time="ME")

is identical to

from xarray.groupers import TimeResampler
ds.resample(time=TimeResampler("ME"))

The groupers.UniqueGrouper accepts an optional labels kwarg that is not present in DataArray.groupby() or Dataset.groupby(). Specifying labels is required when grouping by a lazy array type (e.g. dask or cubed). The labels are used to construct the output coordinate (say for a reduction), and aggregations will only be run over the specified labels. You may also use labels to specify the ordering of groups to be used during iteration. The order will be preserved in the output.
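The stated equivalences can be checked directly by running the same reduction through both spellings. This sketch assumes an Xarray version that ships the `xarray.groupers` module; the data and variable names are illustrative:

```python
import numpy as np
import xarray as xr
from xarray.groupers import BinGrouper, UniqueGrouper

ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(12.0).reshape(4, 3))},
    coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)

# Categorical grouping: by name versus an explicit UniqueGrouper.
by_name = ds.groupby("letters").mean()
by_grouper = ds.groupby(letters=UniqueGrouper()).mean()

# Binned grouping: groupby_bins versus an explicit BinGrouper.
bins = [0, 25, 50]
by_bins_method = ds.groupby_bins("x", bins).mean()
by_bin_grouper = ds.groupby(x=BinGrouper(bins=bins)).mean()
```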

Grouping by multiple variables#

Use grouper objects to group by multiple dimensions:

from xarray.groupers import UniqueGrouper
da.groupby(["lat", "lon"]).sum()
<xarray.DataArray (lat: 2, lon: 3)> Size: 48B
array([[ 0., 1., nan],
 [nan, 2., 3.]])
Coordinates:
 * lat (lat) int64 16B 10 20
 * lon (lon) int64 24B 30 40 50

The above is sugar for using UniqueGrouper objects directly:

da.groupby(lat=UniqueGrouper(), lon=UniqueGrouper()).sum()
<xarray.DataArray (lat: 2, lon: 3)> Size: 48B
array([[ 0., 1., nan],
 [nan, 2., 3.]])
Coordinates:
 * lat (lat) int64 16B 10 20
 * lon (lon) int64 24B 30 40 50

Different groupers can be combined to construct sophisticated GroupBy operations.

from xarray.groupers import BinGrouper
ds.groupby(x=BinGrouper(bins=[5, 15, 25]), letters=UniqueGrouper()).sum()
<xarray.Dataset> Size: 144B
Dimensions: (y: 3, x_bins: 2, letters: 2)
Coordinates:
 * x_bins (x_bins) interval[int64, right] 32B (5, 15] (15, 25]
 * letters (letters) object 16B 'a' 'b'
Dimensions without coordinates: y
Data variables:
 foo (y, x_bins, letters) float64 96B 0.127 nan nan ... nan nan 0.3362

Time Grouping and Resampling#

Shuffling#

Shuffling is a generalization of sorting a DataArray or Dataset by another DataArray (named label, for example) that follows from the idea of grouping by label. Shuffling reorders the DataArray, or the DataArrays in a Dataset, such that all members of a group occur sequentially. Shuffle the object using the shuffle_to_chunks() method of DatasetGroupBy or DataArrayGroupBy as appropriate. For example:

da = xr.DataArray(
    dims="x",
    data=[1, 2, 3, 4, 5, 6],
    coords={"label": ("x", "a b c a b c".split(" "))},
)
da.groupby("label").shuffle_to_chunks()
<xarray.DataArray (x: 6)> Size: 48B
array([1, 4, 2, 5, 3, 6])
Coordinates:
 label (x) <U1 24B 'a' 'a' 'b' 'b' 'c' 'c'
Dimensions without coordinates: x

For chunked array types (e.g. dask or cubed), shuffle may result in a more optimized communication pattern when compared to direct indexing by the appropriate indexer. Shuffling also makes GroupBy operations on chunked arrays an embarrassingly parallel problem, and may significantly improve workloads that use DatasetGroupBy.map() or DataArrayGroupBy.map().
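For in-memory data, the reordering that shuffling produces can be pictured as a stable sort by label. The sketch below reproduces the ordering from the example with `numpy.argsort` and `isel()`; it illustrates the result only and is not a description of how `shuffle_to_chunks()` is implemented:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    dims="x",
    data=[1, 2, 3, 4, 5, 6],
    coords={"label": ("x", "a b c a b c".split(" "))},
)

# A stable sort keeps the original order of members within each group.
order = np.argsort(da["label"].values, kind="stable")
reordered = da.isel(x=order)
```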