You can run this notebook in a live session Binder or view it on Github.

Toy weather data#

Here is an example of how to easily manipulate a toy weather dataset using xarray and other recommended Python libraries:

[1]:

importnumpyasnp
importpandasaspd
importseabornassns
importxarrayasxr
%matplotlib inline

[2]:

np.random.seed(123)
xr.set_options(display_style="html")
times = pd.date_range("2000年01月01日", "2001年12月31日", name="time")
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))
base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)
ds = xr.Dataset(
 {
 "tmin": (("time", "location"), tmin_values),
 "tmax": (("time", "location"), tmax_values),
 },
 {"time": times, "location": ["IA", "IN", "IL"]},
)
ds

[2]:

<xarray.Dataset> Size: 41kB
Dimensions: (time: 731, location: 3)
Coordinates:
 * time (time) datetime64[us] 6kB 2000年01月01日 2000年01月02日 ... 2001年12月31日
 * location (location) <U2 24B 'IA' 'IN' 'IL'
Data variables:
 tmin (time, location) float64 18kB -8.037 -1.788 ... -1.346 -4.544
 tmax (time, location) float64 18kB 12.98 3.31 6.779 ... 3.343 3.805

xarray.Dataset

Dimensions:
- time: 731
- location: 3

Coordinates: (2)

time

(time)

datetime64[us]

2000年01月01日 ... 2001年12月31日

array(['2000年01月01日T00:00:00.000000', '2000年01月02日T00:00:00.000000',
 '2000年01月03日T00:00:00.000000', ..., '2001年12月29日T00:00:00.000000',
 '2001年12月30日T00:00:00.000000', '2001年12月31日T00:00:00.000000'],
 shape=(731,), dtype='datetime64[us]')

location
(location)
<U2
'IA' 'IN' 'IL'
```
array(['IA', 'IN', 'IL'], dtype='<U2')
```

Data variables: (2)

tmin

(time, location)

float64

-8.037 -1.788 ... -1.346 -4.544

array([[ -8.03736932, -1.78844117, -3.93154201],
 [ -9.34115662, -6.55807323, 0.13203714],
 [-12.13971902, -6.14641918, -1.06187252],
 ...,
 [ -5.34723825, -13.37459826, -4.93221199],
 [ -2.67283594, -5.18072141, -4.11567869],
 [ 2.06327582, -1.34576404, -4.54392729]], shape=(731, 3))

tmax

(time, location)

float64

12.98 3.31 6.779 ... 3.343 3.805

array([[12.98054898, 3.31040942, 6.77855382],
 [ 0.44785582, 6.37271154, 4.8434966 ],
 [ 5.32269851, 6.25176289, 5.98033045],
 ...,
 [ 6.73078492, 7.74795302, 8.04569651],
 [ 6.46376911, 6.31695352, 1.55799171],
 [ 6.63593435, 3.34271537, 3.80527925]], shape=(731, 3))

Examine a dataset with pandas and seaborn#

Convert to a pandas DataFrame#

[3]:

df = ds.to_dataframe()
df.head()

[3]:

tmin	tmax
time	location
2000年01月01日	IA	-8.037369	12.980549
	IN	-1.788441	3.310409
	IL	-3.931542	6.778554
2000年01月02日	IA	-9.341157	0.447856
2000年01月02日	IN	-6.558073	6.372712

[4]:

df.describe()

[4]:

tmin	tmax
count	2193.000000	2193.000000
mean	9.975426	20.108232
std	10.963228	11.010569
min	-13.395763	-3.506234
25%	-0.040347	9.853905
50%	10.060403	19.967409
75%	20.083590	30.045588
max	33.456060	43.271148

Visualize using pandas#

[5]:

ds.mean(dim="location").to_dataframe().plot()

[5]:

<Axes: xlabel='time'>

../_images/examples_weather-data_8_1.png

Visualize using seaborn#

[6]:

sns.pairplot(df.reset_index(), vars=ds.data_vars)

[6]:

<seaborn.axisgrid.PairGrid at 0x7805a9a14c20>

../_images/examples_weather-data_10_1.png

Probability of freeze by calendar month#

[7]:

freeze = (ds["tmin"] <= 0).groupby("time.month").mean("time")
freeze

[7]:

<xarray.DataArray 'tmin' (month: 12, location: 3)> Size: 288B
array([[0.9516129 , 0.88709677, 0.93548387],
 [0.84210526, 0.71929825, 0.77192982],
 [0.24193548, 0.12903226, 0.16129032],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0.01612903, 0. ],
 [0.33333333, 0.35 , 0.23333333],
 [0.93548387, 0.85483871, 0.82258065]])
Coordinates:
 * month (month) int64 96B 1 2 3 4 5 6 7 8 9 10 11 12
 * location (location) <U2 24B 'IA' 'IN' 'IL'

xarray.DataArray

'tmin'

month: 12
location: 3

0.9516 0.8871 0.9355 0.8421 0.7193 ... 0.2333 0.9355 0.8548 0.8226

array([[0.9516129 , 0.88709677, 0.93548387],
 [0.84210526, 0.71929825, 0.77192982],
 [0.24193548, 0.12903226, 0.16129032],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0. , 0. ],
 [0. , 0.01612903, 0. ],
 [0.33333333, 0.35 , 0.23333333],
 [0.93548387, 0.85483871, 0.82258065]])

Coordinates: (2)
- month
  (month)
  int64
  1 2 3 4 5 6 7 8 9 10 11 12
```
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
```
- location
  (location)
  <U2
  'IA' 'IN' 'IL'
```
array(['IA', 'IN', 'IL'], dtype='<U2')
```

[8]:

freeze.to_pandas().plot()

[8]:

<Axes: xlabel='month'>

../_images/examples_weather-data_13_1.png

Monthly averaging#

[9]:

monthly_avg = ds.resample(time="1MS").mean()
monthly_avg.sel(location="IA").to_dataframe().plot(style="s-")

[9]:

<Axes: xlabel='time'>

../_images/examples_weather-data_15_1.png

Note that MS here refers to Month-Start; M labels Month-End (the last day of the month).

Calculate monthly anomalies#

In climatology, "anomalies" refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.

[10]:

climatology = ds.groupby("time.month").mean("time")
anomalies = ds.groupby("time.month") - climatology
anomalies.mean("location").to_dataframe()[["tmin", "tmax"]].plot()

[10]:

<Axes: xlabel='time'>

../_images/examples_weather-data_19_1.png

Calculate standardized monthly anomalies#

You can create standardized anomalies where the difference between the observations and the climatological monthly mean is divided by the climatological standard deviation.

[11]:

climatology_mean = ds.groupby("time.month").mean("time")
climatology_std = ds.groupby("time.month").std("time")
stand_anomalies = xr.apply_ufunc(
 lambda x, m, s: (x - m) / s,
 ds.groupby("time.month"),
 climatology_mean,
 climatology_std,
)
stand_anomalies.mean("location").to_dataframe()[["tmin", "tmax"]].plot()

[11]:

<Axes: xlabel='time'>

../_images/examples_weather-data_22_1.png

Fill missing values with climatology#

The fillna method on grouped objects lets you easily fill missing values by group:

[12]:

# throw away the first half of every month
some_missing = ds.tmin.sel(time=ds["time.day"] > 15).reindex_like(ds)
filled = some_missing.groupby("time.month").fillna(climatology.tmin)
both = xr.Dataset({"some_missing": some_missing, "filled": filled})
both

[12]:

<xarray.Dataset> Size: 47kB
Dimensions: (time: 731, location: 3)
Coordinates:
 * time (time) datetime64[us] 6kB 2000年01月01日 2000年01月02日 ... 2001年12月31日
 month (time) int64 6kB 1 1 1 1 1 1 1 1 1 ... 12 12 12 12 12 12 12 12
 * location (location) <U2 24B 'IA' 'IN' 'IL'
Data variables:
 some_missing (time, location) float64 18kB nan nan nan ... -1.346 -4.544
 filled (time, location) float64 18kB -5.163 -4.216 ... -1.346 -4.544

xarray.Dataset

Dimensions:
- time: 731
- location: 3

Coordinates: (3)

time

(time)

datetime64[us]

2000年01月01日 ... 2001年12月31日

array(['2000年01月01日T00:00:00.000000', '2000年01月02日T00:00:00.000000',
 '2000年01月03日T00:00:00.000000', ..., '2001年12月29日T00:00:00.000000',
 '2001年12月30日T00:00:00.000000', '2001年12月31日T00:00:00.000000'],
 shape=(731,), dtype='datetime64[us]')

month

(time)

int64

1 1 1 1 1 1 1 ... 12 12 12 12 12 12

array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6,
 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7,
 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8,
 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11,
 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12,
...
 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9,
 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10,
 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12])

location
(location)
<U2
'IA' 'IN' 'IL'
```
array(['IA', 'IN', 'IL'], dtype='<U2')
```

Data variables: (2)

some_missing

(time, location)

float64

nan nan nan ... 2.063 -1.346 -4.544

array([[ nan, nan, nan],
 [ nan, nan, nan],
 [ nan, nan, nan],
 ...,
 [ -5.34723825, -13.37459826, -4.93221199],
 [ -2.67283594, -5.18072141, -4.11567869],
 [ 2.06327582, -1.34576404, -4.54392729]], shape=(731, 3))

filled

(time, location)

float64

-5.163 -4.216 ... -1.346 -4.544

array([[ -5.16274935, -4.21616663, -4.68137385],
 [ -5.16274935, -4.21616663, -4.68137385],
 [ -5.16274935, -4.21616663, -4.68137385],
 ...,
 [ -5.34723825, -13.37459826, -4.93221199],
 [ -2.67283594, -5.18072141, -4.11567869],
 [ 2.06327582, -1.34576404, -4.54392729]], shape=(731, 3))

[13]:

df = both.sel(time="2000").mean("location").reset_coords(drop=True).to_dataframe()
df.head()

[13]:

some_missing	filled
time
2000年01月01日	NaN	-4.686763
2000年01月02日	NaN	-4.686763
2000年01月03日	NaN	-4.686763
2000年01月04日	NaN	-4.686763
2000年01月05日	NaN	-4.686763

[14]:

df[["filled", "some_missing"]].plot()

[14]:

<Axes: xlabel='time'>

../_images/examples_weather-data_27_1.png