Getting Seasonal Means for the season December-January-February#

We’ve been getting occasionally a few questions about seasonal reduction operations (for example means).

Say you have a dataset that goes on for several years, and you want to compare the seaonal means of each year against each other.

For three of the four seasons this isn’t a big deal, you cut the dataset into yearly chunks, then mean over the season.

But for the fourth season (DJF) this doesn’t work: It bunches together the January, February, and December of that same year, when in fact you want the December from the year before.

Let’s have a look at an overly simplified example.

Our example dataset#

We only use a single dimension of data, and we only use monthly values. What’s more, the values of our “dataset” are strictly ascending.

This keeps the dataset small and we can easily see what’s going on.

The code works just the same way with far more complex datasets

import xarray as xr
import numpy as np
import pandas as pd
time = xr.DataArray(
    pd.date_range(start='2000-01-01', end='2004-12-31', freq='MS'), 
    dims=('time',)
)

data = xr.DataArray(
    np.arange(len(time)), 
    coords={'time': time},
)
data
<xarray.DataArray (time: 60)>
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-02-01 ... 2004-12-01

Initial idea: groupby#

The .groupby method can combine values according to a specific option, and the season is one of these. Let’s try

data.groupby(data.time.dt.season).mean(dim='time')
<xarray.DataArray (season: 4)>
array([28., 30., 27., 33.])
Coordinates:
  * season   (season) object 'DJF' 'JJA' 'MAM' 'SON'

Now this isn’t what we wanted. It has basically averaged over all Decembers, Januarys, and Februaries and stored this as a single value for “DJF”, and the same for “MAM”, “JJA”, and “SON”.

Note that since this is a one-dimensional array, we could have omitted the dim='time' option from the call to mean() but you will probably also have other dimensions, and this way we can guarantee that it will only collapse the time dimension.

Next idea: Resample#

The .resample method changes the frequency of the dataset, and allows for reduction methods to be applied.

data.resample(time='QS').mean(dim='time')
<xarray.DataArray (time: 20)>
array([ 1.,  4.,  7., 10., 13., 16., 19., 22., 25., 28., 31., 34., 37.,
       40., 43., 46., 49., 52., 55., 58.])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-04-01 ... 2004-10-01

We are now much closer. We have quartly means, except that the values don’t correspond to seasons. The solution are Panda’s Anchored Offsets

Basically we can move the start and end of a period around a bit.

You can read more about it here. Quarterly periods use the period letter Q, by default, the time coordinate will be placed at the end of the period. So a single Q will combine January through March, and give it the time coordinate of March 31st.

The first thing we want is to have the time coordinate as the beginning of the period, we do this by appending an S: QS has the same time periods, but places the coordinate at the beginning of the period.

Finally, we want to start the periods in December, so the term we use is QS-DEC. (Note, QS-MAR, QS-JUN, and QS-SEP would have had the exact same effect.)

season_means = data.resample(time='QS-DEC').mean(dim='time')
season_means
<xarray.DataArray (time: 21)>
array([ 0.5,  3. ,  6. ,  9. , 12. , 15. , 18. , 21. , 24. , 27. , 30. ,
       33. , 36. , 39. , 42. , 45. , 48. , 51. , 54. , 57. , 59. ])
Coordinates:
  * time     (time) datetime64[ns] 1999-12-01 2000-03-01 ... 2004-12-01

Extracting the season we’re interested in.#

You might notice that I’ve already assigned the seasonal means to a new variable. I did this because in order to extract just the DJF season we’re interested in, I need the condensed time dimension:

season_means.sel(time=season_means.time.dt.season=='DJF')
<xarray.DataArray (time: 6)>
array([ 0.5, 12. , 24. , 36. , 48. , 59. ])
Coordinates:
  * time     (time) datetime64[ns] 1999-12-01 2000-12-01 ... 2004-12-01

This is the solution. 6 Values, starting with the mean of Jan/Feb 2000, then the mean of Dec 2000 and Jan and Feb 2001, and so on, until the final value is just the mean of Dec 2004.

More flexibility using time masks#

We were lucky above that the seasons are 3 months per season, and the quarterly periods worked so well.

Starting by creating a mask of the values we want. For month-based masks, it makes sense to use the data.time.dt.month. In this case we could simply use the data.time.dt.month.isin([1, 2, 12]) but I want to show how to combine masks using the or operator |.

Then we drop all the values that are not in this region, and finally we use the resampling method again, this time with ‘Annual Ending February’: A-FEB

(Note that we would have gotten the same values for any ‘end’ month outside of the season, but the time coordinate would then point to a different month, and it would be confusing.)

DJF_mask = (data.time.dt.month == 12) | (data.time.dt.month <= 2)
data_DJF = data.where(DJF_mask, drop=True)
data_DJF = data_DJF.resample(time='A-FEB').mean(dim='time')
data_DJF
<xarray.DataArray (time: 6)>
array([ 0.5, 12. , 24. , 36. , 48. , 59. ])
Coordinates:
  * time     (time) datetime64[ns] 2000-02-29 2001-02-28 ... 2005-02-28

I have deliberatly chosen to have the period defined by their ending date, as I want to replace the time coordinate with a year coordinate, and I want the season to be defined by the year that January is in, which is the same year that the February is in.

I can now use the assign_coords and swap_dims methods to change the dimension:

data_DJF = data_DJF.assign_coords(year=data_DJF.time.dt.year)
data_DJF = data_DJF.swap_dims({'time': 'year'})
data_DJF
<xarray.DataArray (year: 6)>
array([ 0.5, 12. , 24. , 36. , 48. , 59. ])
Coordinates:
    time     (year) datetime64[ns] 2000-02-29 2001-02-28 ... 2005-02-28
  * year     (year) int64 2000 2001 2002 2003 2004 2005

If you had wanted the year coordinate to reflect the year that December is in, it’s easiest to replace the resample period from above from A-FEB (Annual Ending February) to AS-DEC (Annual Starting December):

data_DJF = data_DJF.resample(time='AS-DEC').mean(dim='time')

Then the time coordinate of the resulting array would refer to December 1st, and therefore the year would be that of the December.