Find the value of a variable at the times when another variable is maximum

Claire Carouge

Let's say you have two variables that vary in space and time. You can easily calculate the maximum of one variable across time for each spatial point. Now you would like to know the value of the second variable at the time the first reaches its maximum, for each spatial point.

As we shall see, this is relatively easy with two variables of the same rank: it is what the where() function is designed for. Depending on which Python package you are using, the problem can be more complicated when the variables have different ranks. In this post, we will look at both scenarios.

We first need to import some packages, read in some data and calculate the maximum in time at all spatial points. As usual, I'm going to use CMIP data as it's easily accessible.

In [1]:
%matplotlib inline
import xarray as xr
import numpy as np

In [2]:
# Let's get the 2m temperature and the sensible heat flux.
# We do not decode the time units,
# as datetime objects can't be plotted.
ds = xr.open_dataset('/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r1i1p1/latest/tas/tas_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc',
decode_times=False)
ds1 = xr.open_dataset('/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r1i1p1/latest/hfss/hfss_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc',
decode_times=False)
tas = ds.tas
hfss = ds1.hfss

In [3]:
tas

Out[3]:
<xarray.DataArray 'tas' (time: 1872, lat: 145, lon: 192)>
[52116480 values with dtype=float32]
Coordinates:
* time     (time) float64 6.753e+05 6.754e+05 ... 7.323e+05 7.323e+05
* lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
* lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
height   float64 ...
Attributes:
standard_name:     air_temperature
long_name:         Near-Surface Air Temperature
units:             K
cell_methods:      time: mean
cell_measures:     area: areacella
history:           2012-01-18T23:37:46Z altered by CMOR: Treated scalar d...
associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
In [4]:
hfss

Out[4]:
<xarray.DataArray 'hfss' (time: 1872, lat: 145, lon: 192)>
[52116480 values with dtype=float32]
Coordinates:
* time     (time) float64 6.753e+05 6.754e+05 ... 7.323e+05 7.323e+05
* lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
* lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
Attributes:
standard_name:     surface_upward_sensible_heat_flux
long_name:         Surface Upward Sensible Heat Flux
units:             W m-2
cell_methods:      time: mean
cell_measures:     area: areacella
history:           2012-01-15T11:36:06Z altered by CMOR: replaced missing...
associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
In [5]:
# Calculate the maximum temperature at each spatial point along the time axis:
tas_max=tas.max('time')
tas_max

Out[5]:
<xarray.DataArray 'tas' (lat: 145, lon: 192)>
array([[258.12787, 258.12787, 258.12787, ..., 258.12787, 258.12787, 258.12787],
[259.0565 , 259.0456 , 259.02908, ..., 259.0787 , 259.07742, 259.07217],
[259.79144, 259.71857, 259.64832, ..., 260.0159 , 259.9404 , 259.86655],
...,
[273.00952, 273.0114 , 273.0609 , ..., 273.01083, 273.00726, 273.0072 ],
[273.2563 , 273.26328, 273.27258, ..., 273.25238, 273.25394, 273.25162],
[273.18732, 273.18732, 273.18732, ..., 273.18732, 273.18732, 273.18732]],
dtype=float32)
Coordinates:
* lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
* lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
height   float64 ...

Variables of the same dimensionality

Now we want to find the values of the sensible heat flux when the 2m temperature is maximum.

In [6]:
# Get the values of the sensible heat flux when temperature is maximum
hfss_at_max = hfss.where(tas == tas_max)
hfss_at_max.mean('time').plot(size=8)
hfss_at_max

CPU times: user 942 ms, sys: 3.44 s, total: 4.38 s
Wall time: 4.78 s
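
The where() result is still three-dimensional and mostly NaN. If a single 2D field is wanted, one possible alternative (not the approach used in this post) is to locate the index of the maximum with argmax() and use it for pointwise indexing with isel(). A minimal sketch on synthetic data follows; toy_tas and toy_hfss are hypothetical stand-ins for tas and hfss:

```python
import numpy as np
import xarray as xr

# Small synthetic stand-ins for tas and hfss (hypothetical values, not CMIP data).
rng = np.random.default_rng(0)
dims = ("time", "lat", "lon")
toy_tas = xr.DataArray(rng.random((5, 2, 3)), dims=dims)
toy_hfss = xr.DataArray(rng.random((5, 2, 3)), dims=dims)

# Index along time of the (first) maximum of toy_tas at each grid point.
idx_max = toy_tas.argmax("time")

# Pointwise indexing: one value of toy_hfss per grid point.
toy_hfss_at_max = toy_hfss.isel(time=idx_max)

# For comparison: collapse the NaN-padded where() result along time.
via_where = toy_hfss.where(toy_tas == toy_tas.max("time")).max("time")
```

The two results agree here because each grid point has a single maximum. If the maximum is reached at several time steps, argmax() keeps only the first one, while where() keeps them all.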


Variables of different dimensionality

How easy this is depends on the type of object you are using: it is straightforward with xarray DataArrays, and less so with numpy arrays.

To illustrate this, we are going to look at the problem of finding at what times the 2m temperature is maximum at each point.

Solution with xarray

In this case, it is as simple as the previous case with variables of the same rank. Note that the result is a full 3D array with lots of missing values (NaN). There are only values when tas is at a maximum for that point in space.

In [7]:
time_at_max=tas.time.where(tas == tas_max)
time_at_max.mean('time').plot(size=8)
time_at_max

Out[7]:
<xarray.DataArray (time: 1872, lat: 145, lon: 192)>
array([[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

...,

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]]])
Coordinates:
* time     (time) float64 6.753e+05 6.754e+05 ... 7.323e+05 7.323e+05
height   float64 1.5
* lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
* lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
Attributes:
bounds:         time_bnds
units:          days since 0001-01-01
calendar:       proleptic_gregorian
axis:           T
long_name:      time
standard_name:  time

Note: The time unit is the number of days since 0001-01-01, so the values would have to be converted to a more usable format for scientific use.
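
As a rough illustration of such a conversion, the sketch below uses only the Python standard library, whose datetime type follows the proleptic Gregorian calendar named in the attributes above. The helper days_to_datetime is hypothetical, and the input value is an illustrative number of the same magnitude as the time axis, not a value read from the file:

```python
from datetime import datetime, timedelta

# Reference date taken from the units attribute: "days since 0001-01-01".
REFERENCE = datetime(1, 1, 1)

def days_to_datetime(days):
    """Convert days since 0001-01-01 to a datetime.

    NaN values must be filtered out first: timedelta() rejects them.
    """
    return REFERENCE + timedelta(days=float(days))

# Illustrative value only (assumed, not read from the dataset).
first_step = days_to_datetime(675348.5)
print(first_step)  # 1850-01-16 12:00:00
```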

Solution for numpy arrays

We'll use tas and tas.time again but without using the xarray built-in methods. If I try the numpy equivalent solution:

In [8]:
nptime_at_max = np.where(tas == tas_max, tas.time, np.nan)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-027b2651c8a9> in <module>()
----> 1 nptime_at_max = np.where(tas == tas_max, tas.time, np.nan)

ValueError: operands could not be broadcast together with shapes (1872,145,192) (1872,) () 

This does not work because a where() operation needs all its arguments to have the same shape, or to be expandable to it. When provided with arrays of different shapes (as in this case), both numpy and xarray try to make them conform with each other by expanding the smaller arrays to the shape of the largest one. The values of the smaller arrays are copied along all missing dimensions. This process is called broadcasting.

The problem is that numpy is quite conservative in its broadcasting rules and cannot perform it in the case above. xarray is much better at broadcasting, as it uses the dimension names stored in each DataArray to match up the dimensions of the arrays. That is the reason why DataArray.where() worked above.
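
For completeness, the broadcast can also be done by hand in pure numpy. NumPy lines shapes up starting from the last axis, so a 1D time array is compared with the longitude axis and fails. Giving the time array two trailing axes of length 1 lets it broadcast. The sketch below uses toy arrays with hypothetical shapes, not the CMIP data:

```python
import numpy as np

# Toy stand-ins (hypothetical shapes): 4 time steps on a 2 x 3 grid.
toy_tas = np.arange(24, dtype=float).reshape(4, 2, 3)
toy_time = np.array([10.0, 20.0, 30.0, 40.0])
toy_max = toy_tas.max(axis=0)

# Shape (4,) cannot broadcast against (4, 2, 3): 4 is matched with the
# last axis (3). Shape (4, 1, 1) can, as each length-1 axis is stretched.
time_3d = toy_time[:, np.newaxis, np.newaxis]

toy_time_at_max = np.where(toy_tas == toy_max, time_3d, np.nan)
```

This works, but it relies on knowing which axis is time, which is exactly the bookkeeping that named dimensions take care of.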

If you have numpy arrays, the simplest solution is to wrap them in xarray DataArrays. You only need to name the dimensions. Any name will do, as long as the dimensions shared between arrays are given the same name:

In [9]:
# DataArray.values returns only the numpy array of values, with none of the metadata stored in the DataArray
new_tas = xr.DataArray(tas.values, dims=('t','l','L'))
new_time = xr.DataArray(tas.time.values, dims='t')
new_max = new_tas.max(dim='t')
new_time_at_max = new_time.where(new_tas == new_max)
new_time_at_max

Out[9]:
<xarray.DataArray (t: 1872, l: 145, L: 192)>
array([[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

...,

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]]])
Dimensions without coordinates: t, l, L

As you can see, you can keep your values and the time in separate arrays (new_tas and new_time). You don't need to add the time array as a coordinate of the 3D array, although doing so is generally a good idea as it keeps the data self-describing.
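
A minimal sketch of the self-describing variant, where the time values are attached as a coordinate when the DataArray is built. The arrays and values here are hypothetical toy data:

```python
import numpy as np
import xarray as xr

# Toy stand-ins (hypothetical values); dimension names are arbitrary.
values = np.arange(24, dtype=float).reshape(4, 2, 3)
times = np.array([10.0, 20.0, 30.0, 40.0])

# Attach the time values as the coordinate of dimension 't'.
toy = xr.DataArray(values, dims=("t", "l", "L"), coords={"t": times})

# The coordinate is itself a DataArray, so where() broadcasts as before.
toy_time_at_max = toy["t"].where(toy == toy.max(dim="t"))
```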

Extension

The other advantage of using xarray is that it makes it much easier to go beyond the built-in functions. The idea here is to use the groupby().apply() workflow to apply a user-defined function. You could use it as a solution to the current problem:

In [10]:
def check_max(data):
    # Keep the time value where this time step holds the maximum, NaN elsewhere
    return np.where(data == tas_max, data.time, np.nan)

tasmax_dates = tas.groupby('time').apply(check_max)
tasmax_dates.mean('time').plot(size=8)
tasmax_dates

Out[10]:
<xarray.DataArray 'tas' (time: 1872, lat: 145, lon: 192)>
array([[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

...,

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]],

[[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan],
...,
[nan, nan, ..., nan, nan],
[nan, nan, ..., nan, nan]]])
Coordinates:
* lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
* lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
height   float64 ...
* time     (time) float64 6.753e+05 6.754e+05 ... 7.323e+05 7.323e+05

But this is slower than using the built-in functions directly, and it doesn't keep the attributes of the time coordinate (bounds, units, calendar, etc.). It is best to keep this approach for more complex problems that cannot easily be solved otherwise.
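
If the attributes are needed after such a custom function, one simple option is to copy them back by hand. This is only a sketch on hypothetical toy arrays; the attribute values are those shown in the time coordinate output earlier:

```python
import numpy as np
import xarray as xr

# Toy time coordinate carrying metadata (hypothetical values).
toy_time = xr.DataArray(
    [10.0, 20.0],
    dims="time",
    attrs={"units": "days since 0001-01-01", "calendar": "proleptic_gregorian"},
)

# Stand-in for a result that came back without attributes.
result = xr.DataArray(np.array([np.nan, 20.0]), dims="time")

# attrs is a plain dictionary, so it can be updated from another one.
result.attrs.update(toy_time.attrs)
```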