Basic event statistics

Scott Wales, CLEX CMS

We have a set of timeseries from multiple models, and we’d like to compute some basic statistics on where the values exceed some threshold

import numpy
import xarray
import pandas

Sample data

Some random data as a sample, we have dimensions of ‘model’ and ‘time’

da = xarray.DataArray(numpy.random.random((5,10))-0.5, dims=['model','time'])
da['time'] = pandas.date_range('20010101', periods=da.sizes['time'], freq='MS')
da['model'] = ['A','B','C','D','E']
da
<xarray.DataArray (model: 5, time: 10)>
array([[-0.32756177, -0.26286746,  0.40030123,  0.35854755,  0.43113588,
         0.19164228, -0.24890014, -0.16822426,  0.0288118 ,  0.30851645],
       [ 0.02352896,  0.01700063,  0.40107743,  0.09688021, -0.34731748,
         0.33646959, -0.22218415,  0.00075536,  0.41923523,  0.46450296],
       [-0.36606797,  0.13691949, -0.24407689,  0.26593951, -0.36044558,
        -0.09755949,  0.23047747, -0.30123992,  0.49773007, -0.48146704],
       [-0.28529412, -0.4137749 , -0.49678032, -0.20963922, -0.37296851,
         0.32198888, -0.48008865, -0.35185213, -0.19133728,  0.48969387],
       [ 0.35537076, -0.34188436, -0.46208259,  0.41417063, -0.07029039,
        -0.39533084,  0.21379516, -0.18512164, -0.43433092, -0.29407475]])
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 2001-02-01 ... 2001-10-01
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

Count below threshold for each model

First select only the values below our threshold, then get the count of unmasked values along the time dimension

da.where(da < 0).count('time')
<xarray.DataArray (model: 5)>
array([4, 2, 6, 8, 7])
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

Minimum values for each model

The same idea, this time using ‘min’ instead of ‘count’

da.where(da < 0).min('time')
<xarray.DataArray (model: 5)>
array([-0.32756177, -0.34731748, -0.48146704, -0.49678032, -0.46208259])
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

Date of minimum value for each model

Start off by finding the index of the minimum value - argmin.

min_index = da.where(da < 0).argmin('time')
min_index
<xarray.DataArray (model: 5)>
array([0, 4, 9, 2, 2])
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

These indices can be converted to time values all at once by using min_index as an array index

da.time[min_index]
<xarray.DataArray 'time' (model: 5)>
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000',
       '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000',
       '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
    time     (model) datetime64[ns] 2001-01-01 2001-05-01 ... 2001-03-01
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

There’s a “ghost” time axis in the result which might cause problems, let’s get rid of it. Our returned values aren’t time dependent

da.time[min_index].drop('time')
<xarray.DataArray 'time' (model: 5)>
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000',
       '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000',
       '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

Putting it together

It could be helpful to put all of the results into a single dataset

event_stats = xarray.Dataset({
    'count': da.where(da < 0).count('time'),
    'min': da.where(da < 0).min('time'),
    'min_date': da.time[min_index].drop('time'),
})

event_stats
<xarray.Dataset>
Dimensions:   (model: 5)
Coordinates:
  * model     (model) <U1 'A' 'B' 'C' 'D' 'E'
Data variables:
    count     (model) int64 4 2 6 8 7
    min       (model) float64 -0.3276 -0.3473 -0.4815 -0.4968 -0.4621
    min_date  (model) datetime64[ns] 2001-01-01 2001-05-01 ... 2001-03-01

That way you can e.g. get all the values for a specific model

event_stats.sel(model='A')
<xarray.Dataset>
Dimensions:   ()
Coordinates:
    model     <U1 'A'
Data variables:
    count     int64 4
    min       float64 -0.3276
    min_date  datetime64[ns] 2001-01-01

Climtas event detection

The climtas library has some functions for event detection - it will find runs where a condition is true for multiple times, then let you run a function to calculate statistics for each event

import climtas

events = climtas.event.find_events(da < 0, min_duration=2)
events

time model event_duration
0 0 0 2
1 1 4 2
2 0 3 5
3 4 2 2
4 4 4 2
5 6 0 2
6 6 3 3
7 7 4 3
coords = climtas.event.event_coords(da, events)

def stat_func(event_data):
    return {'min': event_data.min().values,
            'min_date': event_data.time[event_data.argmin()].values}
    
minimum = climtas.event.map_events(da, events, stat_func)

coords.join(minimum)
model time event_duration min min_date
0 A 2001-01-01 31 days -0.32756177046349233 2001-01-01
1 E 2001-02-01 28 days -0.4620825909697166 2001-03-01
2 D 2001-01-01 120 days -0.49678032011641426 2001-03-01
3 C 2001-05-01 31 days -0.36044557563610247 2001-05-01
4 E 2001-05-01 31 days -0.395330843938117 2001-06-01
5 A 2001-07-01 31 days -0.2489001360755888 2001-07-01
6 D 2001-07-01 62 days -0.48008864559888853 2001-07-01
7 E 2001-08-01 61 days -0.4343309190867375 2001-09-01