Basic event statistics#

Scott Wales, CLEX CMS

We have a set of timeseries from multiple models, and we'd like to compute some basic statistics on where the values cross some threshold (in the examples below, where they fall below zero)

import numpy
import xarray
import pandas

Sample data#

As a sample we'll use some random data, with dimensions of 'model' and 'time'

da = xarray.DataArray(numpy.random.random((5,10))-0.5, dims=['model','time'])
da['time'] = pandas.date_range('20010101', periods=da.sizes['time'], freq='MS')
da['model'] = ['A','B','C','D','E']
da
<xarray.DataArray (model: 5, time: 10)>
array([[-0.32756177, -0.26286746,  0.40030123,  0.35854755,  0.43113588,
         0.19164228, -0.24890014, -0.16822426,  0.0288118 ,  0.30851645],
       [ 0.02352896,  0.01700063,  0.40107743,  0.09688021, -0.34731748,
         0.33646959, -0.22218415,  0.00075536,  0.41923523,  0.46450296],
       [-0.36606797,  0.13691949, -0.24407689,  0.26593951, -0.36044558,
        -0.09755949,  0.23047747, -0.30123992,  0.49773007, -0.48146704],
       [-0.28529412, -0.4137749 , -0.49678032, -0.20963922, -0.37296851,
         0.32198888, -0.48008865, -0.35185213, -0.19133728,  0.48969387],
       [ 0.35537076, -0.34188436, -0.46208259,  0.41417063, -0.07029039,
        -0.39533084,  0.21379516, -0.18512164, -0.43433092, -0.29407475]])
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 2001-02-01 ... 2001-10-01
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'
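
In practice the timeseries would usually come from model output files rather than random numbers. Here's a minimal sketch of building an equivalent array, assuming one file per model; the file names and the variable name 'tas' are hypothetical and would need to be replaced with your own data

# Hypothetical file names and variable name - replace with your own data
paths = {'A': 'model_a.nc', 'B': 'model_b.nc', 'C': 'model_c.nc'}

# Read one timeseries per model and stack them along a new 'model' dimension
da_models = xarray.concat(
    [xarray.open_dataset(path)['tas'] for path in paths.values()],
    dim=pandas.Index(list(paths.keys()), name='model'),
)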

Count below threshold for each model#

First select only the values below our threshold, then get the count of unmasked values along the time dimension

da.where(da < 0).count('time')
<xarray.DataArray (model: 5)>
array([4, 2, 6, 8, 7])
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'
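
As a cross-check, the same counts can be obtained by summing the boolean mask directly, since each True value counts as one

# Equivalent: sum the True values of the condition along 'time'
(da < 0).sum('time')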

Minimum values for each model#

The same idea, this time using ‘min’ instead of ‘count’

da.where(da < 0).min('time')
<xarray.DataArray (model: 5)>
array([-0.32756177, -0.34731748, -0.48146704, -0.49678032, -0.46208259])
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

Date of minimum value for each model#

Start off by finding the index of the minimum value using argmin.

min_index = da.where(da < 0).argmin('time')
min_index
<xarray.DataArray (model: 5)>
array([0, 4, 9, 2, 2])
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

These indices can be converted to time values all at once by using min_index to index the time coordinate. Because min_index has the dimension 'model', so does the result

da.time[min_index]
<xarray.DataArray 'time' (model: 5)>
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000',
       '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000',
       '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
    time     (model) datetime64[ns] 2001-01-01 2001-05-01 ... 2001-03-01
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'

There's a leftover 'time' coordinate in the result which might cause problems later. Since the returned values aren't time dependent, let's drop it

da.time[min_index].drop('time')
<xarray.DataArray 'time' (model: 5)>
array(['2001-01-01T00:00:00.000000000', '2001-05-01T00:00:00.000000000',
       '2001-10-01T00:00:00.000000000', '2001-03-01T00:00:00.000000000',
       '2001-03-01T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * model    (model) <U1 'A' 'B' 'C' 'D' 'E'
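
Recent versions of xarray also provide idxmin, which returns the coordinate label of the minimum directly and avoids the separate argmin, indexing and drop steps above (a sketch, assuming a new enough xarray)

# The 'time' label of each model's minimum, in a single call
da.where(da < 0).idxmin('time')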

Putting it together#

It could be helpful to put all of the results into a single dataset

event_stats = xarray.Dataset({
    'count': da.where(da < 0).count('time'),
    'min': da.where(da < 0).min('time'),
    'min_date': da.time[min_index].drop('time'),
})

event_stats
<xarray.Dataset>
Dimensions:   (model: 5)
Coordinates:
  * model     (model) <U1 'A' 'B' 'C' 'D' 'E'
Data variables:
    count     (model) int64 4 2 6 8 7
    min       (model) float64 -0.3276 -0.3473 -0.4815 -0.4968 -0.4621
    min_date  (model) datetime64[ns] 2001-01-01 2001-05-01 ... 2001-03-01

That way you can, for example, get all the values for a specific model

event_stats.sel(model='A')
<xarray.Dataset>
Dimensions:   ()
Coordinates:
    model     <U1 'A'
Data variables:
    count     int64 4
    min       float64 -0.3276
    min_date  datetime64[ns] 2001-01-01
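
Since event_stats is a normal xarray Dataset it can also be written out for later use, for instance to a netCDF file (hypothetical filename)

# Save the combined statistics (hypothetical output filename)
event_stats.to_netcdf('event_stats.nc')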

Climtas event detection#

The climtas library has some functions for event detection: it finds runs where a condition is true for multiple consecutive times, then lets you run a function to calculate statistics for each event

import climtas

events = climtas.event.find_events(da < 0, min_duration=2)
events

   time  model  event_duration
0     0      0               2
1     1      4               2
2     0      3               5
3     4      2               2
4     4      4               2
5     6      0               2
6     6      3               3
7     7      4               3

climtas.event.event_coords converts the integer indexes returned by find_events into coordinate values, giving each event's start time and duration, while climtas.event.map_events applies a function to the data of each individual event, in this case returning the minimum value and the date it occurs on. Joining the two tables gives a summary of every event

coords = climtas.event.event_coords(da, events)

def stat_func(event_data):
    # Statistics for a single event: its minimum value and the date of that minimum
    return {'min': event_data.min().values,
            'min_date': event_data.time[event_data.argmin()].values}

minimum = climtas.event.map_events(da, events, stat_func)

coords.join(minimum)
   model        time  event_duration                   min    min_date
0      A  2001-01-01         31 days  -0.32756177046349233  2001-01-01
1      E  2001-02-01         28 days   -0.4620825909697166  2001-03-01
2      D  2001-01-01        120 days  -0.49678032011641426  2001-03-01
3      C  2001-05-01         31 days  -0.36044557563610247  2001-05-01
4      E  2001-05-01         31 days    -0.395330843938117  2001-06-01
5      A  2001-07-01         31 days   -0.2489001360755888  2001-07-01
6      D  2001-07-01         62 days  -0.48008864559888853  2001-07-01
7      E  2001-08-01         61 days   -0.4343309190867375  2001-09-01
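
The joined result is an ordinary pandas DataFrame, so the usual pandas tools apply, for example selecting the events of a single model

# Standard pandas filtering: keep only the events from model 'D'
stats = coords.join(minimum)
stats[stats['model'] == 'D']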