Merge arrays with missing data

Claire Carouge, CLEX CMS

Let’s say you have 2 datasets coming from different sources but representing the same quantity. You’d like to merge them into a single dataset by taking their mean. Unfortunately, both datasets have missing data at different times and places. Accordingly, we want the merged dataset to follow these rules:

  • if both original datasets have data, we take the mean of both

  • if only one dataset has data, we take that value

  • if the data is missing in both original datasets, the merged value stays missing

The strategy using xarray is to load each dataset into a DataArray, concatenate both arrays along a new dimension and then average along this dimension.

import xarray as xr
import numpy as np

First, define 2 arrays with the same dimensions and with missing data at different places.

aa = xr.DataArray([[0,1,2],[3,4,np.nan]],dims=('x','y'))
bb = xr.DataArray([[5,np.nan,6],[np.nan,7,np.nan]],dims=('x','y'))
aa
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.,  1.,  2.],
       [ 3.,  4., nan]])
Dimensions without coordinates: x, y
bb
<xarray.DataArray (x: 2, y: 3)>
array([[ 5., nan,  6.],
       [nan,  7., nan]])
Dimensions without coordinates: x, y

Now, if we simply add the arrays together, we do not get what we want. Missing values take precedence: if either array has a missing value at a point, the sum is missing there. So summing and dividing by the number of arrays won’t work.

aa+bb
<xarray.DataArray (x: 2, y: 3)>
array([[ 5., nan,  8.],
       [nan, 11., nan]])
Dimensions without coordinates: x, y
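
To make the problem concrete, here is a minimal sketch that compares the naive "sum and divide by 2" with a hand-written NaN-aware mean. The names naive, total, count and manual are only illustrative and are not part of the original example. The manual version has to count how many inputs have data at each point, which is the bookkeeping a NaN-skipping mean does for us.

```python
import numpy as np
import xarray as xr

aa = xr.DataArray([[0, 1, 2], [3, 4, np.nan]], dims=("x", "y"))
bb = xr.DataArray([[5, np.nan, 6], [np.nan, 7, np.nan]], dims=("x", "y"))

# Naive mean: the result is NaN wherever either input is missing.
naive = (aa + bb) / 2

# Hand-written NaN-aware mean (illustrative only):
# sum with missing values treated as 0 ...
total = aa.fillna(0) + bb.fillna(0)
# ... divided by the number of inputs that actually have data at each point.
count = aa.notnull().astype(int) + bb.notnull().astype(int)
# Where count is 0, replace it with NaN so the result stays missing.
manual = total / count.where(count > 0)

print(naive.values)
print(manual.values)
```

The naive result loses every point where only one dataset has data, while the manual version follows the three rules listed at the top.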

By contrast, taking a mean works, because xarray’s mean ignores missing values by default (the mean of 1 and nan is 1). To take a mean, we first need to “merge” the arrays into a single array. For this we’ll use the xarray.concat() function.

Concatenate the arrays along a new dimension, which we’ll call z:

cc = xr.concat((aa,bb),'z')
cc
<xarray.DataArray (z: 2, x: 2, y: 3)>
array([[[ 0.,  1.,  2.],
        [ 3.,  4., nan]],

       [[ 5., nan,  6.],
        [nan,  7., nan]]])
Dimensions without coordinates: z, x, y

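As the output above shows, the concatenation stacks the 2 arrays along the new dimension z in a single array. Now we take advantage of the fact that xarray’s mean() skips missing values by default for floating-point data. That is, at each point the mean is computed only over the datasets that have data there, and the result is missing only where both are missing.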

cc.mean(dim='z')
<xarray.DataArray (x: 2, y: 3)>
array([[2.5, 1. , 4. ],
       [3. , 5.5, nan]])
Dimensions without coordinates: x, y
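
The NaN-skipping behaviour is controlled by the skipna argument of mean(). As a small sketch (the names default_mean and strict_mean are only illustrative), passing skipna=False reproduces the "missing value takes precedence" behaviour of the plain sum, which shows that the default is what makes the merge work.

```python
import numpy as np
import xarray as xr

aa = xr.DataArray([[0, 1, 2], [3, 4, np.nan]], dims=("x", "y"))
bb = xr.DataArray([[5, np.nan, 6], [np.nan, 7, np.nan]], dims=("x", "y"))
cc = xr.concat((aa, bb), dim="z")

# Default for float data: NaN values are skipped along z.
default_mean = cc.mean(dim="z")

# skipna=False: any NaN along z makes the result NaN.
strict_mean = cc.mean(dim="z", skipna=False)

print(default_mean.values)
print(strict_mean.values)
```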

These last 2 operations are usually chained, since you don’t need to store the result of the concat operation.

xr.concat((aa,bb),'z').mean(dim='z')
<xarray.DataArray (x: 2, y: 3)>
array([[2.5, 1. , 4. ],
       [3. , 5.5, nan]])
Dimensions without coordinates: x, y
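
The same pattern extends to more than 2 arrays. The sketch below wraps it in a small helper. The function name merge_mean, the dimension name "source" and the third array dd are hypothetical and introduced only for illustration. It assumes all inputs share the same dimensions, as in the example above.

```python
import numpy as np
import xarray as xr


def merge_mean(*arrays, dim="source"):
    """Average any number of DataArrays, ignoring missing values.

    Hypothetical helper: concatenates the inputs along a new
    dimension and takes the NaN-skipping mean along it.
    """
    return xr.concat(arrays, dim=dim).mean(dim=dim)


aa = xr.DataArray([[0, 1, 2], [3, 4, np.nan]], dims=("x", "y"))
bb = xr.DataArray([[5, np.nan, 6], [np.nan, 7, np.nan]], dims=("x", "y"))
# A third, made-up array with yet another pattern of missing data.
dd = xr.DataArray([[np.nan, np.nan, 10], [9, np.nan, np.nan]], dims=("x", "y"))

merged = merge_mean(aa, bb, dd)
print(merged.values)
```

At each point the mean is taken over however many of the inputs have data there, and the last point stays missing because no input has a value for it.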