Speed up custom operations on large datasets with universal functions#

Vectorisation refers to applying the same calculation to an entire array at once, and it is key to the performance of array manipulation. So much so that between numpy, xarray, and scipy, practically all conceivable operations have already been vectorised.

But as we learn time and again, researchers always find new and exciting calculations to do, and out come the for loops. For loops are easy to understand, and they are not a bad way to explore how your data behaves.

But Python loops are terribly inefficient, as you’ll see shortly, so for the actual number crunching you should use vectorisation.
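To make the idea concrete, here is a minimal standalone sketch (not using the article’s data) comparing a Python loop with the equivalent vectorised numpy expression; the loop is dramatically slower, as we will measure properly below.

import numpy as np  # imported again below with the rest of the modules

a = np.arange(1_000_000)
squared_loop = np.array([x**2 for x in a])  # element-by-element Python loop
squared_vec = a**2                          # a single vectorised operation
assert np.array_equal(squared_loop, squared_vec)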

Let’s first load the relevant modules, then look at a few examples.

import numpy as np
import pandas as pd
import xarray as xr
from dask.distributed import Client
import dask.array as da
import scipy.stats as stats

Generate some data for testing#

The following function creates a data array filled with random numbers. I will use it several times over the course of this article to generate data arrays of various shapes and sizes, but its actual contents are not that relevant.

def create_dataarray(nlat, nlon, ntime=None, seed=1000):

    np.random.seed(seed)
    
    lat = np.linspace(-90, 90, nlat, endpoint=True)
    lat = xr.DataArray(lat, dims=('lat',), coords={'lat': lat}, attrs={'units': 'degree_north', 'name': 'Latitude'})
    
    lon = np.linspace(-180, 180, nlon+1, endpoint=True)[1:]
    lon = xr.DataArray(lon, dims=('lon',), coords={'lon': lon}, attrs={'units': 'degree_east', 'name': 'Longitude'})

    if ntime is not None:
        time = pd.date_range(start='2000-01-01', freq='D', periods=ntime)
        time = xr.DataArray(time, dims=('time',))
        return xr.DataArray(
            np.random.random([ntime, nlat, nlon]), 
            dims=('time', 'lat', 'lon'),
            coords={'time': time, 'lat': lat, 'lon': lon},
            attrs={'name': 'random'}
        )

    return xr.DataArray(
        np.random.random([nlat, nlon]), 
        dims=('lat', 'lon'),
        coords={'lat': lat, 'lon': lon},
        attrs={'name': 'random'}
    )

Scalar Functions#

First, let’s start with a function that takes a scalar value and returns a scalar value. We want to apply this function to every value of a data array.

The function in our example is not particularly complex: it caps values at a given maximum, by default 0.5.

def limit(value, max_value=0.5):
    return max_value if max_value < value else value

print(f"{limit(0.2)=}")
print(f"{limit(1.2)=}")
print(f"{limit(0.2, max_value=0.1)=}")
limit(0.2)=0.2
limit(1.2)=0.5
limit(0.2, max_value=0.1)=0.1

To verify that our function does exactly what we want, we start with a very small array where we can check the result by eye.

small_da = create_dataarray(1, 5)
small_da
<xarray.DataArray (lat: 1, lon: 5)>
array([[0.65358959, 0.11500694, 0.95028286, 0.4821914 , 0.87247454]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 -108.0 -36.0 36.0 108.0 180.0
Attributes:
    name:     random

We want to apply the limit function to this data array. This is what the xarray.apply_ufunc method does. ufunc in this case means “Universal Function” and the apply_ufunc method is meant to be able to apply almost any function to an array.

The positional arguments of apply_ufunc are thus:

  1. The function, in our case limit

  2. The arrays that contain the data to be passed to the function. In our case this is only one array: small_da

Our function limit expects scalar values, so we need to tell the call to apply_ufunc that it needs to vectorise the array. This is done by the vectorize=True parameter. (Note the American spelling.) It means that the method cuts the array into its elements, feeds each element individually into the function, and finally collects the returned values into a new array:

xr.apply_ufunc(limit, small_da, vectorize=True)
<xarray.DataArray (lat: 1, lon: 5)>
array([[0.5       , 0.11500694, 0.5       , 0.4821914 , 0.5       ]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 -108.0 -36.0 36.0 108.0 180.0

This worked just as expected: All the values that used to be larger than 0.5 are now capped at that value.

We can supply additional parameters to the function by using the kwargs (keyword arguments) parameter. This parameter expects a dictionary with the name of the keyword as the key, and its value as the value to be passed. In this case we lower the limit to 0.3:

xr.apply_ufunc(limit, small_da, vectorize=True, kwargs={'max_value': 0.3})
<xarray.DataArray (lat: 1, lon: 5)>
array([[0.3       , 0.11500694, 0.3       , 0.3       , 0.3       ]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 -108.0 -36.0 36.0 108.0 180.0

Now the cutoff has been lowered to 0.3, and more values have been capped.

With vectorize=True, the apply_ufunc method under the hood uses the numpy.vectorize method to vectorise the input. Doing this explicitly would look something like this:

xr.apply_ufunc(np.vectorize(limit), small_da)
<xarray.DataArray (lat: 1, lon: 5)>
array([[0.5       , 0.11500694, 0.5       , 0.4821914 , 0.5       ]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 -108.0 -36.0 36.0 108.0 180.0

The apply_ufunc documentation mentions that this might be suboptimal performance-wise, and that pre-vectorised functions should be used instead.

A vectorised function already expects an array. In our case it would look something like this:

def limit_v(array, value=0.5):
    # np.where applies the cap to the whole array in one vectorised operation
    return np.where(array > value, value, array)

xr.apply_ufunc(limit_v, small_da)
<xarray.DataArray (lat: 1, lon: 5)>
array([[0.5       , 0.11500694, 0.5       , 0.4821914 , 0.5       ]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 -108.0 -36.0 36.0 108.0 180.0

Of course in this section we were talking about scalar functions, so this is a bit of a spoiler for what’s coming up.

Performance Comparison of Vectorisation#

To compare the performance of the various ways of applying the limit function to an array, we create a larger array and then use IPython’s %%time and %%timeit magics to measure the performance of each method.

What are these magics?

The %%time keyword (called a ‘magic’ by IPython) measures the time it takes to execute a cell and prints it underneath. The %%timeit keyword runs the cell several times and reports the mean and standard deviation of the runs.

large_da = create_dataarray(180, 360)
%%time
limited = large_da.copy()
for x in range(len(large_da.lon)):
    for y in range(len(large_da.lat)):
        limited[y, x] = limit(large_da[y, x])
CPU times: user 5.84 s, sys: 1.99 ms, total: 5.85 s
Wall time: 5.85 s
%%timeit
limited = xr.apply_ufunc(limit, large_da, vectorize=True)
4.94 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
limited = xr.apply_ufunc(limit_v, large_da)
61.4 µs ± 66.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Conclusion for scalar functions#

It is pretty clear that we gain a performance boost of several orders of magnitude over the explicit loops, even with the automatically vectorised call.

This just shows how big a performance hit explicit Python loops incur.

The best case is a function that is already vectorised, i.e. one that works on whole arrays, but writing one is not always feasible.
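For concreteness, here is a quick back-of-the-envelope calculation based on the timings above (the exact numbers will of course differ from machine to machine):

# rough speedup factors relative to the explicit Python loops
print(5.85 / 4.94e-3)   # ~1200x for apply_ufunc with vectorize=True
print(5.85 / 61.4e-6)   # ~95,000x for the pre-vectorised function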

Reduction functions#

Next we want to see what we need to do when our function reduces an array dimension to a single value.

For example, this returns the index of the largest value along its axis:

def argmax(array):
    return array.argmax()

You might think that this is already vectorised because it expects an array, but it is not: The function expects a 1-d array, and we want to apply it to the time dimension of a 3-d array.
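A small numpy example makes the problem visible: called on a 2-d array without an axis argument, argmax flattens the array first and returns a single index, not one index per gridpoint.

# argmax without an axis argument works on the flattened array
a = np.array([[0.1, 0.9],
              [0.8, 0.2]])
print(a.argmax())         # 1: a single index into the flattened array
print(a.argmax(axis=-1))  # [1 0]: one index per row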

First, we create a small data array with a time dimension, and then we use the function above to get the largest value along the time dimension for each lat/lon gridpoint.

small_da = create_dataarray(1, 2, 5)
small_da
<xarray.DataArray (time: 5, lat: 1, lon: 2)>
array([[[0.65358959, 0.11500694]],

       [[0.95028286, 0.4821914 ]],

       [[0.87247454, 0.21233268]],

       [[0.04070962, 0.39719446]],

       [[0.2331322 , 0.84174072]]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-05
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 0.0 180.0
Attributes:
    name:     random

Core dimensions#

The concept of core dimensions isn’t very well explained in the documentation.

Our function looks along the time dimension for the largest value. In xarray terminology, this makes time the core dimension of the function’s input: this dimension is core to what the function does.

input_core_dims is a list of tuples of dimension names: one tuple per input array, listing that array’s core dimensions. Because our function has only a single input array, the outer list has only one element, the tuple ('time',). And because time is the only core dimension of that array, the tuple contains just this one name.

Why is there a comma in “(‘time’,)”?

Simple parentheses around a Python expression are just evaluated, so ('time') is the same as 'time', which is a string and therefore iterable (!). Python would then think that the core dimensions are t, i, m, and e. The comma after 'time' tells Python: no, this is a tuple with a single element, the string “time”. For tuples with more than one element, the trailing comma is no longer necessary.
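A quick check in a Python session makes the difference obvious:

print(type(('time')))   # <class 'str'>
print(type(('time',)))  # <class 'tuple'>
print(list('time'))     # ['t', 'i', 'm', 'e'] -- what iterating over the bare string yields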

xr.apply_ufunc(argmax, small_da, input_core_dims=[('time',)], vectorize=True)
<xarray.DataArray (lat: 1, lon: 2)>
array([[1, 4]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 0.0 180.0

Vectorisation, again#

Again, we want to see whether we can manually vectorise a function.

With vectorize=True the data is sliced into 1-d arrays that are individually passed to the function. This means that argmax is called once for each individual lat/lon grid point.

With vectorize=False (or no vectorize at all), the full array is passed to the function in one go, except that the core dimension is placed at the very end.

We can exploit this by passing axis=-1 to numpy’s argmax method, telling it to reduce only the last dimension, regardless of how many dimensions the array has.

def argmax_v(array, verbose=False):
    if verbose:
        print(f"{array.shape=}")
    return array.argmax(axis=-1)

I’ve added the verbose parameter to the vectorised function, so that I can show what happens when the apply_ufunc is called with and without vectorize=True:

r1 = xr.apply_ufunc(
    argmax_v, small_da, 
    input_core_dims=[('time',)], 
    vectorize=True, kwargs={'verbose': True},
)
array.shape=(5,)
array.shape=(5,)
r2 = xr.apply_ufunc(
    argmax_v, small_da, 
    input_core_dims=[('time',)], 
    vectorize=False, kwargs={'verbose': True},
)
array.shape=(1, 2, 5)

As you can see, with vectorize=True the multidimensional array is sliced into subarrays along the core dimension, in this case 1-d arrays along “time”.

Because we have 2 gridpoints (2 along longitude times 1 along latitude), the function is called twice.

On the other hand, with vectorize=False the whole array is passed to the function in one go, with the core dimension moved to the last axis. (Note the 5 at the end of the shape, whereas in the data array it is the first entry of the shape.)
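What the function receives with vectorize=False is essentially the data with the core dimension transposed to the last axis; a minimal sketch of that reordering:

# the core dimension 'time' ends up as the last axis
reordered = small_da.transpose('lat', 'lon', 'time')
print(reordered.shape)  # (1, 2, 5)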

But how do the two compare performance-wise? Again, we create a larger dataset for the comparisons.

large_da = create_dataarray(181, 360, 365)
%%timeit
am = xr.apply_ufunc(argmax, large_da, input_core_dims=[('time',)], vectorize=True)
87.4 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
am_v = xr.apply_ufunc(argmax_v, large_da, input_core_dims=[('time',)])
38.1 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can see that there is again a noticeable performance improvement: the pre-vectorised version runs roughly twice as fast.

Functions that take and return an array#

Sometimes, the function not only takes an array as input, but also returns an array.

For this example, we want to reverse (mirror) the values along one dimension.

def invert(array):
    return array[ ..., -1::-1]

print(f"{invert(np.array([1, 2, 3, 4]))=}")
invert(np.array([1, 2, 3, 4]))=array([4, 3, 2, 1])

What is …

numpy supports the notation ... (Ellipsis) as a stand-in for all dimensions not explicitly indexed. The syntax array[..., -1::-1] means, depending on the number of dimensions of array: array[-1::-1] for a 1-d array, array[:, -1::-1] for a 2-d array, array[:, :, -1::-1] for a 3-d array, and so forth.
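A quick check that the two notations select the same values for a 3-d array:

a = np.arange(24).reshape(2, 3, 4)
print(np.array_equal(a[..., -1::-1], a[:, :, -1::-1]))  # True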

small_da = create_dataarray(2, 5)
small_da
<xarray.DataArray (lat: 2, lon: 5)>
array([[0.65358959, 0.11500694, 0.95028286, 0.4821914 , 0.87247454],
       [0.21233268, 0.04070962, 0.39719446, 0.2331322 , 0.84174072]])
Coordinates:
  * lat      (lat) float64 -90.0 90.0
  * lon      (lon) float64 -108.0 -36.0 36.0 108.0 180.0
Attributes:
    name:     random
xr.apply_ufunc(
    invert, small_da, input_core_dims=[('lon',)], 
    output_core_dims=[('lon',)]
)
<xarray.DataArray (lat: 2, lon: 5)>
array([[0.87247454, 0.4821914 , 0.95028286, 0.11500694, 0.65358959],
       [0.84174072, 0.2331322 , 0.39719446, 0.04070962, 0.21233268]])
Coordinates:
  * lat      (lat) float64 -90.0 90.0
  * lon      (lon) float64 -108.0 -36.0 36.0 108.0 180.0

As you can see, the values along the longitude dimension have indeed been reversed.

Because the input core dimension and the output core dimension have the same name, apply_ufunc assumes that the output keeps the same coordinates. Note that the longitude values are still in the original, ascending order.
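We can verify this directly; the result’s lon coordinate is identical to that of the input (the variable name inverted here is just for illustration):

inverted = xr.apply_ufunc(
    invert, small_da, input_core_dims=[('lon',)], output_core_dims=[('lon',)]
)
print((inverted.lon == small_da.lon).all().item())  # True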

Differing length between input and output core dimension#

We have just looked at a function that returns an array of the same shape as its input, but that might not necessarily be so.

The input and output of a function might be arrays of different length.

In this case we simply truncate the time dimension to its first 3 values, forcing a change in the length of that dimension:

def first_three(array):
    return array[ ..., :3]

If I try, as above, to give the output core dimension the same name, I get an error: the same name implies the same coordinates, but the length of the dimension no longer matches.

There are two options to resolve this:

  1. Give the output core dimension a new name. In the first example below we call it new_time. The result has a dimension with the new name, but no coordinates attached to it.

  2. Tell apply_ufunc to allow the change. This is done with the exclude_dims parameter, as seen in the second example. It expects a set of dimension names that are allowed to change size. Again, the coordinates for the time dimension are dropped.

small_da = create_dataarray(1, 2, 5)
small_da
<xarray.DataArray (time: 5, lat: 1, lon: 2)>
array([[[0.65358959, 0.11500694]],

       [[0.95028286, 0.4821914 ]],

       [[0.87247454, 0.21233268]],

       [[0.04070962, 0.39719446]],

       [[0.2331322 , 0.84174072]]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-05
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 0.0 180.0
Attributes:
    name:     random
xr.apply_ufunc(
    first_three, 
    small_da, 
    input_core_dims=[('time',)], 
    output_core_dims=[('new_time',)]
)
<xarray.DataArray (lat: 1, lon: 2, new_time: 3)>
array([[[0.65358959, 0.95028286, 0.87247454],
        [0.11500694, 0.4821914 , 0.21233268]]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 0.0 180.0
Dimensions without coordinates: new_time
xr.apply_ufunc(
    first_three, small_da, 
    input_core_dims=[('time',)], 
    output_core_dims=[('time',)],
    exclude_dims=set(['time'])
)
<xarray.DataArray (lat: 1, lon: 2, time: 3)>
array([[[0.65358959, 0.95028286, 0.87247454],
        [0.11500694, 0.4821914 , 0.21233268]]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 0.0 180.0
Dimensions without coordinates: time

Applying a ufunc over multiple arrays#

Now that we have a rough understanding of how xarray.apply_ufunc works, let’s try something more complex.

We want to calculate the Pearson Correlation Coefficient between the input dataset and another array.

scipy.stats has a function called pearsonr that implements this calculation. It returns a result object with two attributes: statistic and pvalue.

For our example we’re only interested in the first, and completely ignore the pvalue.
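As a quick illustration of that return value, here is pearsonr on two small made-up lists:

res = stats.pearsonr([1, 2, 3, 4], [1.2, 1.9, 3.1, 4.2])
print(res.statistic)  # close to 1 for these almost-linear values
print(res.pvalue)     # the p-value, which we ignore in this article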

First, we create a new array, and because the values are purely random, we pick a specific slice of it as the comparison data. That way we should get exactly one gridpoint with a high statistic.

I’m using a slightly modified version straight out of the numpy.vectorize API reference.

The signature parameter describes the shape of the values the function expects and returns. In this case it expects two 1-d arrays of the same length and returns a single scalar.

def my_pearson(x, y):
    return stats.pearsonr(x, y).statistic

my_pearson_v = np.vectorize(
    my_pearson,
    signature='(n), (n) -> ()'
)

The two arrays are passed one after the other as positional arguments. Note that input_core_dims now needs to list the core dimensions of both input arrays.

small_da = create_dataarray(1, 2, 5)
comparison_da = small_da.isel(lat=0, lon=-1)
xr.apply_ufunc(
    my_pearson_v, 
    small_da, 
    comparison_da,
    input_core_dims=[('time',), ('time',)]
)
<xarray.DataArray (lat: 1, lon: 2)>
array([[-0.4356561,  1.       ]])
Coordinates:
  * lat      (lat) float64 -90.0
  * lon      (lon) float64 0.0 180.0

We see indeed perfect correlation for the last longitude.

Using dask for large datasets#

Normally we don’t run our correlation experiment on such small and easy-to-handle datasets. For larger datasets, we use dask for parallelisation.

Let’s create a larger dataset. It contains almost 2 gigabytes of uncompressed data, just small enough to be managed on a normal computer like mine, but certainly in a range where parallelisation should pay dividends.

large_da = create_dataarray(181, 360, 3650)
comparison_da = large_da.isel(lat=0, lon=-1)
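A quick back-of-the-envelope check of the size claim, using the shape we just created:

# 3650 × 181 × 360 float64 values at 8 bytes each
print(3650 * 181 * 360 * 8 / 1e9)  # ~1.9 GB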

Let’s first run this without dask to establish a baseline.

%%time
o = xr.apply_ufunc(
    my_pearson_v,
    large_da,
    comparison_da,
    input_core_dims=[('time',), ('time',)],
)
CPU times: user 22.1 s, sys: 26 ms, total: 22.1 s
Wall time: 22.2 s

In order to use dask, we need to create a dask cluster, which is done with this call:

if 'c' not in locals():
    c = Client(n_workers=4, threads_per_worker=1, memory_limit='3.5GB')
c

Client: Client-ea22e8dc-9af7-11ee-9b39-52aaff7ca351
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

By rechunking, we turn the large_da data array into a dask array; the chunks are the pieces that dask can compute in parallel.

large_da = large_da.chunk({'lat': 91, 'lon': 90})
large_da
<xarray.DataArray (time: 3650, lat: 181, lon: 360)>
dask.array<xarray-<this-array>, shape=(3650, 181, 360), dtype=float64, chunksize=(3650, 91, 90), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2009-12-28
  * lat      (lat) float64 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0
  * lon      (lon) float64 -179.0 -178.0 -177.0 -176.0 ... 178.0 179.0 180.0
Attributes:
    name:     random

For dask, we need two new parameters:

  • dask="parallelized" to tell the system to actually use dask, and

  • output_dtypes=["float"] to tell it what to expect as a result.

%%time
r = xr.apply_ufunc(
    my_pearson_v,
    large_da,
    comparison_da,
    input_core_dims=[('time',), ('time',)],
    dask='parallelized',
    output_dtypes=['float']
)
CPU times: user 2.63 ms, sys: 1.41 ms, total: 4.04 ms
Wall time: 3.54 ms

Because of lazy evaluation, the cell above returned almost immediately. It hasn’t done any of the calculations yet; it has only stored what it is supposed to do.

Only in the next cell, where we tell it to actually compute the output, will it take time.

%%time
p = r.compute()
p
CPU times: user 645 ms, sys: 953 ms, total: 1.6 s
Wall time: 11.3 s
<xarray.DataArray (lat: 181, lon: 360)>
array([[ 7.49156667e-03, -1.05611671e-02, -6.57951424e-03, ...,
        -2.13526211e-03, -6.70029486e-03,  1.00000000e+00],
       [ 1.45195946e-02,  3.92027344e-03,  1.36661369e-02, ...,
        -8.23021313e-03,  5.88698092e-03,  9.57717761e-03],
       [ 9.63523125e-03,  5.45873825e-04,  2.91139380e-03, ...,
         5.17681565e-03,  6.33934265e-03, -2.11432513e-03],
       ...,
       [ 2.99324695e-02,  2.14475632e-02, -2.38409887e-02, ...,
        -8.38419120e-04, -1.74913441e-02, -3.59145362e-02],
       [ 5.00181784e-03, -5.90834912e-03,  1.15881635e-02, ...,
         1.84094023e-02,  7.76888293e-03, -2.42987961e-02],
       [-3.11181858e-02,  3.26837453e-02, -1.08989192e-02, ...,
        -1.16345114e-02,  2.47818545e-02, -9.01637243e-03]])
Coordinates:
  * lat      (lat) float64 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0
  * lon      (lon) float64 -179.0 -178.0 -177.0 -176.0 ... 178.0 179.0 180.0
# Ensure that the perfect correlation still exists 
# where we have extracted the actual comparison array
p.isel(lat=0, lon=-1).item()
0.9999999999999998

Dask’s Generalised Universal Functions#

dask.array has its own function, gufunc, similar to numpy.vectorize, that creates generalised universal functions optimised for dask parallelisation.

It is described in detail here. In addition to the parameters of numpy.vectorize we also give it two new parameters:

  • vectorize=True to tell it that the function still needs to be vectorised, and

  • output_dtypes=np.float64 similar to the call above: The type of the elements of the output arrays.

my_pearson_g = da.gufunc(
    my_pearson, 
    signature='(n),(n) -> ()',
    vectorize=True,
    output_dtypes=np.float64
)
r = xr.apply_ufunc(
    my_pearson_g,
    large_da,
    comparison_da,
    input_core_dims=[('time',), ('time',)],
    dask='allowed',
)
%%time
p2 = r.compute()
p2.isel(lat=0, lon=-1).item()
CPU times: user 511 ms, sys: 1.16 s, total: 1.67 s
Wall time: 8.2 s
0.9999999999999998

So even at this array size we see a clear performance improvement: the wall time dropped from about 22 seconds without dask to roughly 11 and 8 seconds with the two parallelised versions. As the data gets larger, maybe even too large to fit in memory, these universal functions make the difference between feasible and impossible.

Conclusion#

Xarray’s apply_ufunc method can seriously speed up calculations along dimensions in multi-dimensional xarray dataarrays.

This functionality is very powerful, but as always the power comes with added complexity.

I hope this short document helps you get into the right mindset and master your own universal functions.