Python Tuesday: NetCDF Python library overview

Scott Wales, CLEX CMS

Let's take a look at some of the libraries available in the CMS Conda environment for loading NetCDF files.

There are three main libraries available: xarray, netCDF4 and iris. Each lets you load a file and work with variables as if they were numpy arrays, but each has its own features that can be helpful when working with climate datasets.

For the examples I'll be using the following dataset from NCI's CMIP5 archive:

In [1]:
sampledata = 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-0/amip/mon/atmos/Amon/r1i1p1/latest/tas/tas_Amon_ACCESS1-0_amip_r1i1p1_197901-200812.nc'

Xarray

http://xarray.pydata.org/en/stable/

Xarray is my favourite library for working with NetCDF files - it makes it easy to filter data by coordinate value, rather than having to work out array indices yourself. In combination with the Dask library it also lets you work with very large datasets without having to load everything into memory all at once.
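
To make the coordinate filtering concrete without needing network access, here is a minimal sketch on a small synthetic array. The grid, dates and values are made up for illustration and are not taken from the CMIP5 file; it only compares selecting by coordinate value with doing the index arithmetic by hand, and does not cover the Dask side.

```python
import numpy as np
import pandas as pd
import xarray

# Hypothetical coarse grid standing in for a real model grid
time = pd.date_range("1979-01-01", periods=24, freq="MS")
lat = np.arange(-90.0, 91.0, 30.0)   # -90, -60, ..., 90 (7 points)
lon = np.arange(0.0, 360.0, 60.0)    # 0, 60, ..., 300 (6 points)
values = np.arange(24 * 7 * 6, dtype="float32").reshape(24, 7, 6)

tas = xarray.DataArray(
    values,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="tas",
)

# Select by coordinate value; label-based slices include both end points
by_value = tas.sel(
    time=pd.Timestamp("1979-03-01"), lat=slice(-30, 30), lon=slice(0, 120)
)

# The same selection with the array indices worked out by hand
by_index = tas[2, 2:5, 0:3]

# Pick the grid point closest to a location that is not exactly on the grid
nearest = tas.sel(lat=-33.9, lon=151.2, method="nearest")
```

Both selections return the same 3 x 3 block, but the `sel` version keeps working if the grid resolution changes, while the hand-written indices would need to be recalculated.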

Because Xarray also works with file formats other than NetCDF, some NetCDF-specific features, like compression settings, can be inconvenient to set.

In [2]:
import xarray

# Open a file
data = xarray.open_dataset(sampledata)

# Variables can be accessed either as properties or as a dict
surface_temperature = data.tas
surface_temperature = data['tas']

print("Variable:\n", surface_temperature)

# Same for attributes
units = surface_temperature.units
units = surface_temperature.attrs['units']

print()
print("Attribute:\n", units)

# Variables can be indexed numpy-style or pandas-style
d = surface_temperature[0, 0:10, 0:10]
d = surface_temperature.isel(time=0, lat=slice(0,10), lon=slice(0,10))
d = surface_temperature.sel(time='19790116T1200', lat=slice(-90,-80), lon=slice(0,20))

# Data can be saved to a new file easily
data.to_netcdf('data.nc')
Variable:
 <xarray.DataArray 'tas' (time: 360, lat: 145, lon: 192)>
[10022400 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 1979-01-16T12:00:00 1979-02-15 ...
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 -85.0 -83.75 -82.5 ...
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 9.375 11.25 13.12 15.0 ...
    height   float64 ...
Attributes:
    standard_name:     air_temperature
    long_name:         Near-Surface Air Temperature
    units:             K
    cell_methods:      time: mean
    cell_measures:     area: areacella
    history:           2012-02-17T05:21:51Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...

Attribute:
 K

netCDF4

http://unidata.github.io/netcdf4-python/

The netCDF4 library is a bare-bones library for working with NetCDF data. It doesn't have the bells and whistles of Xarray, but unlike Xarray it's a dedicated library, so features like compression and scale-and-offset are simpler to access.

In [3]:
import netCDF4

data = netCDF4.Dataset(sampledata)

# Variables can be accessed like a dict
surface_temperature = data['tas']
surface_temperature = data.variables['tas']

print("Variable:\n", surface_temperature)

# Attributes are accessed as properties of a variable
units = surface_temperature.units

print("Attribute:\n", units)

# Variables can be indexed numpy-style
data = surface_temperature[0, 0:10, 0:10]

# Data can't be copied to a new file easily
Variable:
 <class 'netCDF4._netCDF4.Variable'>
float32 tas(time, lat, lon)
    standard_name: air_temperature
    long_name: Near-Surface Air Temperature
    units: K
    cell_methods: time: mean
    cell_measures: area: areacella
    history: 2012-02-17T05:21:51Z altered by CMOR: Treated scalar dimension: 'height'. 2012-02-17T05:21:51Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20).
    coordinates: height
    missing_value: 1e+20
    _FillValue: 1e+20
    associated_files: baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_ACCESS1-0_amip_r0i0p0.nc areacella: areacella_fx_ACCESS1-0_amip_r0i0p0.nc
unlimited dimensions: time
current shape = (360, 145, 192)
filling off

Attribute:
 K

Iris

https://scitools.org.uk/iris

While Xarray and netCDF4 both work similarly, the Iris library works a bit differently. Rather than accessing variables like a dictionary, Iris uses a list with a special function to get a variable by name. It also prefers CF standard names, so some special trickery is required to get a variable by its name in the file.

Iris also keeps the file-level attributes with each of the variables. You can see below that it lists things like the title and metadata conventions.

In [4]:
import iris

data = iris.load(sampledata)

# Variables can be accessed like a list
surface_temperature = data[0]

# Iris prefers to use the standard_name to identify variables
# (extract_strict is the Iris 2 name; Iris 3 renamed it to extract_cube)
surface_temperature = data.extract_strict('air_temperature')

# Getting variables by their own name can be done, but is complicated
surface_temperature = data.extract_strict(iris.Constraint(cube_func = lambda c: c.var_name == 'tas'))

print("Variable:\n", surface_temperature)

# Attributes can be accessed as properties
units = surface_temperature.units

print()
print("Attribute:\n", units)

# Variables can be indexed numpy-style or by special constraint objects
data = surface_temperature[0, 0:10, 0:10]
data = surface_temperature.extract(iris.Constraint(latitude=lambda x: 0 < x < 20))

# Data can be saved to a new file
iris.save(data, 'data.nc')
/local/swales/conda/analysis3/lib/python3.6/site-packages/iris/fileformats/cf.py:798: UserWarning: Missing CF-netCDF measure variable 'areacella', referenced by netCDF variable 'tas'
  warnings.warn(message % (variable_name, nc_var_name))
/local/swales/conda/analysis3/lib/python3.6/site-packages/iris/fileformats/_pyke_rules/compiled_krb/fc_rules_cf_fc.py:1813: FutureWarning: Conversion of the second argument of issubdtype from `str` to `str` is deprecated. In future, it will be treated as `np.str_ == np.dtype(str).type`.
  if np.issubdtype(cf_var.dtype, np.str):
/local/swales/conda/analysis3/lib/python3.6/site-packages/iris/fileformats/_pyke_rules/compiled_krb/fc_rules_cf_fc.py:1813: FutureWarning: Conversion of the second argument of issubdtype from `str` to `str` is deprecated. In future, it will be treated as `np.str_ == np.dtype(str).type`.
  if np.issubdtype(cf_var.dtype, np.str):
Variable:
 air_temperature / (K)               (time: 360; latitude: 145; longitude: 192)
     Dimension coordinates:
          time                           x              -               -
          latitude                       -              x               -
          longitude                      -              -               x
     Scalar coordinates:
          height: 1.5 m
     Attributes:
          Conventions: CF-1.4
          DODS_EXTRA.Unlimited_Dimension: time
          associated_files: baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_ACCESS1-0_amip_r0i0p0.nc...
          branch_time: 0.0
          cmor_version: 2.8.0
          contact: The ACCESS wiki: http://wiki.csiro.au/confluence/display/ACCESS/Home. Contact...
          creation_date: 2012-02-17T05:21:53Z
          experiment: AMIP
          experiment_id: amip
          forcing: GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113,...
          frequency: mon
          history: 2012-02-17T05:21:51Z altered by CMOR: Treated scalar dimension: 'height'....
          initialization_method: 1
          institute_id: CSIRO-BOM
          institution: CSIRO (Commonwealth Scientific and Industrial Research Organisation, Australia),...
          model_id: ACCESS1-0
          modeling_realm: atmos
          parent_experiment: N/A
          parent_experiment_id: N/A
          parent_experiment_rip: r1i1p1
          physics_version: 1
          product: output
          project_id: CMIP5
          realization: 1
          references: See http://wiki.csiro.au/confluence/display/ACCESS/ACCESS+Publications
          source: ACCESS1-0 2011. Atmosphere: AGCM v1.0 (N96 grid-point, 1.875 degrees EW...
          table_id: Table Amon (01 February 2012) 01388cb4507c2f05326b711b09604e7e
          title: ACCESS1-0 model output prepared for CMIP5 AMIP
          tracking_id: 7cfe11fc-5b1c-457d-812b-e95f45e7def4
          version_number: v20120115
     Cell methods:
          mean: time

Attribute:
 K
/local/swales/conda/analysis3/lib/python3.6/site-packages/iris/fileformats/netcdf.py:1573: FutureWarning: Conversion of the second argument of issubdtype from `str` to `str` is deprecated. In future, it will be treated as `np.str_ == np.dtype(str).type`.
  if np.issubdtype(coord.points.dtype, np.str):