How to create NetCDF files

This is a very basic introduction to NetCDF files and how to create them.

This document describes how to create a NetCDF file in Fortran, and later on in Python. The methods described herein will overwrite any existing file with the same name. Opening existing NetCDF files, either for reading or for modifying/appending, is different.

You do not need to know exactly how data is stored in the NetCDF format, but you do need to know that each file consists of a header, which contains the metadata, and the actual data.

Metadata is data about data: it tells us the dimensions and datatypes of the data, as well as arbitrary attributes.

Since the header is at the beginning of the file and changes its length when variables, dimensions, or attributes are added or removed, it is very advisable to create the complete header before starting to write the data; otherwise the library may have to move the existing data around within the file.

Create a NetCDF file in Fortran

netcdf module

While there is an old-style Fortran 77 NetCDF interface, I strongly recommend using the Fortran 90 module invoked with the use statement:

program create_netcdf
    use netcdf
    implicit none
end program create_netcdf

This is the most basic structure. To compile this program, you need the netcdf library in your library search path (LD_LIBRARY_PATH); on raijin, this is easily achieved by running the module load netcdf command (or any of the available versions).

Then you need to link it with the compiler options -lnetcdff -lnetcdf (the Fortran library -lnetcdff first, then the C library -lnetcdf):

$ module load intel-fc/2018.3.222 netcdf/4.6.1
$ ifort -o create_netcdf create_netcdf.f90 -lnetcdff -lnetcdf

netcdf Fortran 90 API

The full API can be found here: https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-f90.html, but it's 10 years old and pretty out-of-date.

Better to use the C API here: https://www.unidata.ucar.edu/software/netcdf/docs/modules.html, always replacing the leading nc_ with nf90_.

But the simple version is: every instruction is a call to a function starting with nf90_, which returns an integer value as a status. Status 0 (or NF90_NOERR) means that the call was successful.

I strongly suggest writing a little check routine, like this:

subroutine check(status, operation)
    use netcdf
    implicit none
    integer, intent(in) :: status
    character(len=*), intent(in) :: operation
    if (status == NF90_NOERR) return
    print *, "Error encountered during ", operation
    print *, nf90_strerror(status)
    STOP 1
end subroutine check

Then, after every call to any nf90_ routine, you can call this check to see what happened. If all went fine, the call returns immediately, but if there was an error, you get a human-readable error output.

create an empty file

To create a file, we need a file handle: an integer variable, by convention called ncid, which references the file we created, similar to the unit number in read and write statements:

program create_netcdf
    use netcdf
    implicit none
    integer :: status, ncid

    status = nf90_create('data.nc', NF90_NETCDF4, ncid)
    call check(status, 'open')

    status = nf90_close(ncid)
    call check(status, 'close')

contains

    subroutine check(status, operation)
        ....
    end subroutine check
end program create_netcdf

The NF90_NETCDF4 flag tells the NetCDF library which type of NetCDF file to create, in this case the most recent version.

For the rest of the documentation I will no longer write the calls to check; please assume that it is called after every call to any nf90_ function.

a very basic header

I now assume that I want to store a 2-d field: latitude by longitude. ny and nx are the number of gridpoints in latitude and longitude respectively. The field itself I simply call field, and lat_array and lon_array are arrays with the values for latitude and longitude.

the dimensions

First, we need the dimensions. Fundamentally, a dimension is just a name and a length. For each dimension, we need an integer variable called a dimension ID, or dimid. Assume we have declared dimids for latitude and longitude called dimid_lat and dimid_lon:

    status = nf90_def_dim(ncid, 'longitude', nx, dimid_lon)
    status = nf90_def_dim(ncid, 'latitude', ny, dimid_lat)

We're only using fixed-length dimensions here, but if we wanted an unlimited dimension, we'd write NF90_UNLIMITED as the length of the dimension (where we currently have nx and ny, respectively).

the variables

Next, we define three more integer variables, varid_*, one each for longitude and latitude (those will contain the actual values of the longitudes and latitudes), and of course one for our field:

    status = nf90_def_var(ncid, 'longitude', NF90_FLOAT, [dimid_lon], varid_lon)
    status = nf90_def_var(ncid, 'latitude', NF90_FLOAT, [dimid_lat], varid_lat)
    status = nf90_def_var(ncid, 'field', NF90_FLOAT, [dimid_lon, dimid_lat], varid_field)

You can see that each variable needs a name, a type (in this case 32-bit floating-point), an array of dimensions (referenced by their dimension IDs), and a new integer variable, the variable ID, which references the variable itself.

compression

The field in our example is fairly small, but if it were larger, we would want to chunk and compress it. You can read about what chunking does here: https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters

But here's how we do that in the Fortran code: after we have defined the variable, we add this line:

    status = nf90_def_var_chunking(ncid, varid_field, NF90_CHUNKED, [10, 101])

NF90_CHUNKED -- as opposed to NF90_CONTIGUOUS -- tells the library that the variable shall be chunked. The last argument is an array declaring how large the chunks should be along each dimension (here 10 points along longitude and 101 along latitude).
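To see why the chunk shape matters, here is a back-of-the-envelope sketch in Python (plain arithmetic, no NetCDF library needed) counting how many chunks different read patterns touch for our 200 x 101 field stored in 10 x 101 chunks:

```python
import math

nx, ny = 200, 101   # grid points along longitude and latitude
cx, cy = 10, 101    # chunk sizes along longitude and latitude

# total number of chunks the field is split into
chunks_total = math.ceil(nx / cx) * math.ceil(ny / cy)
# one full latitude circle (all longitudes at a single latitude)
chunks_one_lat = math.ceil(nx / cx)
# one meridian (a single longitude at all latitudes)
chunks_one_lon = math.ceil(ny / cy)

print(chunks_total, chunks_one_lat, chunks_one_lon)  # -> 20 20 1
```

Each chunk is read or written as a whole, so with this shape a read along a meridian touches a single chunk, while a read along a latitude circle touches all 20; chunk shapes should match the dominant access pattern.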

Once the variable is declared as chunked, we can compress -- or deflate -- it:

    status = nf90_def_var_deflate(ncid, varid_field,          &
                                  shuffle = 1,                &
                                  deflate = 1,                &
                                  deflate_level = 5  )

One noticeable point: shuffle and deflate are logically booleans, but for historical reasons you have to pass them as integers: 0 for false, any other number for true.

shuffle rearranges the bytes on disk so that the first bytes of all values are stored together, then all the second bytes, and so on; this often improves the compression ratio, so you might as well use it. deflate tells the netCDF library that the variable should be compressed. The compression level can be set between 0 (none) and 9 (strongest); 5 is usually a good compromise between speed and file size.
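As a rough illustration of why shuffling helps -- this is not the netCDF API, just plain zlib applied to a numpy array (both assumed available) -- compare the compressed sizes of the raw bytes and the byte-shuffled bytes of a smooth float32 field:

```python
import zlib
import numpy as np

# a smooth field: neighbouring values differ little, so the high-order
# bytes (sign, exponent) of consecutive float32 values repeat a lot
values = np.sin(np.linspace(0.0, 2.0 * np.pi, 10000, dtype=np.float32))

plain = values.tobytes()
# shuffle: view as a (n_values, 4) byte matrix and transpose it, so all
# the 1st bytes come first, then all the 2nd bytes, and so on
shuffled = values.view(np.uint8).reshape(-1, 4).T.tobytes()

print(len(zlib.compress(plain, 5)), len(zlib.compress(shuffled, 5)))
```

On data like this the shuffled stream compresses noticeably better, because the repeating high-order bytes form long runs that deflate can exploit.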

the attributes

Attributes can be attached to a variable, or to the file itself (global attribute). Common attributes are the units, long names, et cetera.

    status = nf90_put_att(ncid, NF90_GLOBAL, 'note', 'training file created with Fortran 90')
    status = nf90_put_att(ncid, varid_lon, 'units', 'degree_east')
    status = nf90_put_att(ncid, varid_lat, 'units', 'degree_north')
    status = nf90_put_att(ncid, varid_field, '_FillValue', -2e8)

We can add any number of attributes to each variable, or even to the global file. They can be of any common type. For a list of attribute conventions, you can look here: http://cfconventions.org/

end definition

With NetCDF versions before version 4, we needed to tell the Fortran module that we're finished with the header by calling nf90_enddef; with NetCDF4 this is no longer necessary. Note that if you use an older format (declared in the nf90_create instruction above), you would have to call nf90_enddef here.

The actual data

Let's create some data. I'm using implicit do loops to create evenly spaced longitude and latitude arrays, and the field itself gets filled in a nested do loop:

    lon_array = [(ii * (360./nx), ii=0, nx-1)]
    lat_array = [(jj * (180./(ny-1)) - 90., jj=0, ny-1)]
    do jj = 1, ny
        do ii = 1, nx
            field(ii, jj) = sin(lon_array(ii) * pi/180.) * &
                cos(lat_array(jj) * pi/180.)
        end do
    end do

writing the data

It is worth noting that the variable definition of the field has the dimensions in the same order as the actual 2-d array. This makes things very easy:

    status = nf90_put_var(ncid, varid_lon, lon_array)
    status = nf90_put_var(ncid, varid_lat, lat_array)
    status = nf90_put_var(ncid, varid_field, field)

Result

Looking at the output file, we notice something:

$ ncdump -hs data.nc
netcdf data {
dimensions:
    longitude = 200 ;
    latitude = 101 ;
variables:
    float longitude(longitude) ;
        longitude:units = "degree_east" ;
        longitude:_Storage = "contiguous" ;
        longitude:_Endianness = "little" ;
    float latitude(latitude) ;
        latitude:units = "degree_north" ;
        latitude:_Storage = "contiguous" ;
        latitude:_Endianness = "little" ;
    float field(latitude, longitude) ;
        field:_FillValue = -2.e+08f ;
        field:_Storage = "chunked" ;
        field:_ChunkSizes = 101, 10 ;
        field:_DeflateLevel = 5 ;
        field:_Shuffle = "true" ;
        field:_Endianness = "little" ;

// global attributes:
        :note = "training file created with Fortran 90" ;
        :_NCProperties = "version=1|netcdflibversion=4.6.1|hdf5libversion=1.10.2" ;
        :_SuperblockVersion = 0 ;
        :_IsNetcdf4 = 1 ;
        :_Format = "netCDF-4" ;
}

In Fortran, we always had the dimensions of field as (longitude, latitude); now it's the other way round.

That's because Fortran stores multi-dimensional arrays in column-major order (the first index varies fastest in memory), the opposite of C and most other programming languages, and ncdump reports the dimensions in the C convention.
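The difference is easy to see in numpy (assumed available): an array's strides, the number of bytes between consecutive elements along each index, reveal which index varies fastest in memory for the two layouts:

```python
import numpy as np

a = np.zeros((101, 200), dtype=np.float32)  # C order: last index varies fastest
f = np.asfortranarray(a)                    # Fortran order: first index varies fastest

print(a.strides)  # (800, 4): stepping the first index skips a whole 200-value row
print(f.strides)  # (4, 404): stepping the first index moves just one value
```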

Python

There are different methods to read and write NetCDF in Python, but xarray is one of the most convenient. It builds the complete data structure in memory, so the output itself is only a single instruction:

First, import xarray and numpy

In [1]:
import xarray as xr
import numpy as np
%matplotlib inline

Second: Create the latitude and longitude arrays:

In [2]:
nx = 200; ny = 101
lon_array = np.linspace(0, 360, nx, endpoint=False, dtype=np.float32)
lat_array = np.linspace(-90, 90, ny, endpoint=True, dtype=np.float32)

Third: Create the field, complete with values, dimensions, coordinates, and attributes.

In [3]:
field=xr.DataArray(
    np.sin(lon_array[np.newaxis, :] * np.pi / 180.) * 
    np.cos(lat_array[:, np.newaxis] * np.pi / 180.),
    dims = ['latitude', 'longitude'], 
    coords = {'latitude': lat_array, 'longitude': lon_array},
    attrs = {'_FillValue':-2e8},
)

Fourth: The coordinate arrays don't have units yet, let's fix that.

In [4]:
field.longitude.attrs['units'] = 'degree_east'
field.latitude.attrs['units'] = 'degree_north'

Fifth: Create a Dataset containing all fields (in this case only one). The NetCDF file attributes are taken from the dataset attributes:

In [5]:
ds = xr.Dataset({'field':field}, attrs={'note':'training file created with xarray'})

Sixth: Use a single instruction to store all the data:

In [6]:
ds.to_netcdf('data.nc', format='NETCDF4', 
             encoding={'field':{
                                'shuffle':True,
                                'chunksizes':[101, 10],
                                'zlib':True,
                                'complevel':5
            }})
In [7]:
!ls data.nc
data.nc
In [8]:
!ncdump -hs data.nc
netcdf data {
dimensions:
	longitude = 200 ;
	latitude = 101 ;
variables:
	float longitude(longitude) ;
		longitude:_FillValue = NaNf ;
		longitude:units = "degree_east" ;
		longitude:_Storage = "contiguous" ;
		longitude:_Endianness = "little" ;
	float latitude(latitude) ;
		latitude:_FillValue = NaNf ;
		latitude:units = "degree_north" ;
		latitude:_Storage = "contiguous" ;
		latitude:_Endianness = "little" ;
	float field(latitude, longitude) ;
		field:_FillValue = -2.e+08f ;
		field:_Storage = "chunked" ;
		field:_ChunkSizes = 101, 10 ;
		field:_DeflateLevel = 5 ;
		field:_Shuffle = "true" ;
		field:_Endianness = "little" ;

// global attributes:
		:note = "training file created with xarray" ;
		:_NCProperties = "version=1|netcdflibversion=4.6.1|hdf5libversion=1.10.1" ;
		:_SuperblockVersion = 0 ;
		:_IsNetcdf4 = 1 ;
		:_Format = "netCDF-4" ;
}