19 October 2018

Fast MOM collation with payu and mppnncombine-fast

by Aidan Heerdegen


MOM ocean outputs (and restarts) are usually tiled: the output is distributed over multiple files because it is faster. After the model is run the tile outputs are joined back together using a program called mppnccombine.

As model resolution has increased it has become necessary to use compressed netCDF4 outputs in order to save space. Compressed ocean data files are typically one half to one third the size of the equivalent uncompressed data.

The time taken to collate ouputs with higher resolution, particularly the 0.1 degree model, has become burdensome. In some cases collation is taking longer than the model run itself. A large part of the time taken to collate these files is due to uncompressing and recompressing the tiled output.

Scott Wales has written mppnccombine-fast which collates compressed netCDF4 data using functions in the HDF5 library (upon which netCDF4 is built) to directly copy compressed data without the extra steps of decompression and recompression.


  1. You will need a mppnccombine-fast executable. Either use one I have compiled /short/public/aph502/mppnccombine-fast or get the the source and compile your own version

  2. Place the executable in /short/$PROJECT/$model/bin or specify full path in config.yaml

  3. You must use a version of payu of 0.10 or greater (module load payu/0.10 on raijin)

  4. Updated config.yaml syntax

Old Syntax

collate: true
collate_mem: 16GB
collate_queue: express
collate_ncpus: 4
collate_flags: -n4 -r

New syntax

Replaces collate_ options with dictionary

    enable: true
    queue: express
    memory: 4GB
    walltime: 00:30:00
    mpi: true
    ncpus: 4
    threads: 2
    # flags: -v
    # exe: /full/path/to/mppnccombine-fast

You must specify mpi to use mppnccombine-fast. Minimum of 2 cpus, so you can’t use copyq. The number of cpus per thread is ncpus / nthreads.

nthreads defaults to 1. ncpus defaults to 2 and enable defaults to true.

Don’t have to specify flags, enable or exe unless you need to specify values other than the default.

There are fewer flags, as mppnccombine-fast has fewer command options than mppnccombine

Resource requirements

Memory use should only depend on chunksize in the compressed file, not on the overall size of the file being written. Unfortunately a memory leak bug in the underlying HDF5 library means memory use will go up with the number of times data is written to a collated file. It is difficult to predict, but 2-4GB per thread has been the upper limit observed so far.

Walltime should be minutes. There is no speed-up for low resolution outputs (MPI overhead swamps fast run times). Also no speed advantage for uncompressed files, especially if io_layout is not specified, and there are hundreds or thousands of tiles.

There is a minimum of 2 cpus. Speed improvements up to 8 cpus in some cases has been observed, but experimentation is required to optimise this for any particular configuration, as the size and number of diagnostics can vary a great deal.

Layout affects efficiency

Chunk sizes are chosen automatically by netCDF4 and depend on tile size. Inconsistent tile sizes leads to inconsistent chunk sizes. Inconsistent chunk sizes makes mppnccombine-fast slow as it has to do the uncompress/compress step for tiles which are not the same as the first

To ensure maximum speed make processor layout and io_layout, an integer divisor of grid.


Quarter degree MOM-SIS model: 1440 x 1080.

layout = 64, 30
io_layout = 8, 6

Improved Layout

Quarter degree MOM-SIS model: 1440 x 1080.

layout = 60, 36
io_layout = 10, 6