How and when to use PBS Jobfs on NCI HPC systems

What is PBS Jobfs?

PBS Jobfs is a file system on NCI’s HPC systems, found at the path /jobfs. Unlike /home, /scratch or /g/data, /jobfs can only be used inside a PBS job, and its contents last only as long as the job does. Jobfs is requested like any other PBS resource, using the -ljobfs=xxxGB PBS flag. Unlike CPU and memory usage, it does not affect how many SUs your job will cost. The contents of /jobfs do not affect quota usage on any other NCI HPC file system.

In contrast to most file systems on NCI HPC systems, /jobfs is not a global file system, meaning that its contents are only visible on a single node. This is important to consider when running jobs across multiple nodes: /jobfs on each node is different, and files need to be copied in and out on a per-node basis.

How to use PBS Jobfs

Every PBS job gets a distinct PBS Jobfs directory on every node on which it runs at the path /jobfs/<pbs jobid>. These directories exist only for the length of time the PBS job is running. It is recommended to use the $PBS_JOBFS environment variable in PBS scripts, as this is always set to the correct path to your job’s specific /jobfs directory. Every PBS job gets a default allocation of 100MB of jobfs, and your workflows may already be using PBS Jobfs without directly specifying it.
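
To see how close a job is to its jobfs allocation, one option is to total the file sizes under $PBS_JOBFS from inside the job. The following is a minimal sketch rather than an NCI-provided tool: the function name directory_usage_bytes is hypothetical, and it sums apparent file sizes, which may differ slightly from how the quota is actually accounted.

```python
import os

def directory_usage_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so that link targets are not counted.
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # The file disappeared while we were walking; ignore it.
                pass
    return total

# PBS_JOBFS is only set inside a PBS job, so do nothing elsewhere.
jobfs = os.getenv("PBS_JOBFS")
if jobfs:
    used_mb = directory_usage_bytes(jobfs) / 1e6
    print(f"{jobfs}: {used_mb:.1f} MB in use")
```

Calling this near the end of a job gives a rough figure to compare against the -ljobfs= request.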

The canonical Linux environment variable $TMPDIR is set to the value of $PBS_JOBFS in every PBS job. Most applications on Linux use this environment variable to determine where temporary files should be placed, though not all do. For example, versions of Dask found in the hh5 conda analysis environments prior to 22.07 place their temporary files in the current working directory. You may have had a job die with the message Job 12345678.gadi-pbs killed due to exceeding jobfs quota without ever having requested or used /jobfs. This is an indication that your application is placing a large amount of data in $TMPDIR, which is the correct place for it. All you need to do in this case is increase your PBS -ljobfs= request.
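
As an illustration of an application that respects $TMPDIR, Python's standard tempfile module consults that variable when choosing where to create temporary files. The sketch below is illustrative only; the function name scratch_file_location and the myapp_ prefix are hypothetical.

```python
import os
import tempfile

def scratch_file_location():
    """Create a temporary file and return the directory it was placed in."""
    # tempfile checks TMPDIR (then TEMP and TMP) before falling back to
    # platform defaults such as /tmp.
    with tempfile.NamedTemporaryFile(prefix="myapp_") as tmp:
        return os.path.dirname(tmp.name)

print(scratch_file_location())
```

Inside a PBS job this should report the job's /jobfs directory, because $TMPDIR points there. Note that tempfile caches its choice of directory after the first use, so $TMPDIR needs to be set before the first temporary file is created.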

Any files that do not need to persist beyond the length of a PBS job should use PBS Jobfs. The main benefit of doing this, aside from performance, is that PBS jobfs is automatically cleared after every PBS job. This means that any badly-behaving applications that do not clean their own temporary files will not contribute to exhausting your /home, /scratch or /g/data quotas.

Using PBS Jobfs in various programming languages

Applications that do not respect the $TMPDIR environment variable need to be told to use /jobfs directly. Importantly, the $PBS_JOBFS variable is not set on login nodes, so any code that attempts to use that variable should be robust enough to correctly handle the case where that variable does not exist. It is also possible for $TMPDIR to be unset, so the code should also fall back to a fixed location when that variable is not available. Below are some examples of the right way to do this in a few different programming languages:

Bash - setting an environment variable. This example assumes your application reads a specific environment variable to determine its temporary file location:

# Prefer PBS_JOBFS, then TMPDIR, then fall back to /tmp
if [[ "${PBS_JOBFS}" ]]; then
    export MY_APP_TEMP_FILE_LOCATION="${PBS_JOBFS}"
elif [[ "${TMPDIR}" ]]; then
    export MY_APP_TEMP_FILE_LOCATION="${TMPDIR}"
else
    export MY_APP_TEMP_FILE_LOCATION=/tmp
fi

./myapp

Bash - writing a configuration file. This example assumes you have a config file that your application will read in order to determine where it should place its temporary files.

$ grep temp_directory my_application_config.conf.template
temp_directory __MY_APP_TEMP_DIRECTORY__

# Prefer PBS_JOBFS, then TMPDIR, then fall back to /tmp
if [[ "${PBS_JOBFS}" ]]; then
    tflocation="${PBS_JOBFS}"
elif [[ "${TMPDIR}" ]]; then
    tflocation="${TMPDIR}"
else
    tflocation=/tmp
fi

# Substitute the placeholder in the template with the chosen location
sed "s:__MY_APP_TEMP_DIRECTORY__:${tflocation}:" < my_application_config.conf.template > my_application_config.conf

$ grep temp_directory my_application_config.conf
temp_directory /jobfs/12345678.gadi-pbs

./myapp -f my_application_config.conf

Python/Jupyter notebook - as an argument to a function:

from dask.distributed import Client
import os

worker_dir=os.getenv('PBS_JOBFS')
if not worker_dir:
    worker_dir=os.getenv('TMPDIR')
if not worker_dir:
    worker_dir="/tmp"

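# local_directory tells the Dask workers where to write their temporary files
# (older releases of distributed spelled this keyword local_dir)
client = Client(local_directory=worker_dir)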

Python/Jupyter notebook - setting an environment variable:

import os

worker_dir=os.getenv('PBS_JOBFS')
if not worker_dir:
    worker_dir=os.getenv('TMPDIR')
if not worker_dir:
    worker_dir="/tmp"

os.environ['MY_APP_TEMP_FILE_LOCATION']=worker_dir

C/C++/Cuda:

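#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Prefer PBS_JOBFS, then TMPDIR, then fall back to /tmp */
    const char *my_app_temp_file_location = NULL;
    if ( getenv("PBS_JOBFS") ) {
        my_app_temp_file_location = getenv("PBS_JOBFS");
    } else if ( getenv("TMPDIR") ) {
        my_app_temp_file_location = getenv("TMPDIR");
    } else {
        my_app_temp_file_location = "/tmp";
    }
    printf("Temporary files will be placed in %s\n", my_app_temp_file_location);
    return 0;
}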

When to use PBS Jobfs

File system performance basics

There are two measures of file system performance: bandwidth and input/output operations per second (IOPS). A single I/O operation (IOP) is any operation on a file, e.g. an open, close, read, write or seek. Low-level operations like these are often hidden from higher-level applications, so it may be difficult to determine how many IOPs your application is performing. Bandwidth is more straightforward: it is the amount of data that can be written to the file system per unit time under ideal conditions. For both measures, more is better: more bandwidth and more IOPS mean a faster file system. Design constraints mean that different file systems are better suited to different kinds of operations.
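
To make the distinction concrete, the rough estimate below compares how many operations it takes to write the same amount of data as many small files versus one large file. It is a back-of-the-envelope sketch: the function name estimate_operations is hypothetical, and the assumption of three operations per file (one open, one write, one close) is an illustrative lower bound, not a measured figure.

```python
def estimate_operations(total_bytes, file_size_bytes, ops_per_file=3):
    """Estimate the operations needed to write total_bytes as files of
    file_size_bytes each.

    ops_per_file=3 assumes one open, one write and one close per file,
    which is an illustrative lower bound.
    """
    n_files = -(-total_bytes // file_size_bytes)  # ceiling division
    return n_files * ops_per_file

GB = 10**9
kB = 10**3

small = estimate_operations(1 * GB, 10 * kB)  # many 10 kB files
large = estimate_operations(1 * GB, 1 * GB)   # a single 1 GB file
print(f"1 GB as 10 kB files: {small} operations")
print(f"1 GB as one file:    {large} operations")
```

Both cases move the same number of bytes, so they place the same demand on bandwidth, but under these assumptions the small-file case needs 100,000 times as many operations.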

The global file systems are designed first for high bandwidth, then for capacity. They are made up of several hundred spinning hard disk drives because, at the time of writing, the quantity of solid state drives required to meet NCI’s performance targets is cost prohibitive. Due to the physical constraints of spinning drives and the complexity of global file systems, the entire /scratch file system on Gadi is capable of somewhere around one million IOPS. While this may sound like a lot, this performance is shared with every user on every login, compute and data mover node. A parallel application on a single node can issue around one million operations per second, which is enough to consume the entire capacity of the file system. If you’ve ever noticed an NCI HPC system going slowly and hanging when running commands like ls, it is very likely that something like this was happening at the time.

This is where PBS Jobfs comes in. Each node has a 480GB solid state drive, individually capable of around 300,000 IOPS. Although this is less than the global file systems can deliver in total, /jobfs is shared only between the jobs running on a single node. This means that almost every application will benefit from writing its small intermediate files to /jobfs.
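
The effect of sharing can be illustrated with some simple arithmetic using the approximate figures quoted above. This sketch assumes the global file system's IOPS are split evenly between busy nodes, which is a simplification, and the node counts are hypothetical values chosen only for illustration.

```python
GLOBAL_IOPS = 1_000_000  # approximate figure for /scratch quoted above
LOCAL_IOPS = 300_000     # approximate figure for one node's jobfs SSD

def fair_share_iops(total_iops, active_nodes):
    """IOPS available to each node if total_iops were split evenly."""
    return total_iops / active_nodes

for nodes in (1, 10, 100, 1000):  # hypothetical numbers of busy nodes
    share = fair_share_iops(GLOBAL_IOPS, nodes)
    better = "jobfs" if LOCAL_IOPS > share else "/scratch"
    print(f"{nodes:>5} busy nodes: {share:>9.0f} IOPS each on /scratch, "
          f"{LOCAL_IOPS} on jobfs, so {better} has more headroom")
```

Under this even-split assumption, the local drive comes out ahead as soon as four or more nodes are hitting the global file system at once.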

What kinds of workflows can benefit from this?

Any workflow that creates a large number of small files (tens of kB each) will see improved performance from moving onto /jobfs. In some cases, if your application output consists of many small files, it may be worth having your application write to /jobfs and then copying the data across at the end of the job. For example, an application that writes output in the current working directory:

cd "${PBS_JOBFS}"
"${PBS_O_WORKDIR}"/myapp -f "${PBS_O_WORKDIR}"/myapp.conf
tar -cf "${PBS_O_WORKDIR}"/output.tar *
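
The same pattern can be expressed in Python for workflows driven from a script or notebook. This is a minimal sketch using the standard tarfile module; the function name archive_directory is hypothetical, and unlike the shell glob above it also includes hidden files.

```python
import os
import tarfile

def archive_directory(src_dir, archive_path):
    """Bundle everything under src_dir into a single uncompressed tar file."""
    with tarfile.open(archive_path, "w") as tar:
        for name in sorted(os.listdir(src_dir)):
            # arcname keeps paths inside the archive relative to src_dir.
            tar.add(os.path.join(src_dir, name), arcname=name)
    return archive_path

# PBS_JOBFS is only set inside a PBS job, so do nothing elsewhere.
jobfs = os.getenv("PBS_JOBFS")
if jobfs:
    # PBS_O_WORKDIR is the directory the job was submitted from.
    dest_dir = os.getenv("PBS_O_WORKDIR") or os.getcwd()
    archive_directory(jobfs, os.path.join(dest_dir, "output.tar"))
```

Writing one archive to the global file system replaces many small-file operations with a single large sequential write.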

What about multiple node jobs?

In PBS jobs, commands in the job script are only executed on the first node of the job. This means that if files need to be copied into or out of jobfs on multiple nodes, PBS must be told to run these staging commands on each of those nodes. PBS provides commands that can do this. For example, consider an MPI job that reads a file from /jobfs on all processes and writes per-process output to /jobfs:

#PBS -l ncpus=96
#PBS -l mem=190GB
#PBS -l walltime=1:00:00
#PBS -l storage=gdata/ab12+scratch/cd34
#PBS -l jobfs=200GB
#PBS -l wd

### Stage files to jobfs
for node in $( uniq < "${PBS_NODEFILE}" ); do
    pbs_tmrsh "${node}" mkdir "${PBS_JOBFS}"/input "${PBS_JOBFS}"/output
    pbs_tmrsh "${node}" cp ./input_files/* "${PBS_JOBFS}"/input &
done

wait

module load openmpi/4.1.4
mpirun ./myapp -i "${PBS_JOBFS}"/input -o "${PBS_JOBFS}"/output

### Stage output files
mkdir output_files
for node in $( uniq < "${PBS_NODEFILE}" ); do
    ### Make separate local directories for each node as there may
    ### be naming conflicts
    mkdir output_files/"${node}"
    pbs_tmrsh "${node}" cp "${PBS_JOBFS}"/output/* output_files/"${node}"/ &
done

wait

Note that you do not need to use scp or rsync when copying to and from the remote nodes, as every node in a job can read from and write to the global file systems. The use of & after the pbs_tmrsh commands and wait allows the copies to happen in parallel, which can provide a speedup when copying to and from /scratch or /g/data.
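
For workflows driven from Python, the same structure (one entry per node, start everything, then wait) might be sketched as follows. The function names unique_nodes and run_on_nodes are hypothetical, and the sketch assumes pbs_tmrsh is on the PATH inside the job, as it is in the script above.

```python
import subprocess

def unique_nodes(nodefile_path):
    """Return hostnames from a PBS node file, one entry per node, in order."""
    seen = []
    with open(nodefile_path) as fh:
        for line in fh:
            host = line.strip()
            if host and host not in seen:
                seen.append(host)
    return seen

def run_on_nodes(nodes, command, launcher="pbs_tmrsh"):
    """Start command on every node at once, then wait for all of them.

    Returns a dict mapping each node name to its exit code.
    """
    # Popen returns immediately, so all nodes work in parallel,
    # much like appending & to each command in the shell.
    procs = {node: subprocess.Popen([launcher, node, *command]) for node in nodes}
    # wait() blocks until each process ends, like the shell's wait.
    return {node: proc.wait() for node, proc in procs.items()}
```

Inside a job this could be called as run_on_nodes(unique_nodes(os.environ["PBS_NODEFILE"]), ["mkdir", "-p", target]), where target is whichever directory needs to exist on each node.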

What if jobfs isn’t big enough?

NCI provides a second job-level file system that can be attached to PBS jobs on request, called ‘IO Intensive’. If your jobs are using more than the maximum jobfs request of 440GB per node, you can request the -liointensive=n PBS resource, where n is a multiple of the number of nodes of the job. A request of -liointensive=n gives n TB of node-local storage distributed evenly across the nodes of a job. Unlike /jobfs, which has job-specific paths, the path to the IO Intensive file system is always /iointensive. To use it, you will need to configure your jobs as above but using /iointensive in place of $PBS_JOBFS, and files will need to be staged in and out on a per-node basis. You may also elect to change the $TMPDIR variable to point to /iointensive as well. Note that IO Intensive is a limited resource, so only request it if it is absolutely necessary.
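
The sizing rule above can be checked with a few lines of arithmetic before submitting a job. This is a sketch of the rule as described in this section, not an NCI-provided tool, and the function name iointensive_per_node_tb is hypothetical.

```python
def iointensive_per_node_tb(n, nodes):
    """Per-node share, in TB, of a -liointensive=n request.

    Raises ValueError when n is not a positive multiple of the node
    count, following the rule described above.
    """
    if nodes < 1 or n < 1 or n % nodes != 0:
        raise ValueError(
            f"iointensive={n} is not a positive multiple of {nodes} nodes"
        )
    # The storage is distributed evenly across the nodes of the job.
    return n // nodes

print(iointensive_per_node_tb(4, 2))  # a 2-node job requesting 4 TB in total
```

In this example each of the two nodes would see 2 TB at /iointensive, whereas a request of 3 on two nodes would be rejected by the check.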

Conclusion

PBS Jobfs is a good resource to take advantage of when your workflow produces a large number of small files. In many cases your jobs will already be using it, but some applications do not respect canonical Linux environment variables and must be told explicitly to place their temporary files there. There are also cases where manually placing files in PBS Jobfs, or writing analysis or model output there and staging it back to the global file systems, can improve performance. It also helps with quota management, as all files in jobfs are cleaned up after the job finishes, meaning you won’t have temporary files from old model runs clogging up your /home or /g/data quota.