Using OPeNDAP to access data remotely: MUR example

One of our researchers recently asked me to download the MUR (Multi-scale Ultra-high Resolution SST) dataset. She is interested in the entire available period, but only for a small region. This dataset is relatively small but has many files (several for each day across 19 years) and it is updated frequently. This means that we would also have to update and check the dataset frequently, and the files would be stored across several sub-directories, making access more complicated.

Fortunately, this data is available via OPeNDAP. OPeNDAP is web-based software that allows users to access datasets remotely. Many analysis tools recognise an OPeNDAP url as a filename. An OPeNDAP url usually consists of the remote address of the file followed by optional constraints.

This is one of the advantages of OPeNDAP: you don't need to download a file before using it. You can simply subset the portion you need, and the software you are using will load only that data. The next time you run the same analysis, if the data has been updated, you will automatically be using the updated dataset.

OPeNDAP url

Let's check an example using a test server:

http://test.opendap.org:80/opendap/data/nc/sst.mnmean.nc.gz.html
If you copy and paste the above url into your browser you will see what an OPeNDAP form looks like.

Let's split this url:
test.opendap.org:80/opendap/data
is the root of the OPeNDAP catalogue. Starting from this url you can browse down the available subdirectories, in our case /nc/, which indicates netCDF files.
Finally, the filename is
sst.mnmean.nc.gz
Note that in this example the file is compressed with gzip: OPeNDAP can access compressed files without you needing to download and uncompress them first. If you want to select only some variables you can do so by adding some constraints:
?sst,lat
The constraint syntax is a question mark followed by a comma-separated list of variables. Each variable can be indexed with one [start:stride:stop] range per dimension, for example
http://test.opendap.org:80/opendap/data/nc/sst.mnmean.nc.gz?sst[10:2:18][10:1:28][100:1:120]
will return a subset of the sst array with:

  • every second timestep from index 10 to 18 included
  • lat from index 10 to 28 included
  • lon from index 100 to 120 included
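
The index ranges above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper names `hyperslab`, `selected_indexes` and `constrained_url` are hypothetical, chosen for illustration, and not part of any OPeNDAP client.

```python
def hyperslab(start, stride, stop):
    """Format one [start:stride:stop] index range (stop is inclusive)."""
    return f"[{start}:{stride}:{stop}]"


def selected_indexes(start, stride, stop):
    """List the indexes that a [start:stride:stop] range selects."""
    # Unlike a Python slice, the stop index is included.
    return list(range(start, stop + 1, stride))


def constrained_url(base_url, variable, ranges):
    """Append a constraint for one variable to an OPeNDAP url.

    ranges is a sequence of (start, stride, stop) tuples, one per dimension.
    """
    slabs = "".join(hyperslab(*r) for r in ranges)
    return f"{base_url}?{variable}{slabs}"


base = "http://test.opendap.org:80/opendap/data/nc/sst.mnmean.nc.gz"
url = constrained_url(base, "sst", [(10, 2, 18), (10, 1, 28), (100, 1, 120)])
print(url)
print(selected_indexes(10, 2, 18))
```

The `selected_indexes` helper is only there to make the inclusive stop explicit, since it differs from the usual Python slicing convention.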

An easy way to build the url is to use the form to select what you want: the data_url box will update itself and show you the url you need to get exactly what you selected.

You don't need to subset a variable, or even to specify any of them. Constraints are useful when you want to select only a specific variable, region or time range.
The downside is that you usually have to retrieve the dimensions first to work out which indexes to use.
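
As an illustration of that manual step, here is a minimal sketch of turning a latitude range into the index range a constraint needs. It assumes you have already retrieved the coordinate values as a sorted list; the `lat` values below are made up for the example (a regular 1-degree grid, not the grid of any real file), and `index_range` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right


def index_range(coords, lower, upper):
    """Return (first, last) indexes of sorted coords within [lower, upper].

    Both indexes are inclusive, matching the OPeNDAP constraint syntax.
    """
    first = bisect_left(coords, lower)
    last = bisect_right(coords, upper) - 1
    if first > last:
        raise ValueError("no coordinate values fall in the requested range")
    return first, last


# Made-up latitude axis: -89.5, -88.5, ..., 89.5
lat = [-89.5 + i for i in range(180)]

first, last = index_range(lat, -54.0, -14.0)
# Build the index part of a constraint for the lat dimension
print(f"lat[{first}:1:{last}]")
```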
We will now see how xarray and Python can help you skip this step.

Accessing OPeNDAP in python with xarray

I am using xarray to open one file from the MUR dataset, load the data and select the time and lat/lon ranges.

In [1]:
import xarray as xa

If I knew exactly which indexes I'm interested in, I could add a constraint to the data url below and get back only a subset of the dataset.
Since we are using xarray we don't have to worry about that: xarray initially loads only the information about the data, not the values.

In [2]:
dap_url = "https://podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
# Open the remote file lazily: the variable values are not downloaded at this point
data = xa.open_dataset(dap_url)

I can select the sst variable and a specific region using latitude and longitude values, just as I would after loading data from a netCDF file. In fact, xarray showed me the variable names and dimensions after I "connected" to the remote file.
This way I'm loading directly only the values I'm interested in.

In [3]:
sst=data['analysed_sst'].sel(lat=slice(-53.99,-14), lon=slice(140,170))
sst
Out[3]:
<xarray.DataArray 'analysed_sst' (time: 1, lat: 3999, lon: 3001)>
[12000999 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 2002-06-01T09:00:00
  * lat      (lat) float32 -53.98 -53.97 -53.96 -53.95 ... -14.02 -14.01 -14.0
  * lon      (lon) float32 140.0 140.01 140.02 140.03 ... 169.98 169.99 170.0
Attributes:
    long_name:      analysed sea surface temperature
    standard_name:  sea_surface_foundation_temperature
    units:          kelvin
    valid_min:      -32767
    valid_max:      32767
    comment:        "Final" version using Multi-Resolution Variational Analys...
    source:         AMSRE-REMSS, AVHRR_Pathfinder-PFV5.2-NODC_day, AVHRR_Path...

Aggregated virtual files

Another powerful feature of OPeNDAP is that it also works with virtually aggregated datasets. This sounds complicated, but all you need to know is that a multi-file dataset can be made visible as a single file. You can then access potentially thousands of files via a single url.

The MUR dataset is available as a virtually aggregated file, so we can use this version of the data to get the complete SST timeseries from a single url.

In [4]:
aggr_url = "https://thredds.jpl.nasa.gov/thredds/dodsC/OceanTemperature/MUR-JPL-L4-GLOB-v4.1.nc"
data = xa.open_dataset(aggr_url)

I loaded the data in the same way, and I'm going to select sst and my region of interest in exactly the same way.

In [5]:
sst=data['analysed_sst'].sel(lat=slice(-53.99,-14), lon=slice(140,170))
sst
Out[5]:
<xarray.DataArray 'analysed_sst' (time: 6106, lat: 3999, lon: 3001)>
[73278099894 values with dtype=float32]
Coordinates:
  * lat      (lat) float32 -53.98 -53.97 -53.96 -53.95 ... -14.02 -14.01 -14.0
  * lon      (lon) float32 140.0 140.01 140.02 140.03 ... 169.98 169.99 170.0
  * time     (time) datetime64[ns] 2002-06-01T09:00:00 ... 2019-02-17T09:00:00
Attributes:
    long_name:      analysed sea surface temperature
    standard_name:  sea_surface_foundation_temperature
    units:          kelvin
    valid_min:      -32767
    valid_max:      32767
    comment:        "Final" version using Multi-Resolution Variational Analys...
    source:         AVHRR18_G-NAVO, AVHRR19_G-NAVO, AVHRR_METOP_A-EUMETSAT, M...
    _ChunkSizes:    [   1 1023 2047]
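
Since xarray has not fetched the values at this stage, it can be worth checking how large the selection is before asking for them. The following back-of-the-envelope sketch uses the dimension sizes and the float32 dtype printed in Out[3] and Out[5]. It is only a rough estimate that ignores compression, packing and transfer overhead, and `estimated_size_gb` is a hypothetical helper.

```python
from math import prod

BYTES_PER_FLOAT32 = 4


def estimated_size_gb(shape, bytes_per_value=BYTES_PER_FLOAT32):
    """Rough in-memory size in gigabytes (10**9 bytes) for an array shape."""
    return prod(shape) * bytes_per_value / 1e9


single_day = (1, 3999, 3001)       # shape shown in Out[3]
full_series = (6106, 3999, 3001)   # shape shown in Out[5]

print(f"single day:  {estimated_size_gb(single_day):.3f} GB")
print(f"full series: {estimated_size_gb(full_series):.1f} GB")
```

If the estimate is larger than the memory you have available, narrowing the time range or the region further before loading the values is one way to keep the request manageable.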