An example of reading non-standard data using pandas and xarray


It is very convenient when data is available in a format like netCDF, because it can be easily loaded using xarray.open_dataset, particularly when there is also sufficient metadata so the data can be identified and used correctly.

Sometimes data is only available in a non-standard form, which takes some work to put into a convenient form like an xarray.Dataset. The purpose of this blog post is to step through an example of importing just such a dataset, transforming it into an easy-to-use xarray.Dataset, and then saving it as a netCDF file for convenient re-use.

Importing the Eastern Australia and New Zealand Drought Atlas (ANZDA) from NCEI, NOAA

About the data: The data is a reconstruction of the Palmer drought sensitivity index across eastern Australia, and is the first Southern Hemisphere gridded drought atlas extending back to CE 1500.

The data is available from this NOAA website:

https://www.ncei.noaa.gov/access/paleo-search/study/20245

which explains the provenance of the data, has extensive metadata in a number of formats, and provides a series of data files for download.

The data and grid information are stored in separate files. Looking first at the grid, it is available in a text file (anzda-pdsi-xy.txt) as lon-lat pairs that correspond to each grid cell.

The file is tab-delimited and formatted into columns: the first column is longitude, the second latitude.

168.25  -46.75  697     87      scPDSI.dat.
169.25  -46.25  699     88      scPDSI.dat.
169.75  -46.25  700     88      scPDSI.dat.
167.75  -45.75  696     89      scPDSI.dat.
168.25  -45.75  697     89      scPDSI.dat.
168.75  -45.75  698     89      scPDSI.dat.
169.25  -45.75  699     89      scPDSI.dat.
169.75  -45.75  700     89      scPDSI.dat.

As the data files are tab-separated columnar text files, the pandas library is a good choice for reading in the data and manipulating it. pandas is specifically designed for manipulating tabular data, and much of the machinery of xarray is based on pandas, but extended to n dimensions.

from datetime import datetime
import urllib.request

import numpy as np
import pandas as pd
import xarray as xr

Use the pandas.read_csv function to read the first two columns directly from the file object returned by urlopen, which takes the web address of the data file as its argument. The result is a pandas.DataFrame.

Note that the delimiter is specified to be a tab (\t), header=None indicates that there is no header line, and the columns are assigned useful names.

url = "https://www.ncei.noaa.gov/pub/data/paleo/treering/reconstructions/australia/palmer2015pdsi/anzda-pdsi-xy.txt"
xy = pd.read_csv(urllib.request.urlopen(url), 
                 delimiter='\t', 
                 header = None,
                 names = ('lon', 'lat'),
                 usecols = [0, 1])
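The same call can be tried without a network connection. The following is a minimal sketch, assuming a three-row sample typed in by hand (the lon and lat values are taken from the first rows of the grid, the variable names sample and xy_sample are made up for illustration) and read through io.StringIO instead of urlopen:

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the downloaded grid file:
# five tab-separated columns, no header line
sample = (
    "168.25\t-46.75\t697\t87\tscPDSI.dat.\n"
    "169.25\t-46.25\t699\t88\tscPDSI.dat.\n"
    "169.75\t-46.25\t700\t88\tscPDSI.dat.\n"
)

# Same options as the call above: only columns 0 and 1 are kept,
# and they are given the names lon and lat
xy_sample = pd.read_csv(
    io.StringIO(sample),
    delimiter="\t",
    header=None,
    names=["lon", "lat"],
    usecols=[0, 1],
)
print(xy_sample)
```

Because usecols selects two columns and names supplies two labels, the remaining three columns in the file are simply ignored.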

The head method is a useful way to get a small sample of the data to check it:

xy.head()

	lon	lat
0	168.25	-46.75
1	169.25	-46.25
2	169.75	-46.25
3	167.75	-45.75
4	168.25	-45.75

Plotting is also a great way to quickly check the data. Like xarray, pandas has a built-in plot method that calls out to another library to do the plotting. In this case the default is matplotlib.

xy.plot(x='lon', y='lat', kind='scatter');

[Figure: scatter plot of the grid cell locations, with lon on the x-axis and lat on the y-axis]

Now get the actual data. According to the dataset README it is also in tab-separated columnar ASCII format: one reconstruction per column, with the grid cell number as the column header, and one row per year, with the year in the first column.

As before, use the pandas.read_csv function to read the data in, specifying that the delimiter is a tab and that the first column should be treated as the row index. In this case the first line contains useful header information, so it is retained.

url = "https://www.ncei.noaa.gov/pub/data/paleo/treering/reconstructions/australia/palmer2015pdsi/anzda-recon.txt"
data = pd.read_csv(urllib.request.urlopen(url), delimiter='\t', index_col=0)
data

	1	2	3	4	5	6	7	8	9	10	...	1366	1367	1368	1369	1370	1371	1372	1373	1374	1375
Year
1500	0.047	0.192	0.694	0.617	1.331	0.758	1.105	0.320	0.487	0.919	...	0.149	-2.415	-2.645	-1.689	-0.309	-1.834	-0.861	0.409	0.294	-0.922
1501	-1.992	-1.599	-1.086	-1.904	-2.338	-1.585	-1.907	-2.004	-1.522	-1.424	...	0.558	-3.001	-2.530	-1.298	-0.397	0.203	-1.984	0.582	0.919	-0.882
1502	-0.551	-0.491	-0.333	-0.901	-0.550	-1.249	-1.408	-0.623	-0.296	-0.840	...	-0.860	-2.371	-3.420	-3.060	-2.128	-2.107	-0.905	-1.918	-0.450	-1.893
1503	-2.560	-2.793	-1.539	-2.449	-2.696	-2.232	-2.962	-2.576	-1.995	-0.980	...	0.531	-0.522	0.442	0.608	-1.790	0.352	-0.073	-1.621	1.114	-2.080
1504	-1.729	-1.835	-1.514	-1.830	-1.867	-1.225	-1.918	-2.063	-1.577	-0.832	...	-0.739	-2.633	-0.383	-0.389	-2.514	-1.226	-1.999	-2.465	-0.348	-2.525
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2008	-1.058	-0.663	-0.271	-1.061	-1.050	-0.613	-0.840	-0.549	-0.103	-0.461	...	2.295	1.619	-2.157	-1.629	1.634	1.566	2.204	1.155	1.769	2.149
2009	-0.072	0.554	0.662	0.923	1.133	0.621	0.410	0.438	1.037	0.895	...	2.048	0.789	-0.924	-0.589	1.424	0.920	1.550	0.982	2.232	1.643
2010	-0.488	0.077	0.295	0.006	1.873	0.237	0.083	0.114	0.400	0.299	...	0.505	-0.491	1.526	1.644	-0.016	-0.050	0.557	-0.255	1.056	0.949
2011	0.015	0.510	1.889	0.489	1.893	0.644	0.613	1.871	1.744	0.935	...	4.338	4.343	2.736	2.717	3.498	3.673	4.027	2.872	3.792	3.163
2012	0.218	0.490	0.519	0.506	0.766	0.614	0.503	0.391	0.344	1.219	...	3.412	2.993	1.799	2.064	2.841	2.936	3.320	2.085	3.422	2.846

513 rows × 1375 columns

The data is stored as a matrix with associated time and location vectors, so we need to convert from a “wide” to “long” format using the melt method, with the result that there is a row for every unique combination of year and location.

Specifying ignore_index=False means the Year index is retained as the index of the result rather than discarded. The method creates two new columns, with the cell index in one and the values from the table in the other. The default names are variable and value, but it is better to give them useful names: cellref and pdsi respectively.

data_long = data.melt(ignore_index=False, var_name = 'cellref', value_name='pdsi')
data_long

	cellref	pdsi
Year
1500	1	0.047
1501	1	-1.992
1502	1	-0.551
1503	1	-2.560
1504	1	-1.729
...	...	...
2008	1375	2.149
2009	1375	1.643
2010	1375	0.949
2011	1375	3.163
2012	1375	2.846

705375 rows × 2 columns
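To make the wide-to-long reshaping easier to follow, here is a small sketch on a made-up table of two years by three cells. The values and the names wide and long are hypothetical and only serve the illustration:

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per grid cell
wide = pd.DataFrame(
    {"1": [0.1, 0.2], "2": [0.3, 0.4], "3": [0.5, 0.6]},
    index=pd.Index([1500, 1501], name="Year"),
)

# Same call as above: keep the Year index, name the two new columns
long = wide.melt(ignore_index=False, var_name="cellref", value_name="pdsi")
print(long)

# There is one row for every combination of year and cell
print(len(long) == wide.shape[0] * wide.shape[1])
```

The same arithmetic explains the size of data_long: 513 years times 1375 cells gives 705375 rows.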

The next step is to add the lat and lon values for each cell index to the data table above. One way to achieve this is to add a cellref column to the xy table so it can be merged with data_long using the cellref variable as a key.

xy['cellref'] = range(1, len(xy)+1, 1)
xy

	lon	lat	cellref
0	168.25	-46.75	1
1	169.25	-46.25	2
2	169.75	-46.25	3
3	167.75	-45.75	4
4	168.25	-45.75	5
...	...	...	...
1370	142.75	-12.25	1371
1371	143.25	-12.25	1372
1372	142.25	-11.75	1373
1373	142.75	-11.75	1374
1374	142.75	-11.25	1375

1375 rows × 3 columns
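Numbering the rows this way assumes that the n-th row of the grid file describes the cell in column n of the data file. A quick consistency check is sketched below on small stand-in tables; xy_small and data_small are hypothetical names, and the same comparison could be run on xy and data:

```python
import pandas as pd

# Hypothetical stand-ins for the xy and data tables
xy_small = pd.DataFrame(
    {"lon": [168.25, 169.25, 169.75], "lat": [-46.75, -46.25, -46.25]}
)
data_small = pd.DataFrame(
    {"1": [0.1, 0.2], "2": [0.3, 0.4], "3": [0.5, 0.6]},
    index=pd.Index([1500, 1501], name="Year"),
)

# Number the grid cells from 1, in file order
xy_small["cellref"] = range(1, len(xy_small) + 1)

# The cell numbers should match the data column labels, in the same order
expected = [int(label) for label in data_small.columns]
labels_match = xy_small["cellref"].tolist() == expected
print(labels_match)
```

This only checks that the labels line up. It cannot confirm that the file order really matches the cell numbering, which has to come from the dataset documentation.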

Now try the merge:

data_long.merge(xy, on='cellref')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [8], in <cell line: 1>()
----> 1 data_long.merge(xy, on='cellref')

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/pandas/core/frame.py:9190, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
@Substitution("")
@Appender(_merge_doc, indents=2)
def merge(
   (...)
   validate: str | None = None,
) -> DataFrame:
   from pandas.core.reshape.merge import merge
-> 9190     return merge(
       self,
       right,
       how=how,
       on=on,
       left_on=left_on,
       right_on=right_on,
       left_index=left_index,
       right_index=right_index,
       sort=sort,
       suffixes=suffixes,
       copy=copy,
       indicator=indicator,
       validate=validate,
   )

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/pandas/core/reshape/merge.py:106, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
@Substitution("\nleft : DataFrame or named Series")
@Appender(_merge_doc, indents=0)
def merge(
   (...)
   validate: str | None = None,
) -> DataFrame:
--> 106     op = _MergeOperation(
       left,
       right,
       how=how,
       on=on,
       left_on=left_on,
       right_on=right_on,
       left_index=left_index,
       right_index=right_index,
       sort=sort,
       suffixes=suffixes,
       copy=copy,
       indicator=indicator,
       validate=validate,
   )
   return op.get_result()

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/pandas/core/reshape/merge.py:703, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
(
   self.left_join_keys,
   self.right_join_keys,
   self.join_names,
) = self._get_merge_keys()
# validate the merge keys dtypes. We may need to coerce
# to avoid incompatible dtypes
--> 703 self._maybe_coerce_merge_keys()
# If argument passed to validate,
# check if columns specified as unique
# are in fact unique.
if validate is not None:

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/pandas/core/reshape/merge.py:1256, in _MergeOperation._maybe_coerce_merge_keys(self)
   # unless we are merging non-string-like with string-like
   elif (
       inferred_left in string_types and inferred_right not in string_types
   ) or (
       inferred_right in string_types and inferred_left not in string_types
   ):
-> 1256         raise ValueError(msg)
# datetimelikes must match exactly
elif needs_i8_conversion(lk.dtype) and not needs_i8_conversion(rk.dtype):

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

The merge fails with an error message:

You are trying to merge on object and int64 columns.

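Inspecting the types (dtypes) of the two tables shows the issue: the cellref column in data_long has type object, because the column labels in the header line of the data file were read as strings, whereas cellref in xy is int64.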

xy.dtypes

lon        float64
lat        float64
cellref      int64
dtype: object

data_long.dtypes

cellref     object
pdsi       float64
dtype: object
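The cause can be reproduced on a tiny example. This is a sketch assuming a hand-written two-cell, two-year extract in the same layout as the data file; the names text, small, long and grid are hypothetical:

```python
import io

import pandas as pd

# Hypothetical extract: header line with cell numbers, then one row per year
text = "Year\t1\t2\n1500\t0.047\t0.192\n1501\t-1.992\t-1.599\n"
small = pd.read_csv(io.StringIO(text), delimiter="\t", index_col=0)

# Column labels come from the header line, so they are strings
print(small.columns.tolist())

# After melt those string labels become the values of the cellref column
long = small.melt(ignore_index=False, var_name="cellref", value_name="pdsi")
print(long.dtypes)

# An integer cellref column on the other side cannot be matched to strings
grid = pd.DataFrame({"cellref": [1, 2], "lon": [168.25, 169.25]})
merge_error = None
try:
    long.merge(grid, on="cellref")
except ValueError as err:
    merge_error = err
    print(err)
```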

The solution is to replace the cellref column with the same data converted to a number, and try merging again.

data_long['cellref'] = pd.to_numeric(data_long['cellref'])
data_long

	cellref	pdsi
Year
1500	1	0.047
1501	1	-1.992
1502	1	-0.551
1503	1	-2.560
1504	1	-1.729
...	...	...
2008	1375	2.149
2009	1375	1.643
2010	1375	0.949
2011	1375	3.163
2012	1375	2.846

705375 rows × 2 columns

Note that reset_index converts the Year index into a normal column. This is necessary to retain the Year values, because the index is discarded when the merge is done on a column, as it is here with on='cellref'.

data_long = data_long.reset_index().merge(xy, on='cellref')
data_long

	Year	cellref	pdsi	lon	lat
0	1500	1	0.047	168.25	-46.75
1	1501	1	-1.992	168.25	-46.75
2	1502	1	-0.551	168.25	-46.75
3	1503	1	-2.560	168.25	-46.75
4	1504	1	-1.729	168.25	-46.75
...	...	...	...	...	...
705370	2008	1375	2.149	142.75	-11.25
705371	2009	1375	1.643	142.75	-11.25
705372	2010	1375	0.949	142.75	-11.25
705373	2011	1375	3.163	142.75	-11.25
705374	2012	1375	2.846	142.75	-11.25

705375 rows × 5 columns
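The effect of reset_index can be seen on a small example. The sketch below uses made-up tables (long and grid are hypothetical names) and adds the optional validate="many_to_one" argument of merge, which is not used in the original call, to check that each cellref appears only once in the grid table:

```python
import pandas as pd

# Hypothetical long-format values indexed by Year, plus a small grid table
long = pd.DataFrame(
    {"cellref": [1, 1, 2, 2], "pdsi": [0.047, -1.992, 0.192, -1.599]},
    index=pd.Index([1500, 1501, 1500, 1501], name="Year"),
)
grid = pd.DataFrame(
    {"lon": [168.25, 169.25], "lat": [-46.75, -46.25], "cellref": [1, 2]}
)

# Merging on a column discards the Year index of the left table
lost = long.merge(grid, on="cellref")
print("Year" in lost.columns)

# Resetting the index first keeps Year as an ordinary column;
# validate raises an error if cellref is duplicated in grid
kept = long.reset_index().merge(grid, on="cellref", validate="many_to_one")
print(kept)
```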

The resulting dataframe has a value of pdsi for every year at every location, with an associated lat and lon value.
