Forecast data is multi-dimensional. Now that we have lovely zarr-format data, it is literally structured into scalable sub-directories like this (in simplified form), and the same holds figuratively for any Hierarchical Data Format type (HDF5, netCDF4):
```
.
└── level
    ├── isobaricInhPa
    :
    └── surface
        │
        variable
        ├── 2m Temperature
        :
        └── Total Precipitation
            │
            array
            └── (time, timestep, member, lat., lon.)
```
where for inference we need to match the forecast to the valid time, i.e.,

time + timedelta(timestep) == valid_time
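In xarray terms this matching could look like the following minimal sketch, assuming a store laid out as above (the path and coordinate names here are illustrative, not the package API):

```python
import numpy as np
import xarray as xr

ds = xr.open_zarr("forecast.zarr")  # hypothetical store with the layout above

init_time = np.datetime64("2021-06-01T00:00")
valid_time = np.datetime64("2021-06-01T06:00")

# pick the timestep whose forecast lands on the valid time
step = valid_time - init_time  # an np.timedelta64
forecast = ds.sel(time=init_time, timestep=step)
```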
Other than matching forecast to valid time, training on weather data may require formatting the data in different ways. For example, to simulate irregularly sampled or discontinuous data, it may be desirable to stochastically generate discontinuous paths; one way to do this is sketched below.
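As an illustration of the idea (not the package's actual sampler), one can draw a sorted random subset of time indices to form a discontinuous path:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_discontinuous_path(n_times: int, n_samples: int) -> np.ndarray:
    """Draw a sorted random subset of time indices, simulating
    irregular sampling along the time axis."""
    idx = rng.choice(n_times, size=n_samples, replace=False)
    return np.sort(idx)

# e.g. pick 16 irregular steps out of a 240-step record
path = sample_discontinuous_path(240, 16)
```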
Moreover, to build graph networks we may want a nearest-neighbour search per batch based on the radial coordinates (with lat/lon in radians and elev the radial distance):

```python
import numpy as np

# lat, lon in radians; elev is the radial distance
x = elev * np.cos(lat) * np.cos(lon)
y = elev * np.cos(lat) * np.sin(lon)
z = elev * np.sin(lat)
```
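A per-batch nearest-neighbour search over these Cartesian points could then be sketched with scipy's cKDTree (illustrative only; knn_edges is not part of the package):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_edges(lat, lon, elev, k=4):
    """Return (source, target) index pairs linking each node to its
    k nearest neighbours in Cartesian space. lat/lon in radians."""
    pts = np.stack(
        [
            elev * np.cos(lat) * np.cos(lon),
            elev * np.cos(lat) * np.sin(lon),
            elev * np.sin(lat),
        ],
        axis=-1,
    )
    # query k + 1 neighbours because each point's nearest match is itself
    _, nbrs = cKDTree(pts).query(pts, k=k + 1)
    src = np.repeat(np.arange(len(pts)), k)
    dst = nbrs[:, 1:].ravel()  # drop the self-match column
    return src, dst
```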
This package grew out of the necessity to switch between different model configurations and input formats (i.e., discontinuous data, graphs, or simply forecast data with a varying number of input features).
When loading in truth data there is no need to stream or lazy-load, as the magnitude of the data is low and the optimisation within PyTorch's DataLoader is enough.
However, it gets more complicated when we also want to load multiple forecast variables at the same time, instead of just truth data: it is simply far more data to load at once! We plan to utilise PyTorch's IterableDataset class to stream in batches, but there is a catch: what if we want dynamic sampling of the data? One could of course create a sampler, but for more flexibility we plan the following dask-optimised route, sketched below (note: this unpacks what is under the hood in xr.open_mfdataset).
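A minimal sketch of what such a streaming dataset could look like, assuming a single zarr store opened lazily with dask (the class and names here are illustrative, not the planned API):

```python
import numpy as np
import xarray as xr
from torch.utils.data import IterableDataset


class ZarrStream(IterableDataset):
    """Stream random batches from a lazily opened zarr store,
    loading only the selected slices into memory."""

    def __init__(self, path: str, batch_size: int = 8):
        # chunks={} keeps the store lazy (dask-backed, on-disk chunking)
        self.ds = xr.open_zarr(path, chunks={})
        self.batch_size = batch_size
        self.rng = np.random.default_rng()

    def __iter__(self):
        n = self.ds.sizes["time"]
        while True:
            # dynamic sampling: draw a fresh random set of times per batch
            idx = self.rng.choice(n, size=self.batch_size, replace=False)
            yield self.ds.isel(time=np.sort(idx)).load()
```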
Functions such as for_NJ can be used to generate irregularly sampled fields or to perform greedy nearest-neighbour searches and return edge features.
The following link contains rainfall data from IMERG, sampled at a 30-minute frequency for the Greater Horn of Africa region: https://drive.google.com/file/d/116xsLEtRntjWOljMG4yE71_uelgIXoox/view?usp=sharing
One can download it and set its path in utils.py:
```python
TRUTH_PATH = (
    "../example_datasets/"
)
```
Then, as shown in test_xbatcher.ipynb, you can generate batches on the fly:
```python
import xarray_batcher as xb

dl = xb.DataModule()
for d in dl:
    print(d)
    break
```
This will call the generator function to randomly sample discontinuous paths of rainfall images. Patching is done internally to return 128x128 images, or the whole domain for the validation set; a sketch of the patching step follows.
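For illustration, the random-patching step might look like this sketch over an xarray field (the internal patcher may differ; random_patch is hypothetical):

```python
import numpy as np
import xarray as xr

def random_patch(da: xr.DataArray, size: int = 128,
                 rng=np.random.default_rng()) -> xr.DataArray:
    """Cut a random size x size window out of the lat/lon plane."""
    i = rng.integers(0, da.sizes["lat"] - size + 1)
    j = rng.integers(0, da.sizes["lon"] - size + 1)
    return da.isel(lat=slice(i, i + size), lon=slice(j, j + size))
```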
Check the issue tracker here. The most pressing changes are:
- Deprecate the initial GFS load-in and TensorFlow batcher modalities using kerchunk (moved to experimental_tensorflow)
- Add an IterableDataset switch: a one-size-fits-all solution may not be possible, but once we load both forecast and truth data we need to switch to data streaming (potentially with a reload after n epochs, though preliminary experiments suggest that is overkill). The main steps here are to refactor the dask-based open_mfzarr and to modify functions to allow slicing on single files.
- Allow a custom collate function that returns numpy arrays, and implement np.savez for JAX
- Allow loading from a pre-created npz file with memory mapping (see the sketch after this list)
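As a rough sketch of the last two items: np.savez covers the saving side, while np.load's mmap_mode gives memory-mapped reads for plain .npy files (memory-mapping members of an .npz archive directly is less straightforward, since members are read into memory on access). File names below are illustrative:

```python
import numpy as np

# save a batch archive (e.g. from the custom collate function)
np.savez("batches.npz", x=np.zeros((64, 128, 128)), y=np.ones((64,)))

# .npz members are read fully into memory when accessed
with np.load("batches.npz") as data:
    x = data["x"]

# for true memory mapping, plain .npy files can be opened lazily
np.save("x.npy", np.zeros((64, 128, 128)))
x_mm = np.load("x.npy", mmap_mode="r")  # slices read from disk on demand
first = x_mm[:8]
```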
:::info
Find this document incomplete? Leave a comment!
:::
