Should we expose the rook outputs via object store?
Overview
Some users of the rook WPS will be building workflows in tools such as JupyterLab which would benefit from being able to read partial files remotely. This approach would work if we made the outputs visible via our (CEDA and DKRZ) object stores. There are a number of important issues related to this proposal. This document discusses them.
Issues
Should we write to object store as well as POSIX?
The current system, writing NetCDF files to POSIX file systems, works well. We propose that it should remain the primary way in which all WPS processes generate and share their outputs.
The proposal for object store would work by one of the two following methods:
- Write NetCDF to POSIX, and expose those NetCDF files via an object store interface as well
- Write to a Zarr cache on object store
Can we expose the same NetCDF files via POSIX and object store?
If the service provider (i.e. DKRZ or CEDA) has the capability to expose the same storage as both POSIX and object store, then the most efficient approach would be to:
- Write NetCDF files as usual
- Generate a ReferenceFileSystem file (see: https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/hdf.py) to describe the contents of the NetCDF files (with URL addresses in object store)
- Return both references in the response:
- download URLs to NetCDF files
- a ReferenceFileSystem file that can be read by the xarray/zarr libraries
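To make the idea concrete, here is a minimal, stdlib-only sketch of what a ReferenceFileSystem (version 1) document looks like. In practice the file would be generated by fsspec-reference-maker (now kerchunk) scanning the HDF5 layout of each NetCDF file; the bucket URL, variable name, byte offset and length below are purely illustrative.

```python
import json

# Minimal sketch of a ReferenceFileSystem (version 1) document.
# Zarr metadata keys are stored inline as strings; chunk keys point at
# byte ranges inside the original NetCDF file held in object store,
# expressed as [url, offset, length]. All values here are made up.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "tas/0.0.0": ["https://objstore.example.org/outputs/tas.nc", 20480, 131072],
    },
}

reference_json = json.dumps(refs, indent=2)
print(reference_json)
```

A client could then open such a file with fsspec's "reference" filesystem and read it through the xarray/zarr stack, fetching only the byte ranges it needs rather than downloading whole NetCDF files.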
How would a separate Zarr cache work?
There are various issues to resolve if we want to write a Zarr cache of the outputs.
Zarr cache issue 1: Zarr files should not be split like NetCDF files
The chunking/splitting mechanism in clisops exists to avoid excessive memory usage by any single process. The code splits each dataset into manageable chunks before writing them to separate NetCDF files.
The philosophy for managing and writing Zarr files is different: we would expect Xarray/Dask to manage the chunking itself so we would not need to separate out the chunks and handle them as individual files. Enabling both approaches simultaneously in clisops is likely to involve some significant refactoring of the workflow.
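To illustrate the contrast, here is a hedged, stdlib-only sketch of the NetCDF-splitting style of logic: given an estimated dataset size, work out which time-index ranges go into each output file so no single file exceeds a size limit. The function name and the 4 GB cap are illustrative, not the actual clisops defaults; with Zarr, this step would disappear because xarray/Dask manages chunking internally.

```python
# Hypothetical sketch of the POSIX/NetCDF splitting strategy: compute
# (start, stop) time-index ranges, one per output file, so that each
# file stays under a size cap. Names and the 4 GB limit are
# illustrative only.

def split_into_files(n_times: int, bytes_per_time: int,
                     max_file_bytes: int = 4 * 1024**3):
    """Return (start, stop) time-index ranges, one per output file."""
    times_per_file = max(1, max_file_bytes // bytes_per_time)
    return [(start, min(start + times_per_file, n_times))
            for start in range(0, n_times, times_per_file)]

# e.g. 1000 time steps at ~50 MB each, capped at 4 GB per file:
ranges = split_into_files(1000, 50 * 1024**2)
print(len(ranges), ranges[0])  # 13 files of up to 81 time steps each
```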
Zarr cache issue 2: Access control
Assuming that the output data might require some level of access control, we would expect to use some kind of token-based access.
Alternatively (or in addition), we could include a job identifier in the output path, which would be known only to the client that receives the response.
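The job-identifier approach can be sketched in a few lines of stdlib Python; the bucket name and path layout below are hypothetical.

```python
import secrets

# Sketch: embed an unguessable job identifier in the output prefix, so
# only the client that receives the WPS response knows where to look.
# The 'rook-outputs' bucket name and file name are illustrative.
job_id = secrets.token_urlsafe(16)
output_prefix = f"rook-outputs/{job_id}/tas_subset.nc"
print(output_prefix)
```

This gives capability-style access (knowing the URL grants access), which is weaker than token-based authentication but much simpler to deploy.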
Zarr cache issue 3: Managing the volume/duration of the cache
The cache will need to be managed. A simple solution would be to define some controlling parameters for managing buckets such as:
- time_window (i.e. the number of days a bucket is open for writing outputs to)
- lifetime (i.e. the number of days/months before a bucket will be deleted)
This approach would allow a simple workflow in the roocs stack as follows:
- Get current bucket ID
- Create bucket if does not exist
- Write the outputs to the current bucket
- Return response to client that includes references to that bucket
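The bucket-naming side of this workflow can be sketched with the stdlib alone: derive the current bucket ID from the time_window parameter, so that all outputs written within the same window land in the same bucket. The naming scheme and the 7-day window are hypothetical.

```python
from datetime import date

# Sketch: all jobs running in the same time window share one bucket.
# The 'rook-cache-' prefix and 7-day window are illustrative values.
TIME_WINDOW_DAYS = 7

def current_bucket_id(today: date, window_days: int = TIME_WINDOW_DAYS) -> str:
    """Name the bucket after the date its time window opened."""
    window = today.toordinal() // window_days
    opened = date.fromordinal(window * window_days)
    return f"rook-cache-{opened.isoformat()}"

bucket = current_bucket_id(date(2021, 6, 15))
print(bucket)  # rook-cache-2021-06-13
```

Creating the bucket when it does not yet exist would be a call to the provider's object store API (e.g. an S3-style create-bucket request) guarded by an existence check.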
Separately, a scheduled task can run on the server as follows:
- Find buckets that have an expired lifetime
- Delete those buckets
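The clean-up task can also be sketched with the stdlib, assuming the hypothetical 'rook-cache-YYYY-MM-DD' naming scheme in which each bucket name encodes the date its window opened; the 90-day lifetime is an illustrative value.

```python
from datetime import date, timedelta

# Sketch of the scheduled clean-up: parse the opening date out of each
# bucket name and flag buckets older than the lifetime. The naming
# scheme and 90-day lifetime are hypothetical.
LIFETIME_DAYS = 90

def expired_buckets(bucket_names, today: date,
                    lifetime_days: int = LIFETIME_DAYS):
    expired = []
    for name in bucket_names:
        opened = date.fromisoformat(name.removeprefix("rook-cache-"))
        if today - opened > timedelta(days=lifetime_days):
            expired.append(name)
    return expired

names = ["rook-cache-2021-01-03", "rook-cache-2021-06-13"]
print(expired_buckets(names, today=date(2021, 6, 15)))
```

The actual deletion would again go through the object store API (an S3-style bucket must typically be emptied before it can be deleted); some providers also offer built-in lifecycle/expiry rules that could replace this task entirely.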
We need publicly accessible object stores
Note that this approach will only work if the object store interface is accessible on the wider internet (i.e. not within institutional firewalls).