Should we expose the rook outputs via object store?
Overview
Some users of the rook WPS will be building workflows in tools such as JupyterLab which would benefit from being able to read partial files remotely. This approach would work if we made the outputs visible via our (CEDA and DKRZ) object stores. There are a number of important issues related to this proposal. This document discusses them.
Issues
Should we write to object store as well as POSIX?
The current system, writing NetCDF files to POSIX file systems, works well. We propose that it should remain the primary way in which all WPS processes generate and share their outputs.
The proposal for object store would work by one of the two following methods:
- Write NetCDF to POSIX, and expose those NetCDF files via an object store interface as well
- Write to a Zarr cache on object store
Can we expose the same NetCDF files via POSIX and object store?
If the service provider (i.e. DKRZ or CEDA) has the capability to expose the same storage as both POSIX and object store, then the most efficient approach would be to:
- Write NetCDF files as usual
- Generate a ReferenceFileSystem file (see: https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/hdf.py) to describe the contents of the NetCDF files (with URL addresses in object store)
- Return both references in the response:
- download URLs to NetCDF files
- a ReferenceFileSystem file that can be read by the xarray/zarr libraries
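To make the idea concrete, here is a minimal, stdlib-only sketch of what a ReferenceFileSystem (version 1) document looks like. In practice the file would be generated by fsspec-reference-maker (now kerchunk) scanning the HDF5 layout of each NetCDF file; the bucket URL, variable name, byte offset and length below are purely illustrative.

```python
import json

# Minimal sketch of a ReferenceFileSystem (version 1) document.
# Zarr metadata keys are stored inline as strings; chunk keys point at
# byte ranges inside the original NetCDF file held in object store,
# expressed as [url, offset, length]. All values here are made up.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "tas/0.0.0": ["https://objstore.example.org/outputs/tas.nc", 20480, 131072],
    },
}

reference_json = json.dumps(refs, indent=2)
print(reference_json)
```

A client could then open such a file with fsspec's "reference" filesystem and read it through the xarray/zarr stack, fetching only the byte ranges it needs rather than downloading whole NetCDF files.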
How would a separate Zarr cache work?
There are various issues to resolve if we want to write a Zarr cache of the outputs.
Zarr cache issue 1: Zarr files should not be split like NetCDF files
The chunking/splitting mechanism in clisops exists to avoid excessive memory usage by any single process. The code splits each dataset into manageable chunks before writing them to separate NetCDF files.
The philosophy for managing and writing Zarr files is different: we would expect Xarray/Dask to manage the chunking itself so we would not need to separate out the chunks and handle them as individual files. Enabling both approaches simultaneously in clisops is likely to involve some significant refactoring of the workflow.
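To illustrate the contrast, here is a hedged, stdlib-only sketch of the NetCDF-splitting style of logic: given an estimated dataset size, work out which time-index ranges go into each output file so no single file exceeds a size limit. The function name and the 4 GB cap are illustrative, not the actual clisops defaults; with Zarr, this step would disappear because xarray/Dask manages chunking internally.

```python
# Hypothetical sketch of the POSIX/NetCDF splitting strategy: compute
# (start, stop) time-index ranges, one per output file, so that each
# file stays under a size cap. Names and the 4 GB limit are
# illustrative only.

def split_into_files(n_times: int, bytes_per_time: int,
                     max_file_bytes: int = 4 * 1024**3):
    """Return (start, stop) time-index ranges, one per output file."""
    times_per_file = max(1, max_file_bytes // bytes_per_time)
    return [(start, min(start + times_per_file, n_times))
            for start in range(0, n_times, times_per_file)]

# e.g. 1000 time steps at ~50 MB each, capped at 4 GB per file:
ranges = split_into_files(1000, 50 * 1024**2)
print(len(ranges), ranges[0])  # 13 files of up to 81 time steps each
```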
Zarr cache issue 2: Access control
Assuming that the output data might require some level of access control, we would expect to use some kind of token-based access.
Alternatively (or in addition), we could include a job identifier in the output path, which would be known only to the client that receives the response.
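The job-identifier approach can be sketched in a few lines of stdlib Python; the bucket name and path layout below are hypothetical.

```python
import secrets

# Sketch: embed an unguessable job identifier in the output prefix, so
# only the client that receives the WPS response knows where to look.
# The 'rook-outputs' bucket name and file name are illustrative.
job_id = secrets.token_urlsafe(16)
output_prefix = f"rook-outputs/{job_id}/tas_subset.nc"
print(output_prefix)
```

This gives capability-style access (knowing the URL grants access), which is weaker than token-based authentication but much simpler to deploy.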
Zarr cache issue 3: Managing the volume/duration of the cache
The cache will need to be managed. A simple solution would be to define some controlling parameters for managing buckets such as:
- time_window (i.e. the number of days a bucket is open for writing outputs to)
- lifetime (i.e. the number of days/months before a bucket will be deleted)
This approach would allow a simple workflow in the roocs stack as follows:
- Get current bucket ID
- Create bucket if does not exist
- Write the outputs to the current bucket
- Return response to client that includes references to that bucket
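The bucket-naming side of this workflow can be sketched with the stdlib alone: derive the current bucket ID from the time_window parameter, so that all outputs written within the same window land in the same bucket. The naming scheme and the 7-day window are hypothetical.

```python
from datetime import date

# Sketch: all jobs running in the same time window share one bucket.
# The 'rook-cache-' prefix and 7-day window are illustrative values.
TIME_WINDOW_DAYS = 7

def current_bucket_id(today: date, window_days: int = TIME_WINDOW_DAYS) -> str:
    """Name the bucket after the date its time window opened."""
    window = today.toordinal() // window_days
    opened = date.fromordinal(window * window_days)
    return f"rook-cache-{opened.isoformat()}"

bucket = current_bucket_id(date(2021, 6, 15))
print(bucket)  # rook-cache-2021-06-13
```

Creating the bucket when it does not yet exist would be a call to the provider's object store API (e.g. an S3-style create-bucket request) guarded by an existence check.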
Separately, a scheduled task can run on the server as follows:
- Find buckets that have an expired lifetime
- Delete those buckets
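The clean-up task can also be sketched with the stdlib, assuming the hypothetical 'rook-cache-YYYY-MM-DD' naming scheme in which each bucket name encodes the date its window opened; the 90-day lifetime is an illustrative value.

```python
from datetime import date, timedelta

# Sketch of the scheduled clean-up: parse the opening date out of each
# bucket name and flag buckets older than the lifetime. The naming
# scheme and 90-day lifetime are hypothetical.
LIFETIME_DAYS = 90

def expired_buckets(bucket_names, today: date,
                    lifetime_days: int = LIFETIME_DAYS):
    expired = []
    for name in bucket_names:
        opened = date.fromisoformat(name.removeprefix("rook-cache-"))
        if today - opened > timedelta(days=lifetime_days):
            expired.append(name)
    return expired

names = ["rook-cache-2021-01-03", "rook-cache-2021-06-13"]
print(expired_buckets(names, today=date(2021, 6, 15)))
```

The actual deletion would again go through the object store API (an S3-style bucket must typically be emptied before it can be deleted); some providers also offer built-in lifecycle/expiry rules that could replace this task entirely.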
We need publicly accessible object stores
Note that this approach will only work if the object store interface is accessible on the wider internet (i.e. not within institutional firewalls).