OME Metadata Support

*This issue captures the requirements as well as the possible implementation choices for a first integration of OME metadata into the OME-NGFF container. The UoD team is targeting the specification as well as implementations, including within OMERO, by mid-2022.*

## Goal

NGFF specifications up to version 0.4 contain only a minimal set of metadata fields from the existing OME model (e.g. physical pixel size). As a result, converting data into OME-NGFF, e.g. with `bioformats2raw`, loses more metadata than the equivalent conversion to OME-TIFF. **The goal of this issue is to achieve parity between the two formats in terms of capturing metadata contained in the OME model (2016-06).**

Here we would like to discuss, plan, and specify an initial integration of the OME model into OME-NGFF. As with other specifications, this initial work will likely be followed by multiple, possibly breaking, changes to expand the scope. Where possible, we will also try to capture that roadmap here.

## In-scope requirements

-   The primary requirement is the ability to convert an OME-TIFF in its entirety into an OME-NGFF dataset.

-   As a corollary, it should also be possible to capture what Bio-Formats knows about proprietary file formats (PFFs). This should cover, at a minimum, the intermediate Zarr output of `bioformats2raw` and be readable by all readers (not just `raw2ometiff`).

-   The specification should make clear how the combination of the OME model metadata and NGFF metadata is to be interpreted and what readers should do in the case of a conflict. 

## Out-of-scope requirements

-   This need not be the final mechanism used for OME-NGFF to store metadata. It is more important to capture the metadata that exists today.

-   It is not necessary, at least initially, that an OME-NGFF be fully convertible back into an OME-TIFF since, e.g., there is no location for labels, transformations, or file annotations in OME-TIFF.

----

## Design decision \#1: Location of metadata

The current NGFF specification uses only the format's custom attributes (`.zattrs`) for storing metadata. Several other locations are conceivable, though some fall further outside the bounds of the Zarr specification than others (see <https://github.com/zarr-developers/zarr-specs/issues/112> for more discussion).

### Option a) `.zattrs`

The status quo is that all metadata should be represented as JSON in `.zattrs`. The benefit is that no new mechanism needs to be introduced. A downside is that metadata is spread across multiple zgroups and zarrays (see related comments in <https://github.com/ome/ngff/issues/102>). Projects such as xarray store metadata in “well-known” keys within `.zattrs`, such as `_ARRAY_DIMENSIONS` ([docs](https://xarray.pydata.org/en/stable/internals/zarr-encoding-spec.html)).
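As a rough illustration of the “well-known key” approach, the sketch below builds an array-level `.zattrs` document in which xarray's `_ARRAY_DIMENSIONS` key sits next to a namespaced key for OME metadata. The key name `ome_metadata` and its contents are assumptions made for this example; nothing in the NGFF specification defines them.

```python
import json

# Array-level .zattrs content. "_ARRAY_DIMENSIONS" is the key xarray uses;
# "ome_metadata" is a hypothetical well-known key invented for this sketch.
zattrs = {
    "_ARRAY_DIMENSIONS": ["t", "c", "z", "y", "x"],
    "ome_metadata": {"Image": {"ID": "Image:0", "Name": "example"}},
}


def read_well_known(attrs, key):
    """Return the value stored under a well-known key, or None if absent."""
    return attrs.get(key)


# Round-trip through JSON, as a reader of the .zattrs file would.
serialized = json.dumps(zattrs, indent=2)
restored = json.loads(serialized)
image_name = read_well_known(restored, "ome_metadata")["Image"]["Name"]
print(image_name)
```

Readers that do not know the extra key can ignore it, which is what keeps this option within existing Zarr behavior.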

### Option b) Custom files

An alternative is to introduce new files outside the scope of the Zarr spec, which only defines `.zattrs`, `.zarray`, `.zgroup`, and chunk files. `bioformats2raw` currently stores metadata in a file named `METADATA.ome.xml`. Other projects like netcdf-c store custom files (e.g. `.nczarr`; [docs](https://docs.unidata.ucar.edu/netcdf-c/current/md_nczarr.html)) with their own proprietary customizations. The benefit of this strategy is maximum flexibility since no key conflicts can occur. Implementations may need to be aware that such files are essentially 1-dimensional byte arrays.
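A minimal sketch of the custom-file approach follows, using only the Python standard library. It writes an opaque byte payload next to a `.zgroup` file and reads it back; the helper names `write_sidecar` and `read_sidecar` are hypothetical, and the `OME/METADATA.ome.xml` path is the one mentioned under "Implementation reports" in this issue.

```python
import tempfile
from pathlib import Path


def write_sidecar(root: Path, relative_name: str, payload: bytes) -> Path:
    """Store an opaque byte payload inside a Zarr directory hierarchy."""
    target = root / relative_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_sidecar(root: Path, relative_name: str) -> bytes:
    """Read the payload back as a flat, 1-dimensional sequence of bytes."""
    return (root / relative_name).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "image.zarr"
    root.mkdir()
    # Minimal Zarr v2 group marker so the directory is a valid group.
    (root / ".zgroup").write_text('{"zarr_format": 2}')
    xml = b'<?xml version="1.0" encoding="UTF-8"?><OME/>'
    write_sidecar(root, "OME/METADATA.ome.xml", xml)
    round_tripped = read_sidecar(root, "OME/METADATA.ome.xml")
```

A Zarr library that only knows `.zattrs`, `.zarray`, `.zgroup`, and chunk files would not discover this file on its own, which is the trade-off this option carries.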

### Option c) Arrays

Metadata files can be encoded as Zarr arrays, which is similar to option b) but does not require introducing any new Zarr behavior. Additionally, the files themselves can carry metadata in their own `.zattrs` and be chunked. However, all tools that wish to consume them must be Zarr-aware.
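To make the idea concrete, the sketch below stores a byte payload as an uncompressed, one-dimensional `uint8` Zarr v2 array by writing the `.zarray` document and chunk files by hand. It reflects one reading of the Zarr v2 layout and is meant only as an illustration; real code would normally go through a Zarr library. The function names and the tiny chunk size are assumptions.

```python
import json
import tempfile
from pathlib import Path

CHUNK = 16  # bytes per chunk; deliberately tiny for illustration


def write_bytes_as_zarr_v2(path: Path, payload: bytes, chunk: int = CHUNK) -> None:
    """Write payload as an uncompressed 1-D uint8 Zarr v2 array."""
    path.mkdir(parents=True)
    meta = {
        "zarr_format": 2,
        "shape": [len(payload)],
        "chunks": [chunk],
        "dtype": "|u1",
        "compressor": None,
        "fill_value": 0,
        "order": "C",
        "filters": None,
    }
    (path / ".zarray").write_text(json.dumps(meta))
    for index, start in enumerate(range(0, len(payload), chunk)):
        piece = payload[start:start + chunk]
        # Edge chunks are stored at full chunk size, padded here with zeros.
        (path / str(index)).write_bytes(piece.ljust(chunk, b"\x00"))


def read_bytes_from_zarr_v2(path: Path) -> bytes:
    """Reassemble the payload, trimming the padding of the last chunk."""
    meta = json.loads((path / ".zarray").read_text())
    length, chunk = meta["shape"][0], meta["chunks"][0]
    count = -(-length // chunk)  # ceiling division
    data = b"".join((path / str(i)).read_bytes() for i in range(count))
    return data[:length]


with tempfile.TemporaryDirectory() as tmp:
    array_path = Path(tmp) / "image.zarr" / "OME" / "metadata"
    payload = b'<?xml version="1.0" encoding="UTF-8"?><OME/>'
    write_bytes_as_zarr_v2(array_path, payload)
    chunk_files = sorted(p.name for p in array_path.iterdir() if p.name != ".zarray")
    recovered = read_bytes_from_zarr_v2(array_path)
```

The array's shape records the true payload length, so a reader can discard the padding without any extra convention.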

### Option d) String

Metadata can be encoded as a single (albeit large) string within `.zattrs`. Depending on Design Decision \#2, storing a single string with the metadata has the advantage of working with existing formats as well as consolidating the metadata, but it does require escaping, etc.
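The escaping cost can be seen in a short sketch: an XML document is embedded as one JSON string value and recovered unchanged. The key name `ome_xml` is an assumption for this example.

```python
import json

ome_xml = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="example"/>
</OME>"""

# "ome_xml" is a hypothetical key name, not part of any specification.
attrs = {"ome_xml": ome_xml}

# json.dumps escapes the double quotes and newlines inside the XML,
# so the serialized form is harder to read than the XML itself.
encoded = json.dumps(attrs)
decoded = json.loads(encoded)["ome_xml"]
print(encoded)
```

The round trip is lossless, but anyone inspecting `.zattrs` by hand sees the escaped form rather than the original document.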

----

## Design decision \#2: Format of metadata

Similarly to \#1, currently all metadata is stored as JSON within `.zattrs`. 

### Option a) Design a JSON format

The option closest to the current NGFF process would consist of specifying a new JSON format to capture all of the information in OME-XML. This process would likely be lengthy, and the resulting format would need to be maintained for some time. One route to achieving it would be to generate a JSON Schema from the XSD using [ome-types](https://pypi.org/project/ome-types/).
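To give a feel for what a JSON rendering of OME-XML might look like, the sketch below converts a small OME-XML fragment to nested dictionaries with the standard library. It does not use ome-types and does not produce a schema; the mapping convention (attributes prefixed with `@`, children grouped into lists) is an assumption chosen for this example, not a proposed format.

```python
import json
import xml.etree.ElementTree as ET

OME_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="example">
    <Pixels ID="Pixels:0" DimensionOrder="XYZCT" Type="uint8"
            SizeX="4" SizeY="4" SizeZ="1" SizeC="1" SizeT="1"
            PhysicalSizeX="0.5"/>
  </Image>
</OME>"""


def strip_ns(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(element: ET.Element) -> dict:
    """Naive conversion: '@'-prefixed attributes, children grouped in lists."""
    node = {"@" + name: value for name, value in element.attrib.items()}
    for child in element:
        node.setdefault(strip_ns(child.tag), []).append(element_to_dict(child))
    text = (element.text or "").strip()
    if text:
        node["_text"] = text
    return node


as_json = element_to_dict(ET.fromstring(OME_XML))
physical_size_x = as_json["Image"][0]["Pixels"][0]["@PhysicalSizeX"]
print(json.dumps(as_json, indent=2))
```

Note that every value comes out as a string; a real JSON format would need typed values, which is part of the specification effort this option implies.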

### Option b) Use the JSON-LD syntax of OME-OWL

Using JSON-LD would keep the metadata in JSON but would make use of the existing work on [OME-OWL](https://gitlab.com/openmicroscopy/incubator/ome-owl/-/blob/master/ontology/owl/ome_core/ome_core_owl.json), and therefore not create another format that needs supporting. Additionally, the JSON-LD model provides an extensibility that is needed within the community. The downside is increased complexity in the programming model.
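The extensibility argument can be sketched with a small JSON-LD document. The keywords `@context`, `@vocab`, `@id`, and `@type` are standard JSON-LD; both vocabulary IRIs below are placeholders and do not correspond to the actual OME-OWL ontology. The `expand_term` helper is a deliberately simplified stand-in for JSON-LD term expansion, not the full algorithm.

```python
import json

# Both IRIs are placeholders invented for this sketch.
document = {
    "@context": {
        "@vocab": "https://example.org/ome#",
        "lab": "https://example.org/lab-extension#",
    },
    "@id": "https://example.org/data/image-0",
    "@type": "Image",
    "Name": "example",
    # A community extension lives alongside core terms without key conflicts.
    "lab:stainingProtocol": "protocol-42",
}


def expand_term(context: dict, term: str) -> str:
    """Simplified term expansion: handles a prefix or the default @vocab."""
    if ":" in term:
        prefix, local = term.split(":", 1)
        if prefix in context:
            return context[prefix] + local
        return term  # already an absolute IRI, or an unknown prefix
    return context["@vocab"] + term


context = document["@context"]
expanded = {
    expand_term(context, key): value
    for key, value in document.items()
    if not key.startswith("@")
}
print(json.dumps(expanded, indent=2))
```

The added complexity mentioned above shows up here: consumers must resolve terms against a context rather than reading plain keys.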

### Option c) Store the OME-XML directly.

Finally, if the first goal is to support the existing model, using the OME-XML model is likely the fastest route. Downsides include the general aversion felt towards XML as well as the need to map between XML elements/identities and objects specified within the JSON. There will also be no extensibility (beyond the standard annotations) in the first instance.
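The mapping problem between XML identities and JSON objects can be sketched as follows: the OME-XML is parsed with the standard library and indexed by `ID`, and a JSON-side table points from Zarr group paths to those IDs. The table `json_side` and the helper `image_name_for_group` are hypothetical; how such a link would actually be recorded is exactly what a specification would need to settle.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2016-06 OME schema.
NS = {"ome": "http://www.openmicroscopy.org/Schemas/OME/2016-06"}

OME_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="first"/>
  <Image ID="Image:1" Name="second"/>
</OME>"""

root = ET.fromstring(OME_XML)

# Index the XML side by identity so JSON objects can refer to it.
images_by_id = {image.get("ID"): image for image in root.findall("ome:Image", NS)}

# Hypothetical JSON side: Zarr group path -> OME-XML identity.
json_side = {"0": "Image:0", "1": "Image:1"}


def image_name_for_group(group_path: str) -> str:
    """Resolve a Zarr group to the Name of the OME-XML Image it refers to."""
    return images_by_id[json_side[group_path]].get("Name")


print(image_name_for_group("1"))
```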

----

## Implementation reports

Below we enumerate possible implementations and (eventually) the status of investigations into each of them. If anyone else is interested in proposing (or especially prototyping) an implementation, please mention so below.

### 1b2c: standardize the current bioformats2raw format

Standardizing the bioformats2raw output would require:

 - permitting multiple images in a single Zarr fileset which would likely be initially achieved by adopting the OME-XML collections as opposed to those discussed in https://github.com/ome/ngff/issues/31
 - reading the OME-XML from the well-known location (OME/METADATA.ome.xml) or potentially adding metadata to find that metadata
 - deciding how duplicated metadata fields are to be handled (and/or synchronized)

An additional benefit of this implementation is that the current `bioformats2raw` code can be adopted as the official .
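The question of duplicated metadata can be illustrated with a sketch that compares the physical pixel size recorded in OME-XML with the `scale` transformation of a 0.4-style `multiscales` entry. It only reports disagreements and does not decide which source wins. The function names are hypothetical, and both sides are assumed to use the same unit (micrometers), so no unit conversion is attempted.

```python
import math
import xml.etree.ElementTree as ET

NS = {"ome": "http://www.openmicroscopy.org/Schemas/OME/2016-06"}

OME_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0">
    <Pixels ID="Pixels:0" PhysicalSizeX="0.5" PhysicalSizeY="0.5"/>
  </Image>
</OME>"""

# Abbreviated 0.4-style multiscales entry; values chosen to disagree on x.
MULTISCALE = {
    "axes": [{"name": "c"}, {"name": "y"}, {"name": "x"}],
    "datasets": [
        {
            "path": "0",
            "coordinateTransformations": [
                {"type": "scale", "scale": [1.0, 0.5, 0.65]}
            ],
        }
    ],
}


def ome_pixel_size(xml_text: str) -> dict:
    """Physical pixel sizes of the first Image, keyed by axis name."""
    pixels = ET.fromstring(xml_text).find("ome:Image/ome:Pixels", NS)
    sizes = {}
    for axis in ("x", "y", "z"):
        value = pixels.get("PhysicalSize" + axis.upper())
        if value is not None:
            sizes[axis] = float(value)
    return sizes


def ngff_pixel_size(multiscale: dict) -> dict:
    """Per-axis scale of the full-resolution dataset, keyed by axis name."""
    names = [axis["name"] for axis in multiscale["axes"]]
    transforms = multiscale["datasets"][0]["coordinateTransformations"]
    scale = next(t["scale"] for t in transforms if t["type"] == "scale")
    return dict(zip(names, scale))


def find_conflicts(ome: dict, ngff: dict, rel_tol: float = 1e-6) -> list:
    """Axes present in both sources whose values disagree."""
    return sorted(
        axis
        for axis in ome
        if axis in ngff and not math.isclose(ome[axis], ngff[axis], rel_tol=rel_tol)
    )


conflicts = find_conflicts(ome_pixel_size(OME_XML), ngff_pixel_size(MULTISCALE))
print(conflicts)
```

Whether a reader should prefer one source, raise an error, or merely warn on such a conflict is a policy decision for the specification.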

*Related issues:*
- [x] https://github.com/glencoesoftware/bioformats2raw/issues/126
- [ ] https://github.com/glencoesoftware/raw2ometiff/pull/71
