Description
This issue captures the requirements as well as the possible implementation choices for a first integration of OME metadata into the OME-NGFF container. The UoD team is targeting the specification as well as implementations, including within OMERO, by mid-2022.
Goal
NGFF specifications up to version 0.4 contain only a minimal set of the metadata fields defined in the existing OME model (e.g. physical pixel size). As a result, converting data into OME-NGFF, e.g. with bioformats2raw, loses more metadata than the equivalent conversion to OME-TIFF. The goal of this issue is to achieve parity between the two formats in terms of capturing the metadata contained in the OME model (2016-06).
Here we would like to discuss, plan, and specify an initial integration of the OME model into OME-NGFF. As with other specifications, this initial work will likely be followed by multiple, possibly breaking, changes to expand the scope. Where possible, we will also try to capture that roadmap here.
In-scope requirements
- The primary requirement is the ability to convert an OME-TIFF in its entirety into an OME-NGFF dataset.
- As a corollary, it should also be possible to capture what Bio-Formats knows about proprietary file formats (PFFs). At a minimum, this should cover the intermediate Zarr output of bioformats2raw and be readable by all readers (not just raw2ometiff).
- The specification should make clear how the combination of the OME model metadata and NGFF metadata is to be interpreted and what readers should do in the case of a conflict.
Out-of-scope requirements
- This need not be the final mechanism used for OME-NGFF to store metadata. It is more important to capture the metadata that exists today.
- It is not necessary, at least initially, that an OME-NGFF be fully convertible back into an OME-TIFF since, e.g., there is no location for labels, transformations, or file annotations in OME-TIFF.
Design decision #1: Location of metadata
The current NGFF specification uses only the format's custom attributes (.zattrs) for storing metadata. Several other locations are conceivable, though some fall more clearly within the bounds of the Zarr specification than others. (See zarr-developers/zarr-specs#112 for more discussion.)
Option a) .zattrs
The status quo at the moment is that all metadata should be represented as JSON in .zattrs. The benefit is that no new mechanism needs to be introduced. A downside is that metadata is spread across multiple zgroups and zarrays (See related comments in #102). Projects such as xarray store metadata in “well-known” keys within the .zattrs like _ARRAY_DIMENSIONS (docs).
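As a rough illustration of option a), the sketch below builds the attributes for one array as plain JSON. The _ARRAY_DIMENSIONS key is the xarray convention mentioned above; the "ome" key, its contents, and the merge_attrs helper are hypothetical and only show how a shared key namespace makes collisions possible.

```python
import json

# Attributes destined for an array's .zattrs file.
# "_ARRAY_DIMENSIONS" is the well-known key used by xarray.
# The "ome" key and its fields are hypothetical, for illustration only.
array_attrs = {
    "_ARRAY_DIMENSIONS": ["t", "c", "z", "y", "x"],
    "ome": {"physical_size_x": 0.65, "physical_size_x_unit": "micrometer"},
}


def merge_attrs(existing: dict, new: dict) -> dict:
    """Merge new top-level keys into existing attributes without overwriting."""
    clashes = set(existing) & set(new)
    if clashes:
        raise KeyError(f"attribute keys already in use: {sorted(clashes)}")
    return {**existing, **new}


# All writers share one JSON object, so every key must be coordinated.
merged = merge_attrs(array_attrs, {"my_tool": {"note": "extra metadata"}})
zattrs_text = json.dumps(merged, indent=2)

try:
    merge_attrs(array_attrs, {"ome": {}})
    clash_detected = False
except KeyError:
    clash_detected = True
```

Because every group and array has its own .zattrs, a reader would need to repeat this lookup at each level of the hierarchy to assemble the full metadata.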
Option b) Custom files
An alternative is to introduce new files outside the scope of the Zarr spec, which only defines .zattrs, .zarray, .zgroup, and chunk files. bioformats2raw currently stores metadata in a file named METADATA.ome.xml. Other projects like netcdf-c store custom files (e.g. .nczarr; docs) with their own proprietary customizations. The benefit of this strategy is maximum flexibility since no key conflicts can occur. Implementations may need to be aware that such files are essentially 1-dimensional byte arrays.
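A minimal sketch of option b) on a directory store, using only the Python standard library: a metadata file is written next to the group's .zgroup and read back as raw bytes. The placement of the file directly in the group directory, the placeholder XML, and both helper functions are assumptions made for illustration, not the bioformats2raw implementation.

```python
import json
import tempfile
from pathlib import Path

# Placeholder content; a real file would hold a complete OME-XML document.
PLACEHOLDER_XML = b"<OME/>"


def write_group_with_sidecar(root: Path, ome_xml: bytes) -> Path:
    """Create a Zarr v2 group directory and store an XML file beside its metadata."""
    root.mkdir(parents=True, exist_ok=True)
    (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    sidecar = root / "METADATA.ome.xml"
    sidecar.write_bytes(ome_xml)  # opaque to Zarr: just a 1-dimensional byte sequence
    return sidecar


def read_sidecar(root: Path) -> bytes:
    """Read the custom file directly; no Zarr library is involved."""
    return (root / "METADATA.ome.xml").read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    group_dir = Path(tmp) / "image.zarr"
    sidecar_path = write_group_with_sidecar(group_dir, PLACEHOLDER_XML)
    sidecar_name = sidecar_path.name
    round_tripped = read_sidecar(group_dir)
```

On a non-filesystem store (e.g. object storage) the same idea applies, but the reader has to fetch the key itself since a Zarr library will not know about it.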
Option c) Arrays
Metadata files can be encoded as Zarr arrays, which is similar to option b) but does not require introducing any new Zarr behavior. Additionally, the files themselves can carry metadata in their own .zattrs and be chunked. However, all tools that wish to consume them must be Zarr-aware.
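To make option c) concrete, the sketch below stores a metadata document as an uncompressed, single-chunk Zarr v2 array of unsigned bytes, written by hand with the standard library. The "media_type" attribute and both helper functions are hypothetical; a real implementation would more likely use a Zarr library and might chunk or compress the data.

```python
import json
import tempfile
from pathlib import Path


def write_bytes_as_zarr_array(array_dir: Path, payload: bytes) -> None:
    """Store payload as a 1-D uint8 Zarr v2 array with a single raw chunk."""
    if not payload:
        raise ValueError("payload must not be empty")
    array_dir.mkdir(parents=True, exist_ok=True)
    zarray = {
        "zarr_format": 2,
        "shape": [len(payload)],
        "chunks": [len(payload)],  # one chunk holds the whole document
        "dtype": "|u1",
        "compressor": None,        # chunk file contains the raw bytes
        "fill_value": 0,
        "filters": None,
        "order": "C",
    }
    (array_dir / ".zarray").write_text(json.dumps(zarray))
    # The array can carry its own attributes; this key is hypothetical.
    (array_dir / ".zattrs").write_text(json.dumps({"media_type": "application/xml"}))
    (array_dir / "0").write_bytes(payload)  # chunk key for the first (only) chunk


def read_bytes_from_zarr_array(array_dir: Path) -> bytes:
    """Read the single chunk back, trimmed to the declared shape."""
    zarray = json.loads((array_dir / ".zarray").read_text())
    return (array_dir / "0").read_bytes()[: zarray["shape"][0]]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "image.zarr" / "OME" / "metadata"
    write_bytes_as_zarr_array(target, b"<OME/>")
    declared_shape = json.loads((target / ".zarray").read_text())["shape"]
    restored = read_bytes_from_zarr_array(target)
```

The trade-off noted above is visible here: a generic XML tool cannot open this directory directly, since it first has to interpret .zarray to find and decode the chunk.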
Option d) String
Metadata can be encoded as a single (albeit large) string within .zattrs. Depending on Design Decision #2, storing a single string with the metadata has the advantage of working with existing formats as well as consolidating the metadata, but it does require escaping, etc.
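The escaping cost of option d) can be seen in a short sketch that embeds an abbreviated OME-XML fragment as a JSON string. The "ome_xml" key is hypothetical; the XML namespace is that of the 2016-06 schema.

```python
import json

# Abbreviated OME-XML fragment, not a complete or validated document.
ome_xml = (
    '<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">\n'
    '  <Image ID="Image:0" Name="a &amp; b"/>\n'
    "</OME>"
)

# Hypothetical key; the whole document becomes one JSON string value.
zattrs = {"ome_xml": ome_xml}
encoded = json.dumps(zattrs)

# Double quotes and newlines are escaped in the JSON text,
# while XML entities such as &amp; pass through untouched.
decoded = json.loads(encoded)["ome_xml"]
```

The round trip is lossless, but the stored text is no longer directly readable as XML, and consumers must decode JSON before parsing the XML.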
Design decision #2: Format of metadata
As with Design decision #1, the status quo is that all metadata is stored as JSON within .zattrs.
Option a) Design a JSON format
The option closest to the current NGFF process would consist of specifying a new JSON format to capture all of the information in OME-XML. This process would likely be lengthy, and the resulting format would need to be maintained for some time. One route to achieving it would be to generate a JSON Schema from the XSD using ome-types.
Option b) Use the JSON-LD syntax of OME-OWL
Using JSON-LD would keep the metadata in JSON but would make use of the existing work on OME-OWL, and would therefore not create another format that needs supporting. Additionally, the JSON-LD model provides the extensibility that the community needs. The downside is increased complexity in the programming model.
Option c) Store the OME-XML directly
Finally, if the first goal is to support the existing model, using the OME-XML model is likely the fastest route. Downsides include the general aversion felt towards XML as well as the need to map between XML elements/identities and objects specified within the JSON. There will also be no extensibility (beyond the standard annotations) in the first instance.
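The mapping problem can be sketched as follows: physical pixel sizes are read from an abbreviated OME-XML fragment and compared with the scale transformation of an NGFF 0.4 multiscales entry. The attribute and key names follow the OME 2016-06 schema and the NGFF 0.4 specification; the sample values, the helper functions, and the assumption that both sides use the same unit are illustrative only. What a reader should do once a conflict is found is exactly what the specification would need to define.

```python
import math
import xml.etree.ElementTree as ET

NS = {"ome": "http://www.openmicroscopy.org/Schemas/OME/2016-06"}

# Abbreviated fragment with made-up values; not a complete OME-XML document.
OME_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0">
    <Pixels ID="Pixels:0" DimensionOrder="XYZCT" Type="uint8"
            SizeX="512" SizeY="512" SizeZ="1" SizeC="1" SizeT="1"
            PhysicalSizeX="0.65" PhysicalSizeY="0.65"/>
  </Image>
</OME>"""

# One NGFF 0.4 multiscales entry; the x scale deliberately disagrees with the XML.
MULTISCALE = {
    "version": "0.4",
    "axes": [
        {"name": "y", "type": "space", "unit": "micrometer"},
        {"name": "x", "type": "space", "unit": "micrometer"},
    ],
    "datasets": [
        {
            "path": "0",
            "coordinateTransformations": [{"type": "scale", "scale": [0.65, 0.5]}],
        }
    ],
}


def xml_pixel_sizes(xml_text: str, image_id: str) -> dict:
    """Return {axis: size} from the Pixels element of the given Image ID."""
    root = ET.fromstring(xml_text)
    for image in root.findall("ome:Image", NS):
        if image.get("ID") == image_id:
            pixels = image.find("ome:Pixels", NS)
            sizes = {}
            for axis in ("x", "y", "z"):
                value = pixels.get(f"PhysicalSize{axis.upper()}")
                if value is not None:
                    sizes[axis] = float(value)
            return sizes
    raise KeyError(image_id)


def ngff_pixel_sizes(multiscale: dict) -> dict:
    """Return {axis: scale} for the first (full-resolution) dataset."""
    names = [axis["name"] for axis in multiscale["axes"]]
    transforms = multiscale["datasets"][0]["coordinateTransformations"]
    scale = next(t["scale"] for t in transforms if t["type"] == "scale")
    return dict(zip(names, scale))


def find_conflicts(xml_sizes: dict, ngff_sizes: dict, rel_tol: float = 1e-6) -> list:
    """List axes present on both sides whose values disagree (same unit assumed)."""
    shared = xml_sizes.keys() & ngff_sizes.keys()
    return sorted(
        axis for axis in shared
        if not math.isclose(xml_sizes[axis], ngff_sizes[axis], rel_tol=rel_tol)
    )


conflicts = find_conflicts(xml_pixel_sizes(OME_XML, "Image:0"), ngff_pixel_sizes(MULTISCALE))
```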
Implementation reports
Below we enumerate possible implementations and (eventually) the status of investigations into each of them. If anyone else is interested in proposing (or especially prototyping) an implementation, please say so below.
1b2c: standardize the current bioformats2raw format
Standardizing the bioformats2raw output would require:
- permitting multiple images in a single Zarr fileset, which would likely be achieved initially by adopting the OME-XML collections rather than those discussed in Collections Specification #31
- reading the OME-XML from the well-known location (OME/METADATA.ome.xml), or potentially adding metadata that tells readers where to find it
- deciding how duplicated metadata fields are to be handled (and/or synchronized)
An additional benefit of this implementation is that the current bioformats2raw code can be adopted as the official .
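A reader for this layout might start along the following lines: look for the OME-XML at the well-known location and enumerate the Image elements it describes. The example XML, the list_images helper, and the assumption that the i-th Image element corresponds to the group named by its index are illustrative and would need to be confirmed by the specification.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

NS = {"ome": "http://www.openmicroscopy.org/Schemas/OME/2016-06"}

# Abbreviated fragment with two images; not a complete OME-XML document.
EXAMPLE_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="first"/>
  <Image ID="Image:1" Name="second"/>
</OME>"""


def list_images(fileset: Path) -> list:
    """Return (group name, Image ID, Image Name) for each Image in the OME-XML."""
    xml_path = fileset / "OME" / "METADATA.ome.xml"  # well-known location
    if not xml_path.is_file():
        raise FileNotFoundError(f"no OME-XML found at {xml_path}")
    root = ET.fromstring(xml_path.read_text(encoding="utf-8"))
    # Assumption: Image elements are listed in the same order as the numbered groups.
    return [
        (str(index), image.get("ID"), image.get("Name"))
        for index, image in enumerate(root.findall("ome:Image", NS))
    ]


with tempfile.TemporaryDirectory() as tmp:
    fileset = Path(tmp) / "example.zarr"
    (fileset / "OME").mkdir(parents=True)
    (fileset / "OME" / "METADATA.ome.xml").write_text(EXAMPLE_XML, encoding="utf-8")
    images = list_images(fileset)

    try:
        list_images(Path(tmp) / "missing.zarr")
        missing_reported = False
    except FileNotFoundError:
        missing_reported = True
```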
Related issues: