rfc/9/comments/4/index.md (41 additions, 0 deletions)

# RFC-9: Comment 4

(rfcs:rfc9:comment4)=

## Comment authors

This comment was written by: Lenard Spiecker<sup>1</sup> and Matthias Grunwald<sup>1</sup>

<sup>1</sup> Miltenyi Biotec B.V. & Co. KG

## Conflicts of interest

None.

## Summary

We support standardizing single-file OME-Zarr via ZIP (.ozx). In our context at Miltenyi Biotec — involving large 3D volumes, sometimes isolated instruments, and the use of portable drives — a single file improves the user experience compared to directory-backed Zarr stores. We have begun a C++ implementation for zipped OME-Zarr writing and reading. However, we noticed some challenging details when using ZIP as a single-file store compared to other Zarr stores. Our primary goal was to ensure high-throughput writing during acquisition. In addition, we wanted to enable update and append operations, such as when metadata needs to be changed or a label is added. Overall, RFC-9 aligns with these goals and should improve interoperability and adoption.

## Minor comments and questions

- **Sharding constraints:** Although sharding within a single file can feel somewhat counterintuitive, it helps reduce the size of the central directory and eases compatibility and conversion between different storage backends. However, it must be noted that partial writes are impractical unless the final size of each shard is known in advance. As a result, it is often not advisable to shard along an axis that is acquired sequentially. (E.g., a Z-axis that is acquired slice-by-slice cannot be written chunk-by-chunk when sharded along the Z-axis, unless the codec pipeline produces a fixed size or the whole slice constitutes a single shard; see the sketch below.) This point could be added under the drawbacks section of the RFC. Whether sharding SHOULD be recommended depends, of course, on the chunk size, the expected number of chunks, and the codec pipeline.

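A minimal sketch of one way to respect this constraint, assuming zarr-python ≥ 3 (the `shards` argument of `zarr.create_array` and `zarr.storage.ZipStore`); the file name, array shape, and chunk/shard sizes are illustrative only:

```python
# Sketch: keep the shard extent along the sequentially acquired Z axis equal to
# the chunk extent along Z, so every acquired slab completes its shards
# immediately and no shard stays open with an unknown final size.
import numpy as np
import zarr

store = zarr.storage.ZipStore("volume.ozx", mode="w")  # illustrative name

arr = zarr.create_array(
    store=store,
    name="0",
    shape=(4096, 2048, 2048),   # (z, y, x)
    chunks=(64, 256, 256),
    shards=(64, 1024, 1024),    # Z extent matches the chunk, so shards never span unacquired slices
    dtype="uint16",
)

# Acquisition loop: each 64-slice slab fills whole shards and can be written out.
for z0 in range(0, arr.shape[0], 64):
    slab = np.zeros((64, 2048, 2048), dtype="uint16")  # stand-in for acquired data
    arr[z0 : z0 + 64] = slab

store.close()
```
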
- **CRC/hash requirement:** ZIP requires a CRC-32 for each file entry, which is useful for integrity verification but burdens implementations, especially for partial writes and reads. With sharding, recomputing CRCs for sub-ranges or appends is tricky; clarifying recommended strategies (e.g., validating at shard or chunk granularity and deferring CRC checks for in-flight writes; see the sketch below) would help implementers. Implementers should also note that x86_64 offers no SIMD instruction for the CRC-32 polynomial used by ZIP (SSE4.2 only accelerates CRC-32C). This point could be added under the drawbacks section of the RFC.

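As an illustration of the deferred-CRC strategy mentioned above, a writer can keep a running CRC-32 per entry and only record it once the shard is finalized. This is a sketch using Python's standard `zlib.crc32` (same polynomial as ZIP), with placeholder chunk bytes:

```python
# Sketch: maintain the CRC-32 of a shard entry incrementally while chunks are
# appended, and record it only when the entry is finalized (data descriptor or
# central directory), instead of re-reading the whole shard.
import zlib

class RunningCrc32:
    def __init__(self) -> None:
        self.value = 0  # CRC-32 of zero bytes

    def update(self, encoded_chunk: bytes) -> None:
        # Fold each encoded chunk into the running CRC as it is written.
        self.value = zlib.crc32(encoded_chunk, self.value)

crc = RunningCrc32()
for chunk in (b"encoded chunk 0", b"encoded chunk 1"):  # stand-ins for codec output
    crc.update(chunk)

print(f"CRC-32 to record for this entry: {crc.value:#010x}")
```
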
- **Ordering of zarr.json first:** While placing the root and all other `zarr.json` files at the beginning of the archive potentially aids discovery and streaming access, practical implementations may still read the ZIP comment together with the central directory first (see the reading sketch after this thread). The main reason is that the first `zarr.json` can become obsolete, rendering streaming access inefficient compared to seeking. We also observed that strict file ordering cannot be maintained when appending a new `zarr.json` (e.g., adding labels) to an existing .ozx file. Furthermore, we encounter cases where metadata is generated during acquisition; therefore, we lean toward writing data first and metadata second to avoid writing it twice. For the stated reasons, we will likely not produce .ozx files with `zarr.json` files ordered first.
@mkitti (Member), Jan 15, 2026:

Other than for streaming applications, the ordering of files in the archive is perhaps secondary to the ordering of file listings in the central directory.

One particular concern that I have is the scattering of metadata, the zarr.json files, across the archive. If they were consolidated such that a single byte-range request could obtain them, that would be helpful.

My priorities here in order of decreasing importance are thus:

  1. Listing of zarr.json files first in the central directory.
  2. Consolidation of zarr.json files in the archive
  3. The location of the root zarr.json at the beginning of the archive.

The acquisition case is interesting. My initial expectation would be for the zarr array to be saved outside of the zip archive, with the zip archive constructed after the end of acquisition. However, I do see the appeal of acquiring directly into a single file.

The main case for acquisition into a single file or a series of large files I considered would be acquiring into a Zarr shard. A simpler Zarr archive would perhaps focus on a single array and how to pack the zarr.json file into the shard.

Directly acquiring into a zip file deserves more consideration and is likely to see more applications than streaming the archive from beginning to end. In fact, it seems like streaming the archive in reverse, with the last parts of the file being sent first, would have advantages here.

Author:

Yes, it might be better to place it at the end of the archive if readers seek to the central directory header (CDH) anyway.

Unfortunately, consolidation inside the ZIP archive comment might also not be a good idea (due to the 65,535-byte limit and the O(N) search), but it would reduce the number of requests.


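To make the access pattern discussed above concrete, a reader that relies on the central directory rather than on file ordering could look roughly like this sketch, using Python's standard `zipfile` module (the archive name is illustrative):

```python
# Sketch: zipfile parses the central directory (located at the end of the
# archive) once, so the physical position of zarr.json entries inside the
# archive does not affect this access pattern.
import json
import zipfile

with zipfile.ZipFile("volume.ozx") as zf:
    metadata_names = [n for n in zf.namelist() if n.rsplit("/", 1)[-1] == "zarr.json"]
    root_meta = json.loads(zf.read("zarr.json"))  # root metadata, wherever it is stored
    print(metadata_names)
    print(root_meta.get("node_type"))
```
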
- **Ordering of zarr.json in central directory:** This is reasonable for discoverability, especially for the root `zarr.json` if consolidated metadata is present. For all other `zarr.json` files, and for a root `zarr.json` without consolidated metadata, this seems less relevant for us and depends on the number of file entries. Therefore, in our use case, we might omit it for now and introduce it later if needed.
@mkitti (Member), Jan 15, 2026:

Consolidated metadata has not been specified by either Zarr or OME-Zarr, so this RFC cannot rely on its existence. The listing of other zarr.json files at the beginning of the central directory is a consolidation attempt within the scope of this RFC that does not rely on unspecified extensions. However, the emphasis on the root zarr.json being discoverable does anticipate that consolidated metadata or similar mechanisms could be used to quickly discern the structure of the archive.

Due to its location at the end, my expectation is that the central directory could reasonably be rewritten to meet sorting requirements if needed.

Author:

Thanks for the clarification.

Yes, it's just a partition coupled with a sort on the first part (see the sketch below). It should be fast, as path lengths and the number of zarr.json files are small.


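A minimal sketch of that partition-plus-sort, operating on the list of entry names an implementation would emit into the central directory; the exact ordering rule (root `zarr.json` first, then other metadata, then data entries) is one reasonable choice, and how the central directory bytes are actually rewritten is implementation-specific:

```python
# Sketch: partition entry names into metadata (zarr.json) and data, sort only
# the metadata part, and keep data entries in their original order.
def central_directory_order(entry_names: list[str]) -> list[str]:
    def is_metadata(name: str) -> bool:
        return name.rsplit("/", 1)[-1] == "zarr.json"

    metadata = sorted(
        (n for n in entry_names if is_metadata(n)),
        key=lambda n: (n != "zarr.json", n),  # root zarr.json first, then by path
    )
    data = [n for n in entry_names if not is_metadata(n)]
    return metadata + data

print(central_directory_order(["0/c/0/0/0", "zarr.json", "labels/zarr.json", "0/zarr.json"]))
# ['zarr.json', '0/zarr.json', 'labels/zarr.json', '0/c/0/0/0']
```
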
- **ZIP disadvantage when updating:** In our application, we noticed that the non-destructive design of ZIP does not allow updating existing values in place. (We observed the current `zarr-python` implementation writing `zarr.json` multiple times.) In our implementation, we allowed in-place updates as long as the size does not grow beyond the existing space. As an example, we added capacity (padding) to allow in-place updates of metadata, similar to `tiffcomment` or `tiffset` on TIFF files (see the padding sketch after this thread). We think that many ZIP implementations do not support in-place updates. The RFC already mentions adding and expanding files as a drawback; we just wanted to mention how this could be mitigated in certain cases.
Member:

I concur that zip implementations could leave extra space (capacity, as you state) between file entries to allow for growth. I also observe that this is not common practice in current zip implementations or libraries.


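A minimal sketch of the padding strategy described in the bullet above, assuming metadata entries are stored uncompressed (STORED) so a smaller-or-equal replacement can overwrite the same byte range; the capacity value is illustrative, and the actual in-place rewrite of the entry (including updating its recorded CRC-32 and sizes) is left to the implementation:

```python
# Sketch: serialize zarr.json into a fixed-capacity slot. JSON tolerates
# trailing whitespace, so padding with spaces keeps the payload valid while
# reserving room for future, larger versions of the metadata.
import json

METADATA_CAPACITY = 4096  # reserved bytes per zarr.json entry (illustrative)

def pad_metadata(document: dict) -> bytes:
    raw = json.dumps(document, indent=2).encode("utf-8")
    if len(raw) > METADATA_CAPACITY:
        raise ValueError("metadata exceeds reserved capacity; the archive must be rewritten")
    return raw + b" " * (METADATA_CAPACITY - len(raw))

payload = pad_metadata({"zarr_format": 3, "node_type": "group", "attributes": {}})
assert len(payload) == METADATA_CAPACITY
```
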
- **ZIP disadvantage in performance:** Compared to a directory store, file content is not necessarily stored page-aligned. In our implementation, we observed a significant performance impact for both reading and writing when using unbuffered, page-aligned I/O. To avoid read-modify-write cycles, we allocated a separate page for each local file header and kept partially filled pages empty. We also ensured this for chunks inside shards as well as for the shard index. Unfortunately, due to the local file header, this results in file-size overhead, though this is acceptable when sharding is turned on and the chunk size is not too small (see the alignment sketch after this thread). This point could be added under the drawbacks section of the RFC.
Member:

To be clear, is the drawback that the zip local file header makes page alignment more difficult?

@l-spiecker (Author), Jan 15, 2026:

Apologies, this was unclear:

In our implementation, the local file header has a fixed size of 30 bytes + 20 bytes of extra fields + the filename length. If you spend a whole page (4096 bytes) on each local file header, you get overhead in file size. For a 5000³ px image with a 64³ chunk size and no sharding, you have about 512,000 chunks, which results in roughly 2 GB of local file header padding. As said, this can be mitigated by sharding, a bigger chunk size, or simply a read-modify-write when writing the LFH or when writing the data next to it.

Maybe this is not a real drawback. Our read/write implementation just tries to avoid any memcpy on full chunk reads. In most other cases you need a memcpy anyway, e.g. for compression or chunk joining.


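For reference, the alignment arithmetic behind this trade-off is simple. This sketch computes how much padding is needed so that an entry's data begins on a page boundary, whether that padding is carried in the extra field (zipalign-style) or in a dedicated header page as described above; the 20-byte extra field is the value from the reply above, and the file name is illustrative:

```python
# Sketch: given the offset where a local file header will start, compute the
# padding required so the entry's data (the chunk bytes) starts page-aligned.
PAGE = 4096
LFH_FIXED = 30  # fixed portion of a ZIP local file header, in bytes

def data_alignment_padding(header_offset: int, filename: str, extra_len: int = 20) -> int:
    data_start = header_offset + LFH_FIXED + len(filename.encode("utf-8")) + extra_len
    return (-data_start) % PAGE  # extra bytes needed so chunk data is page-aligned

print(data_alignment_padding(0, "0/c/0/0/0"))  # 4037 for the very first entry
```
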
- **Split archives:** Field realities sometimes require multi-volume transport. Although splitting (e.g., channels or a measurement series) into smaller datasets is often possible — and recommended for other non-splittable file formats like .czi and .ims — we see use cases where archive-level splitting would be beneficial, particularly from a user-experience perspective. However, we acknowledge that this adds complexity to implementations, and support this decision.
Member:

While I agree that zip archives may need to be split for transport, it is not clear to me that implementations would need to address the archive while in split form. Are there situations where an archive could not be reassembled before being accessed?

Author:

There might be no such situation. But from a user-experience perspective, implicit reassembly might be preferred over an explicit step.

I think there are use cases for transport and/or use cases for storage.

One example we thought of: like a camera or recorder having two SD cards to hot-swap, a microscope could also have two portable drives to hot-swap.

In general, file-size limits are everywhere, but so are workarounds. For example, a ChatGPT-generated list of typical defaults:

  • Email attachments: ~10–25 MB
  • Messengers (WhatsApp, Telegram, Slack): ~1–4 GB
  • Web uploads (PHP / backend): ~2–50 MB (defaults often much lower)
  • APIs (REST / GraphQL): ~1–10 MB per request
  • Reverse proxies / load balancers: ~1–100 MB
  • Cloud storage (Drive, OneDrive, Dropbox): ~100 GB to multiple TB
  • File transfer services (e.g. WeTransfer): ~2–20 GB
  • USB / SD with FAT32: 4 GB per file
  • USB / SD with exFAT or NTFS: practically unlimited
  • Filesystems (general): 4 GB (FAT32) → TB/EB (modern)
  • Databases (per field / packet): ~10 MB – 1 GB
  • Docker images / build artifacts: ~100 MB to multiple GB

Of course, splitting can be mitigated by sharding together with the optional use of "." as the chunk key encoding separator. But this might produce more files and variable file sizes depending on sharding and compression.

Overall, we see this as very low priority, and it should not slow down this RFC. Many ZIP libraries might also not support the PKWARE split/spanned ZIP standard.

Contributor:

zarr already defines 2 ways to partition data -- by using separate arrays, and by choosing an appropriate chunking scheme for a given array. We should be sure we have exhausted these two schemes before introducing yet another one.


- **Thumbnails:** Applications might benefit from pre-rendered thumbnails. As there is no standardized way to store thumbnails for Zarr or OME-Zarr, it is an open question whether this should be addressed by zipped OME-Zarr separately or whether it is out of scope for this RFC. As an example, many ZIP-based formats (e.g., docx, 3mf) follow the Open Packaging Conventions to store thumbnails in a standardized way.
Member:

Thumbnails should be addressed in OME-Zarr more generally.

Compatibility with the Open Packaging Conventions is an interesting idea, although I am somewhat reluctant to introduce an XML standard into the base specification.


In general, the specification could ultimately recommend using dedicated implementations over generic ZIP writers, so that end users can create compatible .ozx files and avoid interoperability issues.
Member:

The Zarr tradition has been to remain implementation-agnostic, but I do anticipate that, as implementers gain experience as you have, there will be a need to share implementation details.

Interoperability and validation are important considerations, but these are community efforts beyond the scope of the RFC.


## Recommendation

Accept.