
Msv2 dask lazy read#565

Draft
r-xue wants to merge 2 commits into casangi:main from r-xue:msv2-dask-lazy-read

Conversation

@r-xue (Collaborator) commented Apr 3, 2026

While Zarr-backed lazy reading after a one-time conversion cost is certainly the right performance choice, there are still scientific use cases where lazily opening an MSv2 directly, without converting, is preferable, even if it is somewhat slower:

  • Interactive quick inspection
  • Simple number crunching in memory-constrained environments
  • Streaming MSv2 to Zarr with on-the-fly manipulation (e.g. calibration application)
  • ...

This is a draft proposal and still a work in progress; certain ideas are borrowed from xarray-ms. I vaguely remember there was earlier support for this that was later dropped? If this doesn't fit the library design, I could also spin it off into a separate repo.

On the other hand, I am not sure about the plan, or what has already been done, on profiling and adding the arcae backend. Some elements of that proposal could be relevant here.

r-xue added 2 commits April 3, 2026 00:26
Add `open_msv2()` and supporting infrastructure for lazily opening a
CASA MSv2 as an MSv4-schema `xr.DataTree` backed by dask arrays,
without triggering a full msv2→msv4 conversion.

- `open_msv2.py`: new entry point; builds partitions lazily via
  `_build_partition_lazy()`; adds a TTL-based partition cache to avoid
  re-scanning the MS on repeated calls; hardens OBSERVATION subtable and
  FIELD/SOURCE table reads with graceful fallbacks.
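The TTL-based partition cache mentioned above could look roughly like the following sketch. This is an illustration only, not the actual code in `open_msv2.py`; the cache structure, the 300-second TTL, and the `get_cached_partitions`/`scanner` names are all assumptions.

```python
import time

# Hypothetical module-level cache: ms_path -> {"stamp": ..., "partitions": ...}
_PARTITION_CACHE: dict = {}
_TTL_SECONDS = 300.0  # assumed time-to-live before a re-scan is forced


def get_cached_partitions(ms_path, scanner):
    """Return partition descriptions for ms_path, re-running the
    expensive MS scan only when the cached entry has expired."""
    now = time.monotonic()
    entry = _PARTITION_CACHE.get(ms_path)
    if entry is not None and now - entry["stamp"] < _TTL_SECONDS:
        return entry["partitions"]
    partitions = scanner(ms_path)  # expensive: walks the MS main table
    _PARTITION_CACHE[ms_path] = {"stamp": now, "partitions": partitions}
    return partitions
```

Keying on the path and timestamping with `time.monotonic()` keeps repeated `open_msv2()` calls cheap while still picking up on-disk changes after the TTL elapses.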

- `read.py`: add `read_col_conversion_dask_sparse()` and its per-chunk
  helper `_load_col_chunk_sparse()` for MSv2 files where not every time
  step contains every baseline (antenna dropouts, flagged-row removal,
  etc.). Uses `np.bincount` + cumulative row offsets to correctly map
  rows to `(time, baseline)` slots without assuming a constant stride.
  Thread-safety: move `SerializableLock` import to module level and pass
  the lock through to `load_col_chunk`.
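The `np.bincount` + cumulative-offset idea can be sketched as follows. This is a toy illustration of the mapping, not the real `_load_col_chunk_sparse()`; the `scatter_rows` helper and the NaN fill value are assumptions.

```python
import numpy as np

def scatter_rows(time_idx, baseline_idx, values, n_time, n_baseline):
    """Place per-row values into a dense (time, baseline) array,
    leaving NaN where a baseline is absent at a given time step."""
    out = np.full((n_time, n_baseline), np.nan)
    out[time_idx, baseline_idx] = values
    return out

# Rows per time step vary: time 1 is missing baseline 1 (antenna dropout).
time_idx = np.array([0, 0, 1, 2, 2])
baseline_idx = np.array([0, 1, 0, 0, 1])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# np.bincount counts rows per time step; the cumulative sum gives each
# time step's starting row, so a chunk covering times [t0, t1) reads
# rows [offsets[t0], offsets[t1]) without assuming a constant stride.
counts = np.bincount(time_idx, minlength=3)
offsets = np.concatenate(([0], np.cumsum(counts)))

dense = scatter_rows(time_idx, baseline_idx, values, 3, 2)
```

The key point is that row offsets are derived from the data rather than computed as `time_index * n_baseline`, which would mis-place every row after the first dropout.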

- `conversion.py`: extend `get_read_col_conversion_function()` to
  dispatch to the new sparse reader when `parallel_mode="sparse"`;
  update docstring to reflect three modes (`none`, `time`, `sparse`).
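The three-way dispatch could look like the sketch below. The reader functions are stubbed out here so the example is self-contained; only the mode names and the dispatcher's name come from the commit message, everything else is assumed.

```python
# Placeholder readers standing in for the real implementations in read.py.
def read_col_conversion(*args):
    return "eager"

def read_col_conversion_dask(*args):
    return "time-chunked"

def read_col_conversion_dask_sparse(*args):
    return "sparse"


def get_read_col_conversion_function(parallel_mode: str):
    """Pick a column reader for the given parallel_mode:
    'none' (eager), 'time' (fixed-stride dask chunks), or
    'sparse' (variable rows per time step)."""
    readers = {
        "none": read_col_conversion,
        "time": read_col_conversion_dask,
        "sparse": read_col_conversion_dask_sparse,
    }
    try:
        return readers[parallel_mode]
    except KeyError:
        raise ValueError(f"unknown parallel_mode: {parallel_mode!r}") from None
```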

- `_msv2_backend.py`: new xarray `BackendEntrypoint` registering the
  `xradio:msv2` engine, so `xr.open_datatree(path, engine="xradio:msv2")`
  works out of the box.
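A backend entry point of this shape might look like the following sketch. The class body here is illustrative, not the actual `_msv2_backend.py`; the `.ms`-suffix heuristic and method bodies are assumptions (with a stand-in base class so the sketch imports even without xarray installed).

```python
try:
    from xarray.backends import BackendEntrypoint
except ImportError:  # stand-in so this sketch runs without xarray
    class BackendEntrypoint:
        pass


class MSv2BackendEntrypoint(BackendEntrypoint):
    """Sketch of an xarray backend for the ``xradio:msv2`` engine."""

    description = "Lazily open a CASA MSv2 as an MSv4-schema DataTree"

    def guess_can_open(self, filename_or_obj):
        # Heuristic only: an MSv2 is a directory whose name ends in ".ms".
        return str(filename_or_obj).lower().endswith(".ms")

    def open_datatree(self, filename_or_obj, **kwargs):
        # Would delegate to the lazy opener proposed in this PR.
        raise NotImplementedError
```

For `engine="xradio:msv2"` to work out of the box, the class would be registered under the `xarray.backends` entry-point group in `pyproject.toml`, which is how xarray discovers third-party engines.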
codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 216 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/xradio/measurement_set/open_msv2.py | 14.72% | 139 Missing ⚠️ |
| ...radio/measurement_set/_utils/_msv2/_tables/read.py | 7.57% | 61 Missing ⚠️ |
| src/xradio/measurement_set/_msv2_backend.py | 0.00% | 15 Missing ⚠️ |
| .../xradio/measurement_set/_utils/_msv2/conversion.py | 85.71% | 1 Missing ⚠️ |

