
Conversation


@Tomatokeftes Tomatokeftes commented Jan 27, 2026

Summary

This PR adds support for lazy loading of tables in SpatialData using anndata's experimental read_lazy() function.

Motivation

Currently, all elements in SpatialData (images, labels, points) are loaded lazily via Dask, except tables, which are always read fully into memory. For large datasets, particularly Mass Spectrometry Imaging (MSI) data, where tables can contain millions of pixels with hundreds of thousands of m/z bins, this creates a memory bottleneck.

Changes

  • Add lazy: bool = False parameter to SpatialData.read() and read_zarr()
  • Add lazy: bool = False parameter to _read_table() in io_table.py
  • Use anndata.experimental.read_lazy() when lazy=True
  • Add _is_lazy_anndata() helper function to detect lazy AnnData objects
  • Modify validation to skip eager checks for lazy tables (prevents defeating lazy loading)
  • Add a fallback to eager loading, with a warning, if the installed anndata version does not support read_lazy

Usage

from spatialdata import SpatialData

# Load tables lazily (keeps large tables out of memory)
sdata = SpatialData.read("large_dataset.zarr", lazy=True)

# Access table - data is loaded on-demand
table = sdata.tables["my_table"]
# table.X is now backed by Dask/Zarr, not loaded into memory
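
Once `table.X` is Dask-backed, reductions and slices stay lazy until explicitly computed. A small self-contained sketch (using a synthetic Dask array as a stand-in for a lazily loaded `table.X`):

```python
import dask.array as da
import numpy as np

# Stand-in for a lazily loaded table.X: a 3x4 Dask array in 2x2 chunks.
X = da.from_array(np.arange(12, dtype=np.float32).reshape(3, 4), chunks=(2, 2))

# Building the graph allocates no data; .compute() materializes the result.
col_means = X.mean(axis=0).compute()
print(col_means)  # per-column means of the stand-in array
```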

Reproducible Example

This self-contained example demonstrates lazy loading with 99% memory savings:

import numpy as np
from scipy import sparse
import anndata as ad
import psutil
import tempfile
from pathlib import Path

# Create synthetic sparse data (100k pixels x 100k m/z bins, 3000 peaks/pixel)
rng = np.random.default_rng(42)
n_pixels, n_mz, peaks_per_pixel = 100000, 100000, 3000
nnz = int(n_pixels * peaks_per_pixel)

X = sparse.csc_matrix(
    (rng.lognormal(7, 1.5, nnz).astype(np.float32),
     (rng.integers(0, n_pixels, nnz), rng.integers(0, n_mz, nnz))),
    shape=(n_pixels, n_mz)
)
print(f"Sparse matrix: {X.shape}, nnz={X.nnz:,}")

# Create and write AnnData
grid_width = int(np.ceil(np.sqrt(n_pixels)))
adata = ad.AnnData(
    X=X,
    obs={"x": np.arange(n_pixels) % grid_width, "y": np.arange(n_pixels) // grid_width},
)
adata.obs_names = [f"pixel_{i}" for i in range(n_pixels)]
adata.var_names = [f"mz_{i}" for i in range(n_mz)]

zarr_path = Path(tempfile.mkdtemp()) / "test.zarr"
adata.write_zarr(str(zarr_path))

# Compare lazy vs eager loading
from anndata.experimental import read_lazy
from anndata import read_zarr

def get_mem():
    return psutil.Process().memory_info().rss / 1e6

# Lazy loading
mem_before = get_mem()
adata_lazy = read_lazy(str(zarr_path))
mem_lazy = get_mem() - mem_before

# Eager loading  
mem_before = get_mem()
adata_eager = read_zarr(str(zarr_path))
mem_eager = get_mem() - mem_before

print(f"Lazy:  +{mem_lazy:.1f} MB, X type: {type(adata_lazy.X).__name__}")
print(f"Eager: +{mem_eager:.1f} MB, X type: {type(adata_eager.X).__name__}")
print(f"Memory savings: {(1 - mem_lazy/mem_eager)*100:.0f}%")

Expected output:

Sparse matrix: (100000, 100000), nnz=295,547,702
Lazy:  +15.4 MB, X type: Array
Eager: +2270.7 MB, X type: csc_matrix
Memory savings: 99%

Requirements

  • Requires anndata >= 0.12 for lazy loading support
  • Falls back to eager loading with a warning if anndata version is older
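
The fallback described above could be gated on an import check along these lines; the function name here is hypothetical, and the PR wires the equivalent logic into `_read_table()`:

```python
import warnings


def read_table_lazily(path):
    """Use anndata's experimental read_lazy when available; otherwise
    warn and read the table eagerly (sketch, name hypothetical)."""
    try:
        from anndata.experimental import read_lazy
    except ImportError:
        warnings.warn(
            "Installed anndata lacks experimental.read_lazy "
            "(requires anndata >= 0.12); falling back to eager loading.",
            UserWarning,
            stacklevel=2,
        )
        from anndata import read_zarr
        return read_zarr(path)
    return read_lazy(path)
```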

Real-world use case

This feature was developed for Thyra, a Mass Spectrometry Imaging converter. MSI datasets can have:

  • Millions of pixels (observations)
  • Hundreds of thousands of m/z bins (variables)
  • Resulting in tables that exceed available RAM

With lazy loading, users can work with these datasets without loading the full table into memory.

Test plan

  • test_lazy_read_basic - Verify lazy=True creates a SpatialData object without errors
  • test_lazy_false_loads_normally - Verify lazy=False maintains current behavior
  • test_read_zarr_lazy_parameter - Verify lazy parameter is passed through correctly
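
The tests might take roughly this pytest shape; the store layout and `importorskip` guard are assumptions for illustration, not the PR's actual test code:

```python
import pytest


def test_lazy_read_basic(tmp_path):
    # Reading an (empty) store with lazy=True should yield a SpatialData
    # object without errors.
    sd = pytest.importorskip("spatialdata")
    path = tmp_path / "sdata.zarr"
    sd.SpatialData().write(path)
    sdata = sd.SpatialData.read(path, lazy=True)
    assert isinstance(sdata, sd.SpatialData)


def test_lazy_false_loads_normally(tmp_path):
    # lazy=False must preserve the current eager behavior.
    sd = pytest.importorskip("spatialdata")
    path = tmp_path / "sdata.zarr"
    sd.SpatialData().write(path)
    sdata = sd.SpatialData.read(path, lazy=False)
    assert isinstance(sdata, sd.SpatialData)
```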

Tomatokeftes and others added 2 commits January 27, 2026 11:58
Add a `lazy` parameter to `SpatialData.read()` and `read_zarr()` that enables
lazy loading of tables using anndata's experimental `read_lazy()` function.

This is particularly useful for large datasets (e.g., Mass Spectrometry Imaging
with millions of pixels) where loading tables into memory is not feasible.

Changes:
- Add `lazy: bool = False` parameter to `read_zarr()` in io_zarr.py
- Add `lazy: bool = False` parameter to `_read_table()` in io_table.py
- Add `lazy: bool = False` parameter to `SpatialData.read()` in spatialdata.py
- Add `_is_lazy_anndata()` helper to detect lazy AnnData objects
- Skip eager validation for lazy tables to preserve lazy loading benefits
- Add tests for lazy loading functionality

Requires anndata >= 0.12 for lazy loading support. Falls back to eager loading
with a warning if anndata version does not support read_lazy.
@Tomatokeftes Tomatokeftes marked this pull request as draft January 27, 2026 11:03
- Simplify if/return pattern in _is_lazy_anndata (SIM103)
- Add missing TableModel import in test fixture (F821)
- Use modern np.random.Generator instead of np.random.rand (NPY002)

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.17%. Comparing base (ef88b5c) to head (120d11d).

Files with missing lines          Patch %   Lines
src/spatialdata/_io/io_table.py   63.63%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1055      +/-   ##
==========================================
- Coverage   92.21%   92.17%   -0.04%     
==========================================
  Files          49       49              
  Lines        7593     7607      +14     
==========================================
+ Hits         7002     7012      +10     
- Misses        591      595       +4     
Files with missing lines               Coverage Δ
src/spatialdata/_core/spatialdata.py   91.96% <100.00%> (ø)
src/spatialdata/_io/io_zarr.py         92.38% <100.00%> (+0.07%) ⬆️
src/spatialdata/models/models.py       88.70% <100.00%> (+0.10%) ⬆️
src/spatialdata/_io/io_table.py        83.67% <63.63%> (-6.58%) ⬇️

@Tomatokeftes Tomatokeftes marked this pull request as ready for review January 27, 2026 13:01