
Conversation


@Tomatokeftes Tomatokeftes commented Jan 27, 2026

Summary

This PR adds support for lazy loading of tables in SpatialData using anndata's experimental read_lazy() function.

Motivation

Currently, all elements in SpatialData (images, labels, points) are loaded lazily via Dask, except tables, which are always read fully into memory. For large datasets, particularly Mass Spectrometry Imaging (MSI) data, where tables can contain millions of pixels with hundreds of thousands of m/z bins, this creates a memory bottleneck.

Changes

  • Add lazy: bool = False parameter to SpatialData.read() and read_zarr()
  • Add lazy: bool = False parameter to _read_table() in io_table.py
  • Use anndata.experimental.read_lazy() when lazy=True
  • Add _is_lazy_anndata() helper function to detect lazy AnnData objects
  • Modify validation to skip eager checks for lazy tables (prevents defeating lazy loading)
  • Add a fallback to eager loading, with a warning, if the installed anndata version does not support read_lazy

Usage

from spatialdata import SpatialData

# Load tables lazily (keeps large tables out of memory)
sdata = SpatialData.read("large_dataset.zarr", lazy=True)

# Access table - data is loaded on-demand
table = sdata.tables["my_table"]
# table.X is now backed by Dask/Zarr, not loaded into memory
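
Once `table.X` is Dask-backed, reductions and slices stay lazy until explicitly computed. A small self-contained sketch (using a synthetic Dask array as a stand-in for a lazily loaded `table.X`):

```python
import dask.array as da
import numpy as np

# Stand-in for a lazily loaded table.X: a 3x4 Dask array in 2x2 chunks.
X = da.from_array(np.arange(12, dtype=np.float32).reshape(3, 4), chunks=(2, 2))

# Building the graph allocates no data; .compute() materializes the result.
col_means = X.mean(axis=0).compute()
print(col_means)  # per-column means of the stand-in array
```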

Reproducible Example

This self-contained example demonstrates lazy loading with 99% memory savings:

import numpy as np
from scipy import sparse
import anndata as ad
import psutil
import tempfile
from pathlib import Path

# Create synthetic sparse data (100k pixels x 100k m/z bins, 3000 peaks/pixel)
rng = np.random.default_rng(42)
n_pixels, n_mz, peaks_per_pixel = 100000, 100000, 3000
nnz = int(n_pixels * peaks_per_pixel)

X = sparse.csc_matrix(
    (rng.lognormal(7, 1.5, nnz).astype(np.float32),
     (rng.integers(0, n_pixels, nnz), rng.integers(0, n_mz, nnz))),
    shape=(n_pixels, n_mz)
)
print(f"Sparse matrix: {X.shape}, nnz={X.nnz:,}")

# Create and write AnnData
grid_width = int(np.ceil(np.sqrt(n_pixels)))
adata = ad.AnnData(
    X=X,
    obs={"x": np.arange(n_pixels) % grid_width, "y": np.arange(n_pixels) // grid_width},
)
adata.obs_names = [f"pixel_{i}" for i in range(n_pixels)]
adata.var_names = [f"mz_{i}" for i in range(n_mz)]

zarr_path = Path(tempfile.mkdtemp()) / "test.zarr"
adata.write_zarr(str(zarr_path))

# Compare lazy vs eager loading
from anndata.experimental import read_lazy
from anndata import read_zarr

def get_mem():
    return psutil.Process().memory_info().rss / 1e6

# Lazy loading
mem_before = get_mem()
adata_lazy = read_lazy(str(zarr_path))
mem_lazy = get_mem() - mem_before

# Eager loading  
mem_before = get_mem()
adata_eager = read_zarr(str(zarr_path))
mem_eager = get_mem() - mem_before

print(f"Lazy:  +{mem_lazy:.1f} MB, X type: {type(adata_lazy.X).__name__}")
print(f"Eager: +{mem_eager:.1f} MB, X type: {type(adata_eager.X).__name__}")
print(f"Memory savings: {(1 - mem_lazy/mem_eager)*100:.0f}%")

Expected output:

Sparse matrix: (100000, 100000), nnz=295,547,702
Lazy:  +15.4 MB, X type: Array
Eager: +2270.7 MB, X type: csc_matrix
Memory savings: 99%

Requirements

  • Requires anndata >= 0.12 for lazy loading support
  • Falls back to eager loading with a warning if anndata version is older
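
The fallback described above could be gated on an import check along these lines; the function name here is hypothetical, and the PR wires the equivalent logic into `_read_table()`:

```python
import warnings


def read_table_lazily(path):
    """Use anndata's experimental read_lazy when available; otherwise
    warn and read the table eagerly (sketch, name hypothetical)."""
    try:
        from anndata.experimental import read_lazy
    except ImportError:
        warnings.warn(
            "Installed anndata lacks experimental.read_lazy "
            "(requires anndata >= 0.12); falling back to eager loading.",
            UserWarning,
            stacklevel=2,
        )
        from anndata import read_zarr
        return read_zarr(path)
    return read_lazy(path)
```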

Real-world use case

This feature was developed for Thyra, a Mass Spectrometry Imaging converter. MSI datasets can have:

  • Millions of pixels (observations)
  • Hundreds of thousands of m/z bins (variables)
  • Resulting in tables that exceed available RAM

With lazy loading, users can work with these datasets without loading the full table into memory.

Test plan

  • test_lazy_read_basic - Verify lazy=True creates a SpatialData object without errors
  • test_lazy_false_loads_normally - Verify lazy=False maintains current behavior
  • test_read_zarr_lazy_parameter - Verify lazy parameter is passed through correctly
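
The tests might take roughly this pytest shape; the store layout and `importorskip` guard are assumptions for illustration, not the PR's actual test code:

```python
import pytest


def test_lazy_read_basic(tmp_path):
    # Reading an (empty) store with lazy=True should yield a SpatialData
    # object without errors.
    sd = pytest.importorskip("spatialdata")
    path = tmp_path / "sdata.zarr"
    sd.SpatialData().write(path)
    sdata = sd.SpatialData.read(path, lazy=True)
    assert isinstance(sdata, sd.SpatialData)


def test_lazy_false_loads_normally(tmp_path):
    # lazy=False must preserve the current eager behavior.
    sd = pytest.importorskip("spatialdata")
    path = tmp_path / "sdata.zarr"
    sd.SpatialData().write(path)
    sdata = sd.SpatialData.read(path, lazy=False)
    assert isinstance(sdata, sd.SpatialData)
```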

Tomatokeftes and others added 2 commits January 27, 2026 11:58
Add a `lazy` parameter to `SpatialData.read()` and `read_zarr()` that enables
lazy loading of tables using anndata's experimental `read_lazy()` function.

This is particularly useful for large datasets (e.g., Mass Spectrometry Imaging
with millions of pixels) where loading tables into memory is not feasible.

Changes:
- Add `lazy: bool = False` parameter to `read_zarr()` in io_zarr.py
- Add `lazy: bool = False` parameter to `_read_table()` in io_table.py
- Add `lazy: bool = False` parameter to `SpatialData.read()` in spatialdata.py
- Add `_is_lazy_anndata()` helper to detect lazy AnnData objects
- Skip eager validation for lazy tables to preserve lazy loading benefits
- Add tests for lazy loading functionality

Requires anndata >= 0.12 for lazy loading support. Falls back to eager loading
with a warning if anndata version does not support read_lazy.
@Tomatokeftes Tomatokeftes marked this pull request as draft January 27, 2026 11:03
- Simplify if/return pattern in _is_lazy_anndata (SIM103)
- Add missing TableModel import in test fixture (F821)
- Use modern np.random.Generator instead of np.random.rand (NPY002)

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.17%. Comparing base (ef88b5c) to head (120d11d).

Files with missing lines          Patch %   Lines
src/spatialdata/_io/io_table.py   63.63%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1055      +/-   ##
==========================================
- Coverage   92.21%   92.17%   -0.04%     
==========================================
  Files          49       49              
  Lines        7593     7607      +14     
==========================================
+ Hits         7002     7012      +10     
- Misses        591      595       +4     
Files with missing lines               Coverage Δ
src/spatialdata/_core/spatialdata.py   91.96% <100.00%> (ø)
src/spatialdata/_io/io_zarr.py         92.38% <100.00%> (+0.07%) ⬆️
src/spatialdata/models/models.py       88.70% <100.00%> (+0.10%) ⬆️
src/spatialdata/_io/io_table.py        83.67% <63.63%> (-6.58%) ⬇️

@Tomatokeftes Tomatokeftes marked this pull request as ready for review January 27, 2026 13:01