
New multiple table in-memory design #298

@LucaMarconato

Description

Scope of the proposal

After receiving feedback from @EliHei2 @lopollar @melonora @LLehner @AdemSaglamRB, hearing from users that multiple table support is an important requirement for adopting SpatialData in their tools (Maks, Louis), and having personally examined the development implications of the current table specification, I propose a new in-memory design for supporting multiple tables in SpatialData, which I have been maturing over the past months. Importantly, the on-disk table specification will not change.

Timeline and feedback

I propose to implement the new in-memory representation after the following work: aggregation and query refactoring (me), IO refactoring (@giovp), new pipeline for daily testing of all the notebooks (me), synthetic datasets for R/JS interoperability (me), napari bugfix (me and others). So ideally we would start in 1-1.5 months, enough time for receiving feedback and improving the design. In particular, I kindly ask for feedback from @kevinyamauchi @ivirshup @gtca @ilia-kats. I put some usernames in plain text so as not to spam them now, but I will tag everybody after the first draft is refined, and additionally @timtreis @berombau @joshmoore.

Disk specification

The disk storage will remain the one described in ome/ngff#64. What I propose to change is the in-memory representation of the data: instead of directly mirroring in memory the metadata that we have on disk, I propose to put the data in a form that is more ergonomic and that follows the common indexing practices of (geo)pandas, anndata and muon.
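
To make this concrete, here is a minimal sketch of how the writer could rebuild the on-disk columns at save time from per-element in-memory tables. This is just a sketch of the idea, not the final implementation; it reuses the region/region_key/instance_key convention shown in the current-approach example below.

import numpy as np
from anndata import AnnData, concat

# hypothetical per-element tables, keyed by the name of the annotated element
tables = {
    "visium0": AnnData(np.random.rand(10, 20000)),
    "visium1": AnnData(np.random.rand(10, 20000)),
}

# at write time, rebuild the columns required by the on-disk spec
for name, adata in tables.items():
    adata.obs["region"] = name                  # region_key column
    adata.obs["instance_id"] = adata.obs.index  # instance_key column
adata_disk = concat(tables.values(), join="outer")
adata_disk.obs["region"] = adata_disk.obs["region"].astype("category")
adata_disk.uns["spatialdata_attrs"] = {
    "region": list(tables),
    "region_key": "region",
    "instance_key": "instance_id",
}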

In-memory representation

I propose the following.

  • To stop having one single table shared by the whole SpatialData object and instead allow each element to have either an AnnData table or a MuData table associated with it (or no table at all). To keep the implementation smaller, I would start with just an AnnData table, and allow using a MuData table in a follow-up PR. On disk, the MuData table would be saved as a list of tables, so that the current specs remain valid.
  • To not represent region, region_key, instance_key in memory, and instead rely on a shared index between the table and the (single) corresponding shapes/labels element (code examples below). This may require allowing AnnData to use custom indices and not just strings (or at least also integers), or it would force the (geo)dataframes to have string indices. We will find a way around this.
  • As a consequence of the two points above, I want to stress that this would disallow a table from referring to multiple shapes/labels elements. If the user wants a single table annotating multiple elements, they would need to merge the tables, and then split them again to save them. I will show code examples for doing this later. I believe the performance implications of this manual merge and split are negligible compared to the other data operations that we have (query, aggregation, etc.), especially since the current approach already requires a lot of internal splits and merges anyway.
  • For points, which are DaskDataFrame (and soon also DataFrame), and shapes, which are GeoDataFrame (and soon maybe also DaskGeoDataFrame), the user can choose to store metadata as additional columns of the dataframe, without having to add a new table. I wrote a helper function (get_values()), similar to scanpy.get.obs_df(), that allows retrieving a column from the dataframe, from obs or from var, so having information located either in the dataframe or in the table is not a problem; a sketch of its usage follows this list. Further note: the implementation of get_values() would become much simpler with the new table representation.
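
To illustrate the last point, here is a minimal sketch of a get_values()-style lookup under the new design. The function below is purely illustrative; the actual helper in spatialdata may have a different signature.

import geopandas as gpd
import numpy as np
import pandas as pd
from anndata import AnnData
from shapely.geometry import Point

# a shapes element with per-instance metadata stored directly as a column ...
circles = gpd.GeoDataFrame(
    {
        "geometry": [Point(x, y) for x, y in np.random.rand(10, 2)],
        "radius": 1.0,
        "cell_type": ["Neuron"] * 5 + ["Glia"] * 5,
    }
)
# ... and a table annotating the same element, sharing its index
table = AnnData(
    np.random.rand(10, 20000),
    obs=pd.DataFrame({"n_counts": np.arange(10)}, index=circles.index.astype(str)),
)

def get_values(key, element, table):
    """Hypothetical lookup: fetch `key` from the dataframe columns, from obs or from var."""
    if key in element.columns:
        return element[key]
    if key in table.obs.columns:
        return table.obs[key]
    if key in table.var_names:
        return pd.Series(np.asarray(table[:, key].X).ravel(), index=table.obs_names)
    raise KeyError(key)

get_values("cell_type", circles, table)  # stored on the GeoDataFrame
get_values("n_counts", circles, table)   # stored in the table's obs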

Code example 1: 3 Visium slides, one common table.

Let's see how the current code and the new code would compare.

Current approach

from spatialdata.models import TableModel, ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData

# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)

# shared table
adata = AnnData(np.random.rand(30, 20000))
adata.obs['region'] = ['visium0'] * 10 + ['visium1'] * 10 + ['visium2'] * 10
adata.obs['region'] = adata.obs['region'].astype('category')
adata.obs['instance_id'] = np.tile(np.arange(10), 3)
adata = TableModel.parse(adata, region=['visium0', 'visium1', 'visium2'], region_key='region', instance_key='instance_id')

sdata = SpatialData(shapes={'visium0': visium_locations0, 'visium1': visium_locations1, 'visium2': visium_locations2}, table=adata)
sdata

which gives

SpatialData object with:
├── Shapes
│     ├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
│     ├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
│     └── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 30 × 20000
          obs: 'region', 'instance_id'
          uns: 'spatialdata_attrs'
with coordinate systems:
▸ 'global', with elements:
        visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)

Common operations on the table, such as finding all the rows that correspond to an element (such as visium1), sorted in the same order as the element, are available via helper functions, in this case the following:

from spatialdata import match_table_to_element

match_table_to_element(sdata=sdata, element_name='visium1')

which gives

AnnData object with n_obs × n_vars = 10 × 20000
    obs: 'region', 'instance_id'
    uns: 'spatialdata_attrs'

But even a simple operation like this one requires laborious code (see here and the function called internally here). This requires a lot of work for us to implement and maintain, and hinders the user's ability to easily extend the codebase or write a custom implementation to cover specific use cases.

In addition, some users told me that the region, region_key, instance_key approach does not feel natural to them, and they end up not using that metadata.

New approach (pseudocode)

from spatialdata.models import TableModel, ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData

# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)

# per-element tables
adata0 = AnnData(np.random.rand(10, 20000))
adata1 = AnnData(np.random.rand(10, 20000))
adata2 = AnnData(np.random.rand(10, 20000))
sdata = SpatialData(
    shapes={
        "visium0": visium_locations0, 
        "visium1": visium_locations1, 
        "visium2": visium_locations2
    }, 
    tables={
        "visium0": adata0, 
        "visium1": adata1, 
        "visium2": adata2
    },
)
sdata

which is much simpler than the previous approach. The code would give

SpatialData object with:
└── Shapes
      ├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
      │   └── AnnData object with n_obs × n_vars = 10 × 20000
      ├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
      │   └── AnnData object with n_obs × n_vars = 10 × 20000
      └── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
          └── AnnData object with n_obs × n_vars = 10 × 20000
with coordinate systems:
▸ 'global', with elements:
        visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)

Getting the rows of the table corresponding to visium1, with the row order matching the element, would not require helper functions and would be as simple as

indices = sdata.shapes['visium1'].index
sdata.tables['visium1'][indices, :]
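
The reverse direction, ordering the shapes rows to match the table, is equally direct (a sketch, assuming the shared string indices discussed above):

# order/subset the shapes element to match the rows of its table
obs_names = sdata.tables['visium1'].obs_names
sdata.shapes['visium1'].loc[obs_names]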

Concatenating and splitting

If the user needs to concatenate the various subtables into a single one, a way to do this would be something like the following (still a bit laborious, but much less so than using region, region_key and instance_key, and without introducing a knowledge entry barrier for the user). We could also bundle it in a helper function, as sketched after the splitting example.

Merging the tables into a global one:

from anndata import concat

adata_full = concat((adata0, adata1, adata2), keys=['visium0', 'visium1', 'visium2'], index_unique='_')

Splitting the merged table back:

adata0 = adata_full[adata_full.obs.index.map(lambda x: x.endswith('_visium0'))].copy()
adata0.obs.index = adata0.obs.index.map(lambda x: x.replace('_visium0', ''))
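
A possible way to bundle the two operations into helpers (a sketch; the function names and signatures are illustrative, not an existing spatialdata API):

from anndata import AnnData, concat

def concatenate_tables(tables: dict[str, AnnData], sep: str = "_") -> AnnData:
    """Concatenate per-element tables, suffixing the obs names with the element name."""
    return concat(tables.values(), keys=list(tables), index_unique=sep)

def split_tables(adata_full: AnnData, element_names: list[str], sep: str = "_") -> dict[str, AnnData]:
    """Split a concatenated table back into per-element tables."""
    tables = {}
    for name in element_names:
        suffix = f"{sep}{name}"
        sub = adata_full[adata_full.obs.index.str.endswith(suffix)].copy()
        sub.obs.index = sub.obs.index.str[: -len(suffix)]  # strip the element suffix
        tables[name] = sub
    return tables

# usage
adata_full = concatenate_tables({"visium0": adata0, "visium1": adata1, "visium2": adata2})
per_element = split_tables(adata_full, ["visium0", "visium1", "visium2"])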

Code example 2: Visium + Xenium

Currently, to represent the Visium + Xenium data we need multiple SpatialData objects, and we also use AnnData layers and .obsm to store the results of aggregation.
For instance, this is the visium_roi_sdata object produced by the Visium + Xenium notebook "00":

SpatialData object with:
├── Shapes
│     └── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 2826 × 307
          obs: 'in_tissue', 'array_row', 'array_col', 'spot_id', 'region', 'dataset', 'clone'
          var: 'gene_ids', 'feature_types', 'genome'
          uns: 'spatial', 'spatialdata_attrs'
          obsm: 'spatial', 'xe_rep1_celltype_major', 'xe_rep1_celltype_minor', 'xe_rep2_celltype_major', 'xe_rep2_celltype_minor'
          layers: 'xe_rep1_cells', 'xe_rep2_cells', 'xe_rep1_tx', 'xe_rep2_tx'
with coordinate systems:
▸ 'aligned', with elements:
        CytAssist_FFPE_Human_Breast_Cancer (Shapes)

Note in particular that:

  • Along with this object we have two more SpatialData objects, one for each of the Xenium replicates 1 and 2.
  • the result of the aggregation of gene expression from Xenium cells to Visium circles from replicate 1 is saved in table.layers['xe_rep1_cells']
  • the result of the aggregation of transcripts from Xenium points to Visium circles from replicate 1 is saved in table.layers['xe_rep1_tx']
  • the result of the aggregation of cell type from Xenium cells to Visium circles (cell types fractions) from replicate 1, is saved in table.obsm['xe_rep1_celltype_major']

Using MuData tables would keep things more organized, something like this:

SpatialData object with:
└── Shapes
    └── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
        └── MuData object with n_obs × n_vars = 2826 × 1535.
            ├─── 'visium expression': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_cells': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep2_cells': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_tx': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep2_tx': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
            ├─── 'xe_rep2_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
            ├─── 'xe_rep1_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
            └─── 'xe_rep2_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
with coordinate systems:
▸ 'aligned', with elements:
        CytAssist_FFPE_Human_Breast_Cancer (Shapes)

Notes:

  • now, in the same object, we could also store the Xenium data if we wanted; no need to use different SpatialData objects.
  • if we start with an implementation that doesn't use MuData but only AnnData, we can still obtain something similar to the above by duplicating the shapes element. Suboptimal, but the duplicated data has a low memory impact compared to the images, so it's a good temporary compromise.
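
For reference, assembling such a MuData table from the individual AnnData objects would be a one-liner. A sketch; the variable names below are hypothetical placeholders for the AnnData objects listed in the printout above:

import mudata as md

# each value is an AnnData with 2826 obs, aligned to the Visium circles
mdata = md.MuData({
    "visium_expression": visium_expression,
    "xe_rep1_cells": xe_rep1_cells,
    "xe_rep1_tx": xe_rep1_tx,
    "xe_rep1_celltype_major": xe_rep1_celltype_major,
    "xe_rep1_celltype_minor": xe_rep1_celltype_minor,
    # ... and the corresponding rep2 modalities
})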

Further considerations

Rows annotating multiple regions

Sometimes we would like rows of a table to annotate multiple instances in different regions (e.g. the same cell represented as a circle, as a polygon and as a label). We had some discussions in #34 and #99 and on Zulip, but I came to the conclusion that the easiest solution is simply to duplicate the table, having identical copies annotate the respective regions elements. In particular, this sidesteps the problem of the elements having different indices, or of one regions element having more or fewer instances than another. Matching indices can always be recomputed on the fly using geopandas.sjoin(), or the equivalent functions for raster data that we will need to write anyway to extend aggregate(); see the sketch below.
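
A sketch of how such a matching could be recomputed with geopandas, assuming two in-memory shapes elements, circles and polygons, that represent the same cells:

import geopandas as gpd

# spatial join: for each circle, find the polygon(s) it intersects
matches = gpd.sjoin(circles, polygons, how="inner", predicate="intersects")
# the index of `matches` is the circle index; 'index_right' holds the matching polygon index
circle_to_polygon = matches["index_right"]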

Subset by var/obs

One of the arguments for having one single table instead of multiple tables was to support syntaxes like this one (see more in #43):

idx = (sdata.table.obs.cell_type == "Neuron") & (sdata.table.obs.n_counts > 100)
sdata = sdata[idx, :]

But the reality is that we haven't implemented such a function yet, and implementing these things can be really messy (as shown by the helper function discussed above to subset and reorder a table by a regions element). Furthermore, the implementation would also be made complex by the fact that the cell_type information is not necessarily found in the table, but could also be located in a column of the GeoDataFrame, as discussed above.

The syntax above would instead simply become something like this, which also feels very natural:

idx = (
    (sdata.tables['visium'].obs.cell_type == "Neuron")
    & (sdata.tables['visium'].obs.n_counts > 100)
)
sdata = SpatialData(
    shapes={"visium": sdata["visium"][idx]},
    tables={"visium": sdata.tables["visium"][idx]},
)

Other linked issues

Linked discussion issues:

Issues that will no longer be relevant with this new design:
