-
Notifications
You must be signed in to change notification settings - Fork 79
Description
Scope of the proposal
After receiving feedback from @EliHei2 @lopollar @melonora @LLehner @AdemSaglamRB, hearing users asking for multiple table support as an important requirement for the adoption of SpatialData in their tools (Maks, Louis) and having personally examined the development implications of the current table specification, I propose a new in-memory design for supporting multiple tables in SpatialData, that I have been maturing in the past months. Importantly, the on-disk table specification will not change.
Timeline and feedback
I propose to implement the new in-memory representation after the following work: aggregation and query refactoring (me), io refactoring (@giovp), new pipeline for daily tasting all the notebook (me), synthetic datasets for R/JS interoperability (me), napari bugfix (me and others). So ideally we would start in 1-1.5 month, enough time for receiving feedback and improving the design. In particular I kindly ask for feedback to @kevinyamauchi @ivirshup @gtca @ilia-kats. I put some usernames in raw not to spam them now, but I will tag everybody after the first draft is refined, and additionally @timtreis @berombau @joshmoore.
Disk specification
The disk storage will remain the one described in ome/ngff#64. What I propose to change is the in-memory representation of the data: instead of directly representing in-memory the metadata that we have on-disk, I propose to put the data in a form that is more ergonomic and that follows the (geo)pandas, anndata and muon indexing common practices.
In-memory representation
I propose the following.
- To stop having one single table shared for the
SpatialDataobject and instead allowing, for each object, to have either anAnnDatatable, either aMuDatatable associated (or also having no table associated to it). For keeping the implementation smaller, I would start just with oneAnnDatatable, and allow to use also aMuDatatable in a follow up PR. On disk, theMuDatatable would be saved as a list of tables, so that the current specs are still valid. - To not represent
region,region_key,instance_keyin-memory, and instead relying on the same shared indexing between the table and the (single) corresponding shapes/labels element. Code examples below. This may require allowingAnnDatato use custom indices and not just string (or at least integers), or would force the (geo)dataframes to have string indices. We will find a way around this. - As a consequence of the two points above I want to stress that this would disallow a table to refer to multiple shapes/labels elements. If the user wants a single table annotating multiple elements, the user would need to merge the table, and then split it again to save it. I will show code example for doing this later. I believe that the performance implication of having to do this manual merge and split are negligible compared to the other types of data operations that we have (query, aggregation, etc), and since the current approach requires a lot of internal splits and merges anyways.
- For points, that are
DaskDataFrame(and soon alsoDataFrame) and shapes, that areGeoDataFrame(and soon maybe alsoDaskGeoDataFrame), the user can choose to store metadata as additional columns of the dataframe, without having to add a new table. I wrote an helper function (get_values()), similar toscanpy.get.obs_df(), that allows to retrieve a column from the df, from obs or from var. So having information located either in the dataframe or in the table is not a problem. Further note, the implementation ofget_values()would become much simpler with the new table representation.
Code example 1: 3 visium slides, one common table.
Let's see how the current code and the new code would compare.
Current approach
from spatialdata.models import TableModel, ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData
# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
# shared table
adata = AnnData(np.random.rand(30, 20000))
adata.obs['region'] = (['visium0'] * 10 + ['visium1'] * 10 + ['visium2'] * 10).copy()
adata.obs['region'] = adata.obs['region'].astype('category')
adata.obs['instance_id'] = np.array(list(range(10)) + list(range(10)) + list(range(10)))
adata = TableModel.parse(adata, region=['visium0', 'visium1', 'visium2'], region_key='region', instance_key='instance_id')
sdata = SpatialData(shapes={'visium0': visium_locations0, 'visium1': visium_locations1, 'visium2': visium_locations2}, table=adata)
sdatawhich gives
SpatialData object with:
├── Shapes
│ ├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
│ ├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
│ └── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
└── Table
└── AnnData object with n_obs × n_vars = 30 × 20000
obs: 'region', 'instance_id'
uns: 'spatialdata_attrs': AnnData (30, 20000)
with coordinate systems:
▸ 'global', with elements:
visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)
Common operations on the table, such as finding all the rows that corresponds to an element (such as visium1), sorted in the same order, are available via helper functions, in this case the following:
from spatialdata import match_table_to_element
match_table_to_element(sdata=sdata, element_name='visium1')which gives
AnnData object with n_obs × n_vars = 10 × 20000
obs: 'region', 'instance_id'
uns: 'spatialdata_attrs'
But even a simple operation as this one requires laborious code (see here and the function called internally here). This requires lot of work for us to implement and maintain the code, and hinders the ability to the user to easily extend the codebase, or code a custom implementation to cover specific use cases.
In addition, I got told by some users that they don't feel natural/get the region, region_key, instance_key approach, and they end up not using that metadata.
New approach (pseudocode)
from spatialdata.models import TableModel, ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData
# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
# shared table
adata0 = AnnData(np.random.rand(10, 20000))
adata1 = AnnData(np.random.rand(10, 20000))
adata2 = AnnData(np.random.rand(10, 20000))
sdata = SpatialData(
shapes={
"visium0": visium_locations0,
"visium1": visium_locations1,
"visium2": visium_locations2
},
tables={
"visium0": adata0,
"visium1": adata1,
"visium2": adata2
},
)
sdatawhich is much simpler than the previous approach. The code would give
SpatialData object with:
└── Shapes
├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
│ └── AnnData object with n_obs × n_vars = 10 × 20000
├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
│ └── AnnData object with n_obs × n_vars = 10 × 20000
└── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
└── AnnData object with n_obs × n_vars = 10 × 20000
with coordinate systems:
▸ 'global', with elements:
visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)
Getting the table corresponding to visium1, making the order of the rows match, does not require helper functions and would be as simple as
indices = sdata.shapes['visium1'].index
sdata.tables['visium1'][indices, :]Concatenating and splitting
If the user needs to concatenate the various subtable to a unique one, a way to do this would be something like this (still a bit laborious but way less than using region, region_key and instance_key and not introducing a knowledge entry barrier to the user). Also, we could bundle it in an helper function.
Merging the table into a global one:
from anndata import concat
adata_full = concat((adata0, adata1, adata2), keys=['visium0', 'visium1', 'visium2'], index_unique='_')Splitting back the merged table
adata0 = adata_full[adata_full.obs.index.map(lambda x: x.endswith('_visium0'))].copy()
adata0.obs.index = adata0.obs.index.map(lambda x: x.replace('_visium0', ''))Code example 2: Visum + Xenium
Currently for representing the Visium + Xenium data we need multiple SpatialData objects and also using AnnData layers and .obsm to store the result of aggregation.
For instance this is the visium_roi_sdata object produced by the Visium + Xenium notebook "00":
SpatialData object with:
├── Shapes
│ └── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
└── Table
└── AnnData object with n_obs × n_vars = 2826 × 307
obs: 'in_tissue', 'array_row', 'array_col', 'spot_id', 'region', 'dataset', 'clone'
var: 'gene_ids', 'feature_types', 'genome'
uns: 'spatial', 'spatialdata_attrs'
obsm: 'spatial', 'xe_rep1_celltype_major', 'xe_rep1_celltype_minor', 'xe_rep2_celltype_major', 'xe_rep2_celltype_minor'
layers: 'xe_rep1_cells', 'xe_rep2_cells', 'xe_rep1_tx', 'xe_rep2_tx': AnnData (2826, 307)
with coordinate systems:
▸ 'aligned', with elements:
CytAssist_FFPE_Human_Breast_Cancer (Shapes)
Note in particular that:
- Along with this object we have two more
SpatialDataobjects, one for the Xenium replicate 1 and 2. - the result of the aggregation of gene expression from Xenium cells to Visium circles from replicate 1 is saved in
table.layers['xe_rep1_cells'] - the result of the aggregation of transcripts from Xenium points to Visium circles from replicate 1 is saved in
table.layers['xe_rep1_tx'] - the result of the aggregation of cell type from Xenium cells to Visium circles (cell types fractions) from replicate 1, is saved in
table.obsm['xe_rep1_celltype_major']
Using MuData tables would keep things more organized, something like this
SpatialData object with:
└── Shapes
└── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
└── MuData object with n_obs × n_vars = 2826 × 1535.
├─── 'visum expression': AnnData object with n_obs × n_vars = 2826 × 307.
├─── 'xe_rep1_cells': AnnData object with n_obs × n_vars = 2826 × 307.
├─── 'xe_rep2_cells': AnnData object with n_obs × n_vars = 2826 × 307.
├─── 'xe_rep1_tx': AnnData object with n_obs × n_vars = 2826 × 307.
├─── 'xe_rep2_tx': AnnData object with n_obs × n_vars = 2826 × 307.
├─── 'xe_rep1_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
├─── 'xe_rep2_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
├─── 'xe_rep1_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
└─── 'xe_rep2_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
with coordinate systems:
▸ 'aligned', with elements:
CytAssist_FFPE_Human_Breast_Cancer (Shapes)
Notes:
- now in the same object we could store also the Xenium data if we wanted. No need to use different
SpatialDataobjects. - if we start with an implementation that doesn't use
MuDatabut onlyAnnData, we can still obtain something similar to the above by duplicating theshapeslayer. Suboptimal, but still the duplicated data has low memory impact compared to the iages, so it's a good temporary compromise.
Further considerations
Rows annotating multiple regions
Sometimes we would like to have rows of a table annotating multiple instances in different regions (e.g. the same cell represented as a circle, as a polygon or as a label). We had some discussions in #34 and #99 and on Zulip, but I came to the conclusion that the easiest solution is just to duplicate the table, and having the same table annotating the regions object. In particular this solves the problem of having different indices or having more or less instances in one regions objects. Matching indices can always be recomputed on the fly using geopandas.sjoin() or equivalent functions for raster data that we will need to write anyway to extend aggregate().
Subset by var/obs
One of the arguments to have one single table instead than multiple tables was to support syntaxes like this one (see more here #43)
idx = (sdata.table.obs.cell_type == "Neuron") & (sdata.obs.n_counts > 100)
sdata = sdata[idx, :]But the reality is that we haven't implemented such function yet and that implementing these things can be really messy (as shown in the helper function discussed above to subset and reorder a table by a regions object). Furthermore the implementation would be made also complex by the fact that the cell_type information is not necessary found in the table, but could also be located in a column of the GeoDataFrame as discussed above.
The syntax above would instead simply become something like this, which also feels very natural
idx = (
(sdata.tables['visium'].obs.cell_type == "Neuron")
& (sdata.tables['visium'].n_counts > 100)
)
sdata = SpatialData(
shapes={"visium": sdata["visium"][idx]},
tables={"visium": sdata.tables["visium"].loc[idx, :]},
)Other linked issues
Linked discussion issues:
Issues that will be not relevant anymore with this new design:
- Crosstalk between shapes layer and table #267 reason: the new shared indexing design will improve the cross-talk between regions and the table
- Discussion: an element that is both a Circles/Points and Table at the same time #45 reason: we keep geometries and annotations decoupled, but thanks to the shared indexing coupling and decoupling becomes a trivial operation
- Check if the spatialdata_attrs are valid in
Table.validate()#130 reason: we drop theregion,region_keyandinstance_keymetadata for the in-memory representation (they will be used for IO only). - most of the checkboxes of Method to validate the relationship between elements #218 reason: in-memory the only relationships that are kept are by shared indexing