New multiple table in-memory design

# Scope of the proposal
After receiving feedback from `@EliHei2 @lopollar @melonora @LLehner @AdemSaglamRB`, hearing users asking for multiple table support as an important requirement for the adoption of `SpatialData` in their tools (`Maks, Louis`) and having personally examined the development implications of the current table specification, I propose a new **in-memory** design for supporting multiple tables in `SpatialData`, that I have been maturing in the past months. Importantly, **the on-disk table specification will not change**.

## Timeline and feedback
I propose to implement the new in-memory representation after the following work: aggregation and query refactoring (me), io refactoring (@giovp), new pipeline for daily tasting all the notebook (me), synthetic datasets for R/JS interoperability (me), napari bugfix (me and others). So ideally we would start in 1-1.5 month, enough time for receiving feedback and improving the design. In particular I kindly ask for feedback to @kevinyamauchi @ivirshup @gtca @ilia-kats. I put some usernames in `raw` not to spam them now, but I will tag everybody after the first draft is refined, and additionally `@timtreis @berombau @joshmoore`.

## Disk specification
The disk storage will remain the one described in https://github.com/ome/ngff/pull/64. What I propose to change is the in-memory representation of the data: instead of directly representing in-memory the metadata that we have on-disk, I propose to put the data in a form that is more ergonomic and that follows the `(geo)pandas`, `anndata` and `muon` indexing common practices.

## In-memory representation
I propose the following.
* To stop having one single table shared for the `SpatialData` object and instead allowing, for each object, to have either an `AnnData` table, either a `MuData` table associated (or also having no table associated to it). For keeping the implementation smaller, I would start just with one `AnnData` table, and allow to use also a `MuData` table in a follow up PR. On disk, the `MuData` table would be saved as a list of tables, so that the current specs are still valid.
* To not represent `region`, `region_key`, `instance_key` in-memory, and instead relying on the same shared indexing between the table and the (single) corresponding shapes/labels element. Code examples below. This may require allowing `AnnData` to use custom indices and not just string (or at least integers), or would force the (geo)dataframes to have string indices. We will find a way around this.
* As a consequence of the two points above I want to stress that this would disallow a table to refer to multiple shapes/labels elements. If the user wants a single table annotating multiple elements, the user would need to merge the table, and then split it again to save it. I will show code example for doing this later. I believe that the performance implication of having to do this manual merge and split are negligible compared to the other types of data operations that we have (query, aggregation, etc), and since the current approach requires a lot of internal splits and merges anyways.
* For points, that are `DaskDataFrame` (and soon also `DataFrame`) and shapes, that are `GeoDataFrame` (and soon maybe also `DaskGeoDataFrame`), the user can choose to store metadata as additional columns of the dataframe, without having to add a new table. I wrote an helper function ([`get_values()`](https://github.com/scverse/spatialdata/blob/2bd82d73243fc1337dc15d95a110b51be84576eb/src/spatialdata/_core/query/relational_query.py#L207)), similar to `scanpy.get.obs_df()`, that allows to retrieve a column from the df, from obs or from var. So having information located either in the dataframe or in the table is not a problem. Further note, the implementation of `get_values()` would become much simpler with the new table representation.

## Code example 1: 3 visium slides, one common table.
Let's see how the current code and the new code would compare.
### Current approach
```python
from spatialdata.models import TableModel, ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData

# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)

# shared table
adata = AnnData(np.random.rand(30, 20000))
adata.obs['region'] = (['visium0'] * 10 + ['visium1'] * 10 + ['visium2'] * 10).copy()
adata.obs['region'] = adata.obs['region'].astype('category')
adata.obs['instance_id'] = np.array(list(range(10)) + list(range(10)) + list(range(10)))
adata = TableModel.parse(adata, region=['visium0', 'visium1', 'visium2'], region_key='region', instance_key='instance_id')

sdata = SpatialData(shapes={'visium0': visium_locations0, 'visium1': visium_locations1, 'visium2': visium_locations2}, table=adata)
sdata
```
which gives
```
SpatialData object with:
├── Shapes
│     ├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
│     ├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
│     └── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 30 × 20000
    obs: 'region', 'instance_id'
    uns: 'spatialdata_attrs': AnnData (30, 20000)
with coordinate systems:
▸ 'global', with elements:
        visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)
```
Common operations on the table, such as finding all the rows that corresponds to an element (such as `visium1`), sorted in the same order, are available via helper functions, in this case the following:
```python
from spatialdata import match_table_to_element

match_table_to_element(sdata=sdata, element_name='visium1')
```
which gives
```
AnnData object with n_obs × n_vars = 10 × 20000
    obs: 'region', 'instance_id'
    uns: 'spatialdata_attrs'
```
But even a simple operation as this one requires laborious code ([see here](https://github.com/scverse/spatialdata/blob/2bd82d73243fc1337dc15d95a110b51be84576eb/src/spatialdata/_core/query/relational_query.py#LL134C12-L134C12) and the function called internally [here](https://github.com/scverse/spatialdata/blob/2bd82d73243fc1337dc15d95a110b51be84576eb/src/spatialdata/_core/query/relational_query.py#L55)). This requires lot of work for us to implement and maintain the code, and hinders the ability to the user to easily extend the codebase, or code a custom implementation to cover specific use cases. 

In addition, I got told by some users that they don't feel natural/get the `region`, `region_key`, `instance_key` approach, and they end up not using that metadata.

### New approach (pseudocode)
```python
from spatialdata.models import TableModel, ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData

# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)

# shared table
adata0 = AnnData(np.random.rand(10, 20000))
adata1 = AnnData(np.random.rand(10, 20000))
adata2 = AnnData(np.random.rand(10, 20000))
sdata = SpatialData(
    shapes={
        "visium0": visium_locations0, 
        "visium1": visium_locations1, 
        "visium2": visium_locations2
    }, 
    tables={
        "visium0": adata0, 
        "visium1": adata1, 
        "visium2": adata2
    },
)
sdata
```
which is much simpler than the previous approach. The code would give
```
SpatialData object with:
└── Shapes
      ├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
      │   └── AnnData object with n_obs × n_vars = 10 × 20000
      ├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
      │   └── AnnData object with n_obs × n_vars = 10 × 20000
      └── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
          └── AnnData object with n_obs × n_vars = 10 × 20000
with coordinate systems:
▸ 'global', with elements:
        visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)
```

Getting the table corresponding to `visium1`, making the order of the rows match, does not require helper functions and would be as simple as 
```python
indices = sdata.shapes['visium1'].index
sdata.tables['visium1'][indices, :]
```

#### Concatenating and splitting
If the user needs to concatenate the various subtable to a unique one, a way to do this would be something like this (still a bit laborious but way less than using `region`, `region_key` and `instance_key` and not introducing a knowledge entry barrier to the user). Also, we could bundle it in an helper function.

Merging the table into a global one:
```python
from anndata import concat

adata_full = concat((adata0, adata1, adata2), keys=['visium0', 'visium1', 'visium2'], index_unique='_')
```

Splitting back the merged table
```python
adata0 = adata_full[adata_full.obs.index.map(lambda x: x.endswith('_visium0'))].copy()
adata0.obs.index = adata0.obs.index.map(lambda x: x.replace('_visium0', ''))
```

## Code example 2: Visum + Xenium
Currently for representing the Visium + Xenium data we need multiple `SpatialData` objects and also using `AnnData` layers and `.obsm` to store the result of aggregation.
For instance this is the `visium_roi_sdata` object produced by the [Visium + Xenium notebook "00"](https://github.com/scverse/spatialdata-notebooks/blob/main/notebooks/paper_reproducibility/00_xenium_and_visium.ipynb):
```
SpatialData object with:
├── Shapes
│     └── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 2826 × 307
    obs: 'in_tissue', 'array_row', 'array_col', 'spot_id', 'region', 'dataset', 'clone'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial', 'spatialdata_attrs'
    obsm: 'spatial', 'xe_rep1_celltype_major', 'xe_rep1_celltype_minor', 'xe_rep2_celltype_major', 'xe_rep2_celltype_minor'
    layers: 'xe_rep1_cells', 'xe_rep2_cells', 'xe_rep1_tx', 'xe_rep2_tx': AnnData (2826, 307)
with coordinate systems:
▸ 'aligned', with elements:
        CytAssist_FFPE_Human_Breast_Cancer (Shapes)
```

Note in particular that:
- Along with this object we have two more `SpatialData` objects, one for the Xenium replicate 1 and 2.
- the result of the aggregation of gene expression from Xenium cells to Visium circles from replicate 1 is saved in `table.layers['xe_rep1_cells']`
- the result of the aggregation of transcripts from Xenium points to Visium circles from replicate 1 is saved in `table.layers['xe_rep1_tx']`
- the result of the aggregation of cell type from Xenium cells to Visium circles (cell types fractions) from replicate 1, is saved in `table.obsm['xe_rep1_celltype_major']`

Using `MuData` tables would keep things more organized, something like this
```
SpatialData object with:
└── Shapes
    └── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
        └── MuData object with n_obs × n_vars = 2826 × 1535.
            ├─── 'visum expression': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_cells': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep2_cells': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_tx': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep2_tx': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
            ├─── 'xe_rep2_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
            ├─── 'xe_rep1_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
            └─── 'xe_rep2_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
with coordinate systems:
▸ 'aligned', with elements:
        CytAssist_FFPE_Human_Breast_Cancer (Shapes)
```

Notes:
- now in the same object we could store also the Xenium data if we wanted. No need to use different `SpatialData` objects.
- if we start with an implementation that doesn't use `MuData` but only `AnnData`, we can still obtain something similar to the above by duplicating the `shapes` layer. Suboptimal, but still the duplicated data has low memory impact compared to the iages, so it's a good temporary compromise.

## Further considerations
### Rows annotating multiple regions
Sometimes we would like to have rows of a table annotating multiple instances in different regions (e.g. the same cell represented as a circle, as a polygon or as a label). We had some discussions in https://github.com/scverse/spatialdata/issues/34 and https://github.com/scverse/spatialdata/issues/99 and on Zulip, but I came to the conclusion that the easiest solution is just to duplicate the table, and having the same table annotating the regions object. In particular this solves the problem of having different indices or having more or less instances in one regions objects. Matching indices can always be recomputed on the fly using `geopandas.sjoin()` or equivalent functions for raster data that we will need to write anyway to extend `aggregate()`.

### Subset by var/obs
One of the arguments to have one single table instead than multiple tables was to support syntaxes like this one (see more here https://github.com/scverse/spatialdata/issues/43)
```python
idx = (sdata.table.obs.cell_type == "Neuron") & (sdata.obs.n_counts > 100)
sdata = sdata[idx, :]
```
But the reality is that we haven't implemented such function yet and that implementing these things can be really messy (as shown in [the helper function discussed above](https://github.com/scverse/spatialdata/blob/2bd82d73243fc1337dc15d95a110b51be84576eb/src/spatialdata/_core/query/relational_query.py#L55) to subset and reorder a table by a regions object). Furthermore the implementation would be made also complex by the fact that the `cell_type` information is not necessary found in the table, but could also be located in a column of the `GeoDataFrame` as discussed above. 

The syntax above would instead simply become something like this, which also feels very natural
```python
idx = (
    (sdata.tables['visium'].obs.cell_type == "Neuron") 
    & (sdata.tables['visium'].n_counts > 100)
)
sdata = SpatialData(
    shapes={"visium": sdata["visium"][idx]}, 
    tables={"visium": sdata.tables["visium"].loc[idx, :]},
)
```

## Other linked issues
Linked discussion issues: 
- https://github.com/scverse/spatialdata/issues/270

Issues that will be not relevant anymore with this new design:
- https://github.com/scverse/spatialdata/issues/267 reason: the new shared indexing design will improve the cross-talk between regions and the table
- https://github.com/scverse/spatialdata/issues/45 reason: we keep geometries and annotations decoupled, but thanks to the shared indexing coupling and decoupling becomes a trivial operation 
- https://github.com/scverse/spatialdata/issues/130 reason: we drop the `region`, `region_key` and `instance_key` metadata for the in-memory representation (they will be used for IO only).
- most of the checkboxes of https://github.com/scverse/spatialdata/issues/218 reason: in-memory the only relationships that are kept are by shared indexing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New multiple table in-memory design #298

Scope of the proposal

Timeline and feedback

Disk specification

In-memory representation

Code example 1: 3 visium slides, one common table.

Current approach

New approach (pseudocode)

Concatenating and splitting

Code example 2: Visum + Xenium

Further considerations

Rows annotating multiple regions

Subset by var/obs

Other linked issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

New multiple table in-memory design #298

Description

Scope of the proposal

Timeline and feedback

Disk specification

In-memory representation

Code example 1: 3 visium slides, one common table.

Current approach

New approach (pseudocode)

Concatenating and splitting

Code example 2: Visum + Xenium

Further considerations

Rows annotating multiple regions

Subset by var/obs

Other linked issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions