Celligner attributes to UMAP results driven almost entirely by PCA.

Hi, there.

I'm trying to explain to other scientists what I learned about cancer thanks to UMAP. That includes insights from Celligner, which I recognized despite the unnecessary compression because of the nominal resemblance of the underlying map to the unbelievably informative representation that UMAP produces when it is given all the actual data instead of principal components.

Your article claims to run UMAP, as do the print statement and comments in your code, but, as with Seurat and ScanPy, "UMAP" turns out to mean that PCA is used to reduce the data **linearly** from thousands of dimensions to 70, and then, UMAP embeds those 70 principal components non-linearly in two dimensions. As with Seurat and ScanPy, it's impossible to understand the sequence of calls charitably, but unlike Seurat and ScanPy, it makes your own work appear to be less informative than it is.

I don't know how to categorize this request, except as an invitation to correct the false claims in your article and code.

<pre><code>
├── celligner
│   ├── __init__.py
│   ├── limma.py
│   └── params.py
├── celligner_output.ipynb
├── install_submodules_and_run.sh
├── R
│   ├── Celligner_helpers.R
│   ├── Celligner_methods.R
│   ├── global_params.R
│   ├── install_packages.R
│   ├── mutlidataset_alignment.R
│   └── README.md
├── README.md
├── run_celligner_multi_dataset.py
├── run_celligner.py
</code></pre>

In `run_celligner.py`, we find:

 ```
 # Compute UMAP, clusters and tumor - model distance
...
 my_celligner.computeMetricsForOutput(model_ids=model_ids, tumor_ids=tumor_ids)
```

That function turns out to be in `celligner/__init__.py`, which is unusual in itself. What we first see is:

<pre><code>
class Celligner(object):
 def __init__(
 self,
 topKGenes=TOP_K_GENES,
 pca_ncomp=PCA_NCOMP,
 cpca_ncomp=CPCA_NCOMP,
 louvain_kwargs=LOUVAIN_PARAMS,
 mnn_kwargs=MNN_PARAMS,
 umap_kwargs=UMAP_PARAMS,
 mnn_method="mnn_marioni",
 low_mem=False,
 ):
 """
 Initialize Celligner object

 Args:
 topKGenes (int, optional): see params.py. Defaults to 1000.
 pca_ncomp (int, optional): see params.py. Defaults to 70.
 cpca_ncomp (int, optional): see params.py. Defaults to 4.
 louvain_kwargs (dict, optional): see params.py
 mnn_kwargs (dict, optional): see params.py 
 umap_kwargs (dict, optional): see params.py
 mnn_method (str, optional): Only default "mnn_marioni" supported right now.
 low_mem (bool, optional): adviced if you have less than 32Gb of RAM. Defaults to False.
 """
</code></pre>

In `params.py`, we find:

```
UMAP_PARAMS = {
 "n_neighbors": 10, # num nearest neighbors used to create UMAP plot
 "n_components": 2, 
 "metric": "euclidean", # distance metric used for the UMAP projection
 "min_dist": 0.5 # min distance used to create UMAP plot
} 
```

Upon returning to `__init__.py` and scrolling down several hundred lines, we eventually find the only rational explanation for all the hopping:

<pre><code>
 def computeMetricsForOutput(self, umap_rand_seed=14, UMAP_only=False, model_ids=None, tumor_ids=None):
 """
 Compute UMAP embedding and optionally clusters and tumor - model distance.
 
 Args:
 UMAP_only (bool, optional): Only recompute the UMAP. Defaults to False.
 umap_rand_seed (int, optional): Set seed for UMAP, to try an alternative. Defaults to 14.
 model_ids (list, optional): model IDs for computing tumor-CL distance. Defaults to None, in which case the reference index is used.
 tumor_ids (list, optional): tumor IDs for computing tumor-CL distance. Defaults to None, in which case the target index is used.
 
 Raises:
 ValueError: if there is no corrected expression matrix
 """
 if self.combined_output is None:
 raise ValueError("No corrected expression matrix found, run this function after transform()")

 
 print("Computing UMAP embedding...")
 # Compute UMAP embedding for results
 pca = PCA(self.pca_ncomp)
 pcs = pca.fit_transform(self.combined_output)
 
 umap_reduced = umap.UMAP(**self.umap_kwargs, random_state=umap_rand_seed).fit_transform(pcs)
 self.umap_reduced = pd.DataFrame(umap_reduced, index=self.combined_output.index, columns=['umap1','umap2'])
</code></pre>

If we revisit that first call, `computeMetricsForOutput` is an odd name for a function that always generates an embedding but sometimes calculates some distances. 

```
 my_celligner.computeMetricsForOutput(model_ids=model_ids, tumor_ids=tumor_ids)
```

But really, this was already suggested in `params.py`:

<pre><code>
# number of PCs to use for dimensionality reduction
PCA_NCOMP = 70 

# number of cPCA dimensions to regress out of the data
CPCA_NCOMP = 4

# @see https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.louvain.html
LOUVAIN_PARAMS = {
 "resolution": 5, # resolution parameter used for clustering the data
}

# For Mariona method (default)
MNN_PARAMS = {
 "k1": 5, # number of nearest neighbors of tumors in the cell line data
 "k2": 50, # number of nearest neighbors of cell lines in the tumor data
 "cosine_norm": False,
 "fk": 5 
}

UMAP_PARAMS = {
 "n_neighbors": 10, # num nearest neighbors used to create UMAP plot
 "n_components": 2, 
 "metric": "euclidean", # distance metric used for the UMAP projection
 "min_dist": 0.5 # min distance used to create UMAP plot
}
</pre></code>

A UMAP plot is as real as a PCA plot - there's no such thing, except that the term devalues UMAP. The visualization is a scatter plot.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Celligner attributes to UMAP results driven almost entirely by PCA. #12

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Celligner attributes to UMAP results driven almost entirely by PCA. #12

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions