Skip to content

Celligner attributes to UMAP results driven almost entirely by PCA. #12

@a-dna-n

Description

@a-dna-n

Hi, there.

I'm trying to explain to other scientists what I learned about cancer thanks to UMAP. That includes insights from Celligner, which I recognized despite the unnecessary compression because of the nominal resemblance of the underlying map to the unbelievably informative representation that UMAP produces when it is given all the actual data instead of principal components.

Your article claims to run UMAP, as do the print statement and comments in your code, but, as with Seurat and ScanPy, "UMAP" turns out to mean that PCA is used to reduce the data linearly from thousands of dimensions to 70, and then, UMAP embeds those 70 principal components non-linearly in two dimensions. As with Seurat and ScanPy, it's impossible to understand the sequence of calls charitably, but unlike Seurat and ScanPy, it makes your own work appear to be less informative than it is.

I don't know how to categorize this request, except as an invitation to correct the false claims in your article and code.


├── celligner
│   ├── __init__.py
│   ├── limma.py
│   └── params.py
├── celligner_output.ipynb
├── install_submodules_and_run.sh
├── R
│   ├── Celligner_helpers.R
│   ├── Celligner_methods.R
│   ├── global_params.R
│   ├── install_packages.R
│   ├── mutlidataset_alignment.R
│   └── README.md
├── README.md
├── run_celligner_multi_dataset.py
├── run_celligner.py

In run_celligner.py, we find:

  # Compute UMAP, clusters and tumor - model distance
...
my_celligner.computeMetricsForOutput(model_ids=model_ids, tumor_ids=tumor_ids)

That function turns out to be in celligner/__init__.py, which is unusual in itself. What we first see is:


class Celligner(object):
    def __init__(
        self,
        topKGenes=TOP_K_GENES,
        pca_ncomp=PCA_NCOMP,
        cpca_ncomp=CPCA_NCOMP,
        louvain_kwargs=LOUVAIN_PARAMS,
        mnn_kwargs=MNN_PARAMS,
        umap_kwargs=UMAP_PARAMS,
        mnn_method="mnn_marioni",
        low_mem=False,
    ):
        """
        Initialize Celligner object

        Args:
            topKGenes (int, optional): see params.py. Defaults to 1000.
            pca_ncomp (int, optional): see params.py. Defaults to 70.
            cpca_ncomp (int, optional): see params.py. Defaults to 4.
            louvain_kwargs (dict, optional): see params.py
            mnn_kwargs (dict, optional): see params.py 
            umap_kwargs (dict, optional): see params.py
            mnn_method (str, optional): Only default "mnn_marioni" supported right now.
            low_mem (bool, optional): adviced if you have less than 32Gb of RAM. Defaults to False.
        """

In params.py, we find:

UMAP_PARAMS = {
    "n_neighbors": 10, # num nearest neighbors used to create UMAP plot
    "n_components": 2, 
    "metric": "euclidean", # distance metric used for the UMAP projection
    "min_dist": 0.5 # min distance used to create UMAP plot
}   

Upon returning to __init__.py and scrolling down several hundred lines, we eventually find the only rational explanation for all the hopping:


    def computeMetricsForOutput(self, umap_rand_seed=14, UMAP_only=False, model_ids=None, tumor_ids=None):
        """
        Compute UMAP embedding and optionally clusters and tumor - model distance.
        
        Args:
            UMAP_only (bool, optional): Only recompute the UMAP. Defaults to False.
            umap_rand_seed (int, optional): Set seed for UMAP, to try an alternative. Defaults to 14.
            model_ids (list, optional): model IDs for computing tumor-CL distance. Defaults to None, in which case the reference index is used.
            tumor_ids (list, optional): tumor IDs for computing tumor-CL distance. Defaults to None, in which case the target index is used.
        
        Raises:
            ValueError: if there is no corrected expression matrix
        """
        if self.combined_output is None:
            raise ValueError("No corrected expression matrix found, run this function after transform()")

        
        print("Computing UMAP embedding...")
        # Compute UMAP embedding for results
        pca = PCA(self.pca_ncomp)
        pcs = pca.fit_transform(self.combined_output)
        
        umap_reduced = umap.UMAP(**self.umap_kwargs, random_state=umap_rand_seed).fit_transform(pcs)
        self.umap_reduced = pd.DataFrame(umap_reduced, index=self.combined_output.index, columns=['umap1','umap2'])

If we revisit that first call, computeMetricsForOutput is an odd name for a function that always generates an embedding but sometimes calculates some distances.

 my_celligner.computeMetricsForOutput(model_ids=model_ids, tumor_ids=tumor_ids)

But really, this was already suggested in params.py:


# number of PCs to use for dimensionality reduction
PCA_NCOMP = 70 

# number of cPCA dimensions to regress out of the data
CPCA_NCOMP = 4

# @see https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.louvain.html
LOUVAIN_PARAMS = {
    "resolution": 5, # resolution parameter used for clustering the data
}

# For Mariona method (default)
MNN_PARAMS = {
    "k1": 5, # number of nearest neighbors of tumors in the cell line data
    "k2": 50, # number of nearest neighbors of cell lines in the tumor data
    "cosine_norm": False,
    "fk": 5 
}

UMAP_PARAMS = {
    "n_neighbors": 10, # num nearest neighbors used to create UMAP plot
    "n_components": 2, 
    "metric": "euclidean", # distance metric used for the UMAP projection
    "min_dist": 0.5 # min distance used to create UMAP plot
}

A UMAP plot is as real as a PCA plot - there's no such thing, except that the term devalues UMAP. The visualization is a scatter plot.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions