Large matrix errors (more than 2^31-1 non-zero entries) [on large datasets] #29

@ghost

Description

I'm working with a matrix of over 200,000 cells and 36,000 genes.

I first tried the 'RunALRA' function in Seurat. Then I extracted the expression table, converted it into a matrix, and attempted to run ALRA directly (including alra.low.memory), but encountered the following error:

"Attempting to construct a sparseMatrix with at least 2^31-1 non-zero entries."

It appears that the conversion to dgCMatrix fails because the matrix exceeds this limit on non-zero entries. Modifying the ALRA function code to use a general matrix format instead of dgCMatrix is possible, but running it is not realistic in practice because the matrix is close to 100 GB in size.
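As a rough check of the scale involved, the sketch below works through the arithmetic in Python. It assumes exactly 200,000 cells and 36,000 genes (the real counts are somewhat higher, so these are lower bounds), 8 bytes per value for a single dense copy, and the worst case where nearly every entry is non-zero after imputation. The function name `size_report` and its fields are made up for illustration; nothing here is part of ALRA or Seurat.

```python
import math

# The error fires once a sparse matrix would hold at least this many non-zeros.
MAX_NNZ = 2**31 - 1


def size_report(n_cells, n_genes, bytes_per_value=8):
    """Compare a cells x genes matrix against the non-zero entry limit.

    Assumes the worst case in which every entry is non-zero.
    """
    total_entries = n_cells * n_genes
    # Largest number of cells per block that keeps the block strictly below the limit.
    max_cells_per_block = (MAX_NNZ - 1) // n_genes
    return {
        "total_entries": total_entries,
        "exceeds_limit": total_entries >= MAX_NNZ,
        "dense_gb_one_copy": total_entries * bytes_per_value / 1e9,
        "max_cells_per_block": max_cells_per_block,
        "min_blocks": math.ceil(n_cells / max_cells_per_block),
    }


report = size_report(200_000, 36_000)
for key, value in report.items():
    print(f"{key}: {value}")
```

Under these assumptions the full matrix has 7.2 billion entries, more than three times the limit, and it would take at least four blocks of cells for each block to fit in a single sparse matrix on its own.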

If there is a function or method to address this issue with large datasets like mine, I would appreciate any suggestions. I am currently considering the following three alternatives, and I would be grateful if you could share your opinions on them as well:

Alternative A: Perform imputation on each sample separately, then integrate the results into one dataset. However, based on the experiences other users have reported in the issues here, normalizing and imputing the integrated data seems to yield more accurate results.

Alternative B: Normalize the integrated data, then perform imputation on a reduced set of genes. However, the results may show different trends than imputation performed on the full set of genes.

Alternative C (if cell type information is known): Normalize the integrated data first, then subset it by cell type, perform imputation on each cell type separately, and integrate the results again. I think this alternative has the advantage of allowing all genes to be used. It also fits the biology: certain genes may not be expressed at all, or may be expressed only in certain cell types, and I hope imputing per cell type can respect that. Since normalization is performed on the entire cell population, I believe it will still be possible to compare expression levels between cell types in the integrated data after running ALRA imputation for each cell type. If you have any suggestions for revising this reasoning, I would appreciate hearing them.
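To make the data flow of Alternative C concrete, here is a minimal Python sketch of "subset by cell type, impute each subset, put the cells back in their original order". Everything in it is hypothetical: `impute_per_group` and `toy_impute` are illustrative names, and `toy_impute` is only a stand-in that fills zeros with the group mean of the non-zero values. It is not ALRA and does not reproduce ALRA's behavior.

```python
from collections import defaultdict


def toy_impute(cells):
    """Stand-in imputation (NOT ALRA): fill zeros in each gene with the mean
    of that gene's non-zero values within the group. A gene with no non-zero
    values in the group stays zero."""
    n_genes = len(cells[0])
    fill = []
    for j in range(n_genes):
        nonzero = [cell[j] for cell in cells if cell[j] != 0]
        fill.append(sum(nonzero) / len(nonzero) if nonzero else 0.0)
    return [[v if v != 0 else fill[j] for j, v in enumerate(cell)] for cell in cells]


def impute_per_group(cells, labels, impute):
    """Impute each cell type separately and return cells in their original order."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    result = [None] * len(cells)
    for idxs in groups.values():
        imputed = impute([cells[i] for i in idxs])
        for i, cell in zip(idxs, imputed):
            result[i] = cell
    return result


# Four cells x two genes of already-normalized values; labels are cell types.
cells = [[1.0, 0.0], [0.0, 0.0], [0.0, 4.0], [3.0, 0.0]]
labels = ["T", "T", "B", "T"]
combined = impute_per_group(cells, labels, toy_impute)
print(combined)
```

In this toy example the second gene is never observed in the "T" group, so it stays at zero there after the per-group step, which is the behavior I am hoping for with genes that are specific to one cell type. Whether ALRA itself behaves comparably when run per cell type is exactly what I would like feedback on.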
