The GeneSpectra module performs gene classification using scRNA-seq data.
Analysis steps provided in this package:
- Reduce sparsity by creating metacells or pseudobulking
- Normalise data and filter low-count genes
- Parallelised gene classification by specificity and distribution
- Visualisation of gene classification results
- Compare ortholog classes between species and generate the gene class conservation heatmap
Note that the gene classes are adapted from the Human Protein Atlas classifications by Karlsson, M. et al.
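The normalisation and filtering step listed above can be sketched roughly as follows. This is a minimal illustration, not GeneSpectra's own code: the function name `filter_and_normalise`, the `min_total` threshold and the counts-per-million scaling are all assumptions chosen for the example.

```python
import numpy as np

def filter_and_normalise(counts, min_total=10, scale=1e6):
    """Drop low-count genes and scale each pool to a common library size.

    counts: (n_pools, n_genes) raw counts, one row per metacell or cell pool.
    min_total and scale are illustrative defaults, not GeneSpectra settings.
    Assumes every pool has at least one count.
    """
    counts = np.asarray(counts, dtype=float)
    # Library size per pool, computed on all genes before filtering
    library_size = counts.sum(axis=1, keepdims=True)
    # Keep genes whose total count across pools reaches the threshold
    keep = counts.sum(axis=0) >= min_total
    normalised = counts[:, keep] / library_size * scale
    return keep, normalised

raw = np.array([[90, 0, 10],
                [40, 1, 59]])
keep, normalised = filter_and_normalise(raw)
```

Here the second gene has only one count in total, so it is removed before classification.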
First pull source code from the repository:
git clone https://github.com/Papatheodorou-Group/GeneSpectra.git
cd GeneSpectra
Pixi is used for dependency management.
First install pixi. Then, run this command in the GeneSpectra/ directory to install project dependencies:
pixi install -a
Note that the core gene classification code in GeneSpectra works in a very basic Python environment and can be adapted to other setups.
Installation should take about 5-10 minutes, mostly for conda to download packages.
Wrapper and helper functions that use the metacells package to create metacells from scRNA-seq data. Following the official metacells workflow is also recommended for creating the most tailored metacells anndata object (use the iterative vignette for brand-new data), as it gives more freedom to adjust the various parameters. Alternatively, when the dataset is unsuitable for metacell calculation, merge cells with the same annotation label to create cell pools.
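The cell-pool alternative amounts to summing raw counts over cells that share an annotation label. The sketch below shows the idea on a plain NumPy array; `sum_cell_pools` is a hypothetical helper written for this example and is not the function GeneSpectra exposes.

```python
import numpy as np

def sum_cell_pools(counts, labels):
    """Sum raw counts over all cells sharing an annotation label.

    counts: (n_cells, n_genes) array of raw counts.
    labels: one annotation label per cell.
    Returns (pool_names, pooled), where pooled is (n_pools, n_genes).
    """
    counts = np.asarray(counts)
    labels = np.asarray(labels)
    # pool_index[i] is the row of the pool that cell i belongs to
    pool_names, pool_index = np.unique(labels, return_inverse=True)
    pooled = np.zeros((len(pool_names), counts.shape[1]), dtype=counts.dtype)
    # Accumulate each cell's counts into its pool's row
    np.add.at(pooled, pool_index, counts)
    return pool_names, pooled

counts = np.array([[1, 0, 3],
                   [0, 2, 1],
                   [4, 0, 0],
                   [1, 1, 1]])
labels = ["T cell", "B cell", "T cell", "B cell"]
pool_names, pooled = sum_cell_pools(counts, labels)
```

Pools come back in sorted label order, so row 0 is "B cell" and row 1 is "T cell" in this toy input.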
Core module to perform gene filtering, normalisation, and gene specificity and distribution classification. Uses multi-processing to parallelise the processing of genes. Plotting functions for the gene class conservation heatmap are also included.
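To make the two kinds of class concrete, here is a simplified sketch loosely following the Human Protein Atlas scheme. The class labels, the detection cutoff of 1, the four-fold rule and the group size limit are assumptions for illustration; GeneSpectra's modified classes and its actual function names may differ. `classify_gene` and `classify_all` are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import mean

def classify_gene(expr, detect=1.0, fold=4.0, max_group=5):
    """Return (specificity, distribution) for one gene.

    expr: normalised expression of the gene in each cell pool.
    All thresholds are illustrative assumptions.
    """
    n = len(expr)
    n_detected = sum(1 for x in expr if x >= detect)
    if n_detected == 0:
        return "not detected", "not detected"

    # Distribution class: in how many pools is the gene detected?
    if n_detected == 1:
        distribution = "detected in single"
    elif n_detected == n:
        distribution = "detected in all"
    elif n_detected < n / 3:
        distribution = "detected in some"
    else:
        distribution = "detected in many"

    # Specificity class: how far do the top pools stand out from the rest?
    ranked = sorted(expr, reverse=True)
    if n < 2:
        return "low specificity", distribution
    if ranked[0] >= fold * ranked[1]:
        return "enriched", distribution
    for k in range(2, min(max_group, n - 1) + 1):
        # Mean of the top k pools versus the best pool outside the group
        if mean(ranked[:k]) >= fold * ranked[k]:
            return "group enriched", distribution
    if ranked[0] >= fold * mean(ranked[1:]):
        return "enhanced", distribution
    return "low specificity", distribution

def classify_all(matrix, workers=1):
    """Classify every gene; rows of matrix are per-gene expression lists."""
    if workers <= 1:
        return [classify_gene(row) for row in matrix]
    # Genes are independent, so they can be split across processes
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify_gene, matrix, chunksize=256))

genes = [
    [40, 2, 1, 0.5, 0, 3],
    [20, 18, 1, 0, 2, 1],
    [5, 5, 5, 5],
    [0, 0.2, 0],
]
results = classify_all(genes)
```

Because each gene is classified independently, the work splits cleanly across processes. When calling `classify_all` with `workers` greater than 1 from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.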
Cross-species comparison of gene classes and plotting, using Ensembl or eggNOG homology.
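The data behind a class conservation heatmap is essentially a cross-tabulation of the classes of orthologous gene pairs. The sketch below assumes one-to-one ortholog pairs and uses made-up gene IDs; the function names are hypothetical and the plotting itself is left out.

```python
from collections import Counter

def class_conservation_table(classes_a, classes_b, ortholog_pairs):
    """Count ortholog pairs by (class in species A, class in species B).

    classes_a, classes_b: dicts mapping gene ID to its class.
    ortholog_pairs: iterable of (gene_a, gene_b) one-to-one orthologs.
    Pairs with an unclassified member are skipped.
    """
    counts = Counter()
    for gene_a, gene_b in ortholog_pairs:
        if gene_a in classes_a and gene_b in classes_b:
            counts[(classes_a[gene_a], classes_b[gene_b])] += 1
    return counts

def row_proportions(counts):
    """Convert counts to proportions within each species A class."""
    row_totals = Counter()
    for (class_a, _), n in counts.items():
        row_totals[class_a] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}

# Toy inputs with invented gene IDs
classes_a = {"a1": "enriched", "a2": "enriched",
             "a3": "low specificity", "a4": "enhanced"}
classes_b = {"b1": "enriched", "b2": "enhanced", "b3": "low specificity"}
pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")]

table = class_conservation_table(classes_a, classes_b, pairs)
proportions = row_proportions(table)
```

Row-normalising makes classes of very different sizes comparable: each heatmap row then shows where the orthologs of one species A class end up in species B.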
A comprehensive running example of gene classification is provided in run_classification_sum_cell_pools.py:
python run_classification_sum_cell_pools.py
The output is a large table containing the specificity and distribution classes, and the GO annotations, of all genes in the anndata object. Cross-species orthology-mapped results and figures are also produced if that comparison is run.
Depending on the dataset size and whether parallelisation is used, the running time is estimated at 10 to 60 minutes.
The gene classification results for the three species datasets analysed in the preprint are publicly available at Zenodo.
Scripts and notebooks to recreate the analysis in the paper are available at GeneSpectra_reproducibility.
Developer/maintainer: Yuyao Song, ysong@ebi.ac.uk