This repository provides the code for the 18 transformation methods evaluated in the study: A Comprehensive Benchmarking and Practical Guide to Transformation Methods for Spatial Transcriptomics and Downstream Analyses. The methods are designed to be easily called within any spatial transcriptomics analysis pipeline.
Spatially resolved transcriptomics (SRT) localizes gene expression to specific regions of tissue, aiding the investigation of spatially dependent biological phenomena. Because of the advantages of SRT over other transcriptomics technologies, several computational methods have been designed to analyze spatial transcriptomics data and extract biologically relevant spatial information. Despite the diversity of these methods, pipelines typically begin with preprocessing of the raw expression data. Preprocessing corrects for the technical noise introduced by the spatial transcriptomics platform, which often obscures underlying biological signals.
| Name | Category | Function | Description |
|---|---|---|---|
| y/s | Size-Factor-Based | size | Adjusts gene counts by the library size factor for each spatial location. |
| CPM | Size-Factor-Based | cpm | Adjusts gene counts by the counts per million (CPM) library size factor for each spatial location. |
| scanpy Zheng | Size-Factor-Based | zheng | Adjusts gene counts using a size normalization, a logarithmic shift, and z normalization for each gene. |
| TMM | Size-Factor-Based | tmm | Estimates scale factors using log-fold changes between each location and a reference, excluding genes with extreme expression. |
| DESeq2 | Size-Factor-Based | deseq2 | Computes scale factors by comparing each gene’s expression relative to a pseudo-reference sample. |
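| log(y/s + 1) | Delta-Method-Based | shifted_log | Log-transforms size-factor-adjusted counts with a pseudo-count of 1 to stabilize the variance across genes. |
| log(CPM + 1) | Delta-Method-Based | cpm_shifted_log | Log-transforms CPM-adjusted counts with a pseudo-count of 1 to stabilize the variance across genes. |
| log(y/s + 1)/u | Delta-Method-Based | shifted_log_size | Stabilizes the variance across genes. |
| acosh(2αy/s + 1) | Delta-Method-Based | acosh | Applies the inverse hyperbolic cosine (acosh) to size-factor-adjusted counts to stabilize the variance across genes. |
| log(y/s + 1/(4α)) | Delta-Method-Based | pseudo_shifted_log | Log-transforms size-factor-adjusted counts with a pseudo-count of 1/(4α) to stabilize the variance across genes. |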
| Analytic Pearson (no clip) | Model-Based | analytic_pearson_noclip | Assumes gene counts fit a negative binomial (NB) distribution, and adjusts them using a Pearson residual. |
| Analytic Pearson (clip) | Model-Based | analytic_pearson_clip | Assumes gene counts fit a negative binomial (NB) distribution, and adjusts them using a Pearson residual, with an additional clipping step. |
| scanpy Pearson Residual | Model-Based | sc_pearson | Assumes gene counts fit a negative binomial (NB) distribution, and adjusts them using a Pearson residual. |
| Normalisr | Model-Based | normalisr | Applies Bayesian inference to model expression variance and to correct for confounding factors. |
| PsiNorm | Model-Based | psinorm | Assumes a Pareto distribution and rescales each gene’s count using a closed-form estimator of global expression based on Zipf’s Law. |
| SCTransform | Model-Based | sctransform | Assumes gene counts fit a negative binomial (NB) distribution, and adjusts them using a generalized linear model (GLM) to account for library size variation. |
| Dino | Model-Based | dino | Assumes gene counts fit a mixed negative binomial (NB) distribution, and adjusts them using a generalized linear model (GLM) to account for library size variation. |
| SpaNorm | Spatially Aware | spanorm | Assumes gene counts fit a negative binomial (NB) distribution, and adjusts them using a generalized linear model (GLM) to account for library size variation and spatial gradients. |
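
To make the size-factor and delta-method formulas in the table concrete, here is a minimal NumPy sketch. It is not spTransKit's implementation: the function names are hypothetical, the size factor s is assumed to be each location's total count divided by the mean total across locations, and α is treated as a fixed illustrative constant (0.05) rather than a value estimated from the data. It also assumes every location has a nonzero total count, as would be the case after filtering.

```python
import numpy as np


def size_factors(counts):
    """Per-location size factors s (ASSUMED convention: total / mean total)."""
    totals = counts.sum(axis=1, keepdims=True).astype(float)
    return totals / totals.mean()


def size_normalize(counts):
    """y/s: divide each location's counts by its size factor."""
    return counts / size_factors(counts)


def cpm(counts):
    """Counts per million: scale each location so its counts sum to 1e6."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * 1e6


def shifted_log(counts):
    """log(y/s + 1)."""
    return np.log(size_normalize(counts) + 1.0)


def acosh_transform(counts, alpha=0.05):
    """acosh(2*alpha*y/s + 1); alpha is an illustrative constant here."""
    return np.arccosh(2.0 * alpha * size_normalize(counts) + 1.0)


def pseudo_shifted_log(counts, alpha=0.05):
    """log(y/s + 1/(4*alpha)); alpha is an illustrative constant here."""
    return np.log(size_normalize(counts) + 1.0 / (4.0 * alpha))


# Toy N x G count matrix: 3 spatial locations (rows) by 3 genes (columns).
counts = np.array([[10, 0, 5],
                   [20, 4, 6],
                   [1, 2, 12]])

print(size_factors(counts).ravel())
print(shifted_log(counts))
```

Note that a zero count maps to 0 under log(y/s + 1) and acosh(2αy/s + 1), but to log(1/(4α)) under log(y/s + 1/(4α)).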
This toolkit can be integrated into any spatial transcriptomics pipeline by importing the Python module. spTransKit requires Python >= 3.10.0 and R >= 4.5.0. To install the other required packages, download the "requirements.txt" file from the spTransKit GitHub page and run the following command:
pip3 install -r /LOCAL/PATH/TO/requirements.txt
Then, to install the spTransKit package, run the command:
pip3 install sptranskit
Import the transformations module using the following line of code:
import sptranskit as sp
Each transformation takes an AnnData object (data), which stores both gene expression and spatial information. Gene expression is formatted as an N x G matrix (N spatial locations by G genes) and stored in data.X. Spatial coordinates are formatted as an N x 2 matrix and stored in data.obsm["spatial"]. The spTransKit functions check that the spatial information is stored correctly.
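
The layout described above can be verified before calling any transformation. The sketch below is a hypothetical helper (`check_layout` is not part of spTransKit, and it does not reproduce the package's own check). It only reads `data.X.shape` and `data.obsm["spatial"]`, so a lightweight stand-in object is used here to keep the example runnable with NumPy alone; the same attributes would be read from a real AnnData object.

```python
import numpy as np
from types import SimpleNamespace


def check_layout(data):
    """Return (N, G) if data.X is N x G and data.obsm["spatial"] is N x 2.

    Hypothetical helper for illustration; raises ValueError otherwise.
    """
    if "spatial" not in data.obsm:
        raise ValueError('data.obsm["spatial"] is missing')
    n_locations, n_genes = data.X.shape
    coords = np.asarray(data.obsm["spatial"])
    if coords.shape != (n_locations, 2):
        raise ValueError(
            f"expected spatial shape ({n_locations}, 2), got {coords.shape}"
        )
    return n_locations, n_genes


# Stand-in for an AnnData object: 4 spatial locations, 3 genes.
toy = SimpleNamespace(
    X=np.zeros((4, 3)),
    obsm={"spatial": np.zeros((4, 2))},
)
print(check_layout(toy))
```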
The example below reads a sample dataset (DLPFC 151673), filters out low-quality genes and spatial locations, and then transforms the gene count matrix using the log(y/s + 1) transformation.
# Obtain the gene counts and spatial information for the DLPFC 151673 dataset
data = sp.helpers.get_unfiltered_dlpfc_data("151673")
# Filter the dataset
data = sp.filter.filter_counts(data)
# Transform the gene count matrix
data = sp.transformations.shifted_log(data)
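
This README does not state which criteria `filter_counts` applies. Purely as an illustration of what filtering low-quality locations and genes can look like, the NumPy sketch below drops locations with too few total counts and genes detected in too few locations. The function name, thresholds, and order of operations are all assumptions and may differ from spTransKit's behavior.

```python
import numpy as np


def filter_counts_sketch(counts, coords,
                         min_counts_per_location=1,
                         min_locations_per_gene=1):
    """Illustrative filter (NOT spTransKit's filter_counts).

    counts: N x G count matrix; coords: N x 2 spatial coordinates.
    Returns the filtered counts, the matching coordinates, and the gene mask.
    """
    # Keep locations whose total count reaches the (assumed) threshold.
    keep_locations = counts.sum(axis=1) >= min_counts_per_location
    counts = counts[keep_locations]
    coords = coords[keep_locations]

    # Keep genes detected (count > 0) in enough of the remaining locations.
    keep_genes = (counts > 0).sum(axis=0) >= min_locations_per_gene
    return counts[:, keep_genes], coords, keep_genes


# Toy data: the first location is empty and the second gene is never detected.
toy_counts = np.array([[0, 0, 0],
                       [3, 0, 1],
                       [2, 0, 0]])
toy_coords = np.array([[0.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 0.0]])

filtered, kept_coords, gene_mask = filter_counts_sketch(toy_counts, toy_coords)
print(filtered.shape, kept_coords.shape, gene_mask)
```

Filtering locations and coordinates with the same mask keeps the rows of the count matrix aligned with the rows of the spatial matrix.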
The steps for using spTransKit to preprocess new data are the same as in the example above. First, ensure that your data are saved in an ".h5ad" file, with the gene expression and spatial information stored in the appropriate locations in the AnnData object, as outlined in the Input Data Format section. Then, use the scanpy read_h5ad function to read in the data as follows:
import scanpy
# Obtain gene counts and spatial information for the new dataset
data = scanpy.read_h5ad("/LOCAL/PATH/TO/data.h5ad")
Once the data are read in, use the spTransKit functions to filter and transform the dataset. The following example, once again, uses the log(y/s + 1) transformation.
# Filter the dataset
data = sp.filter.filter_counts(data)
# Transform the gene count matrix
data = sp.transformations.shifted_log(data)
