fast.ssgsea is an R package (R Core Team 2024) for
High-Performance Gene Set Enrichment Analysis (HP-GSEA). It is also
capable of performing Post-Translational Modification Signature
Enrichment Analysis (PTM-SEA) (Subramanian et al. 2005; Krug et al. 2019).
The primary function, fast_ssgsea, accepts a numeric matrix with genes
or other molecules as rows and either samples, contrasts, or some other
meaningful representation of the data as columns. A named list of gene
sets (more generally, molecular signatures) is also required. Other
arguments control the behavior of HP-GSEA/PTM-SEA, and they are
described in the function documentation.
The package also contains a read_gmt function, which reads a Gene
Matrix Transposed (GMT) file to construct a named list of gene sets for
use with fast_ssgsea.
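To make the GMT layout concrete, here is a minimal base-R sketch of the structure such a file encodes: one gene set per line, with tab-separated fields for the set name, a description, and the member genes. The helper `parse_gmt_sketch` and the toy set names are hypothetical and are not part of fast.ssgsea; the package's own read_gmt function should be preferred in practice (see its documentation for the actual arguments).

```r
# Write a tiny, made-up GMT file: one gene set per line, tab-separated
# fields (set name, description, then the member genes).
gmt_lines <- c(
  "SET_A\tdescription of set A\tgene1\tgene2\tgene3",
  "SET_B\tdescription of set B\tgene2\tgene4"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Hypothetical helper (NOT the package's read_gmt): parse the file into a
# named list of character vectors using base R only.
parse_gmt_sketch <- function(path) {
  fields <- strsplit(readLines(path), split = "\t", fixed = TRUE)
  # Drop the first two fields (name, description); keep unique genes
  sets <- lapply(fields, function(x) unique(x[-(1:2)]))
  names(sets) <- vapply(fields, function(x) x[1L], character(1L))
  sets
}

toy_sets <- parse_gmt_sketch(gmt_file)
str(toy_sets)
```

The result is a named list with one character vector of genes per set, which is the shape of the gene set argument described above.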
R version 4.0.0 or greater is required to install fast.ssgsea.
It may be possible to get fast.ssgsea to work with older versions of R
by cloning the repository, changing the minimum R version in the
DESCRIPTION (e.g., to >= 3.6.0), and rebuilding the package, but users
should do so at their own risk.
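As a quick check before installing, the running R version can be compared against the stated minimum. This is only a convenience sketch; the variable names `min_r` and `r_ok` are illustrative.

```r
# Minimum R version stated above
min_r <- "4.0.0"

# getRversion() returns the running R version as a comparable object
r_ok <- getRversion() >= min_r

if (!r_ok) {
  message("R ", getRversion(), " is older than the required ", min_r)
}
```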
A macOS binary is provided in the latest release. Users looking to build
and install the development version of fast.ssgsea must have the Xcode
developer tools from Apple and a Fortran compiler installed. See
https://mac.r-project.org/tools/ for instructions.
No Windows binary is available, so Rtools must be installed to compile C++ code.
The development version of fast.ssgsea can be installed with
# install.packages("pak")
pak::pak("pnnl/fast.ssgsea")

We will simulate a matrix with 10,000 genes as rows and one column. Then, we generate 20,000 gene sets by randomly sampling between 5 and 1,000 genes.
n_genes <- 10000L # number of genes
n_samples <- 1L # number of samples (>= 1)
genes <- paste0("gene", seq_len(n_genes))
samples <- paste0("sample", seq_len(n_samples))
## Simulate matrix of sample gene expression values
set.seed(9001L)
X <- matrix(
  data = rnorm(n = n_genes * n_samples),
  nrow = n_genes,
  ncol = n_samples,
  dimnames = list(genes, samples)
)
## Simulate list of gene sets
n_sets <- 20000L # number of gene sets
min_size <- 5L # size of smallest gene set
max_size <- 1000L # size of largest gene set
size_range <- max_size - min_size + 1L
n_reps <- ceiling(n_sets / size_range)
set_sizes <- rep(max_size:min_size, times = n_reps)[seq_len(n_sets)]
gene_sets <- lapply(seq_len(n_sets), function(i) {
  set.seed(i)
  sample(x = genes, size = set_sizes[i])
})
names(gene_sets) <- paste0("set", seq_along(gene_sets))

The code below shows the runtime of fast_ssgsea on an AMD Ryzen 5 7600X CPU with
a clock speed of 4.7 GHz. A total of 100,000 permutations were used to
calculate P-values and normalized enrichment scores (NES).
library(fast.ssgsea)
# Runtime (elapsed time)
system.time({
  res <- fast_ssgsea(
    X = X,
    gene_sets = gene_sets,
    alpha = 1,
    nperm = 1e5L,
    min_size = min_size,
    seed = 0L
  )
})
##    user  system elapsed
##   4.137   0.962   4.650
str(res)
## 'data.frame': 20000 obs. of 9 variables:
## $ sample : Factor w/ 1 level "sample1": 1 1 1 1 1 1 1 1 1 1 ...
## $ set : chr "set18791" "set16136" "set19084" "set2830" ...
## $ set_size : int 138 801 841 163 706 749 450 87 161 761 ...
## $ ES : num -1866 709 698 1584 759 ...
## $ NES : num -5.3 4.65 4.68 4.76 4.68 ...
## $ n_same_sign : int 49042 52788 52782 50951 52785 47193 47813 50722 48979 47243 ...
## $ n_as_extreme: int 1 8 8 9 11 10 13 14 18 20 ...
## $ p_value : num 4.08e-05 1.70e-04 1.71e-04 1.96e-04 2.27e-04 ...
## $ adj_p_value : num 0.739 0.739 0.739 0.739 0.739 ...
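The first rows of the str(res) output appear consistent with a permutation P-value of the form (n_as_extreme + 1) / (n_same_sign + 1). This relationship is inferred from the printed numbers only and is an assumption, not a statement of how the package computes p_value; the function documentation is the authoritative source. The sketch below recomputes the first five values from the counts shown above.

```r
# Values copied from the first five rows of the str(res) output above
n_same_sign  <- c(49042, 52788, 52782, 50951, 52785)
n_as_extreme <- c(1, 8, 8, 9, 11)
p_printed    <- c(4.08e-05, 1.70e-04, 1.71e-04, 1.96e-04, 2.27e-04)

# Assumed formula (inferred from the output, not taken from the package docs):
# add one to numerator and denominator so that P-values are never exactly zero
p_check <- (n_as_extreme + 1) / (n_same_sign + 1)

# Compare at the three significant digits that str() displays
signif(p_check, 3)
all.equal(signif(p_check, 3), p_printed)
```

Note that n_same_sign is roughly half of the 100,000 permutations in each row, which would be expected if only permutations whose score has the same sign as the observed ES enter the denominator.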
print(sessionInfo(), locale = FALSE, tzone = FALSE)
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Linux Mint 22.1
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dqrng_0.4.1 fast.ssgsea_0.1.0.9022
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 RcppArmadillo_15.0.2-2 fastmap_1.2.0
## [4] xfun_0.53 Matrix_1.7-4 lattice_0.22-7
## [7] knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [10] cli_3.6.5 grid_4.5.2 data.table_1.17.8
## [13] compiler_4.5.2 rstudioapi_0.17.1 tools_4.5.2
## [16] evaluate_1.0.5 Rcpp_1.1.0 yaml_2.3.10
## [19] rlang_1.1.6
The fast.ssgsea R package utilizes linear algebra and ideas from Fast
Gene Set Enrichment Analysis (Korotkevich et al. 2021) to greatly reduce the runtime.
Tests were performed on a desktop computer with an AMD Ryzen 5 7600X CPU
(6 cores, 12 threads) at 4.7 GHz. Different combinations of the number
of gene sets, maximum gene set size, number of permutations, and value
of the
Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N. Artyomov, and Alexey Sergushichev. 2021. “Fast Gene Set Enrichment Analysis.” bioRxiv. https://doi.org/10.1101/060012.
Krug, Karsten, Philipp Mertins, Bin Zhang, Peter Hornbeck, Rajesh Raju, Rushdy Ahmad, Matthew Szucs, et al. 2019. “A Curated Resource for Phosphosite-Specific Signature Analysis.” Molecular & Cellular Proteomics 18 (3): 576–93. https://doi.org/10.1074/mcp.TIR118.000943.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
