SignifiKANTE builds upon the arboreto software library to enable regression-based gene regulatory network inference and efficient, permutation-based empirical P-value computation for predicted regulatory links.
SignifiKANTE is installable via pip from PyPI using
pip install signifikante
or locally from this repository with
git clone git@github.com:bionetslab/SignifiKANTE.git
cd SignifiKANTE
pip install -e .
For installation with pixi, download pixi, install it, and run
git clone git@github.com:bionetslab/SignifiKANTE.git
cd SignifiKANTE
pixi install
To create a Jupyter kernel from the custom environment defined in pixi.toml/pyproject.toml (which includes ipython), run
git clone git@github.com:bionetslab/SignifiKANTE.git
cd SignifiKANTE
pixi run -e kernel install-kernel
SignifiKANTE provides efficient FDR control for regulatory links predicted by any regression-based GRN inference method. Currently, SignifiKANTE includes GRNBoost2, GENIE3, xgboost, and lasso regression for GRN inference. To integrate further regression-based GRN inference methods, please see our manual in the section below. Here, we provide a minimal working example of how to use SignifiKANTE with GRNBoost2 on a simulated dataset:
import pandas as pd
import numpy as np
from signifikante.algo import signifikante_fdr
if __name__ == "__main__":
    # Simulate expression dataset with 100 samples and 10 genes.
    expression_data = np.random.randn(100, 10)
    expression_df = pd.DataFrame(expression_data, columns=[f"Gene{i}" for i in range(10)])
    # Simulate three artificial TFs.
    tf_list = [f"Gene{i}" for i in range(3)]
    # Run SignifiKANTE's approximate FDR control.
    fdr_grn = signifikante_fdr(
        expression_data=expression_df,
        normalize_gene_expression=True,
        tf_names=tf_list,
        cluster_representative_mode="random",
        num_target_clusters=2,
        inference_mode="grnboost2",
        apply_bh_correction=True)
    print(fdr_grn)
Below is a more detailed description of the parameters of signifikante_fdr, SignifiKANTE's central function for FDR control. The two required input parameters are:
- expression_data [pd.DataFrame]: Expression matrix with genes as columns and samples as rows.
- cluster_representative_mode [str]: How to draw representatives from target gene clusters. Can be one of "random" or "medoid" for approximate P-value computation, or "all_genes" for exact (DIANE-like) P-values.
Additional parameters of SignifiKANTE's FDR control:
- normalize_gene_expression [bool]: Whether or not to apply z-score normalization to the gene columns of the input expression matrix.
- inference_mode [str]: Which GRN inference method to use under the hood. Can be one of "grnboost2", "genie3", "xgboost", and "lasso". Defaults to "grnboost2".
- num_permutations [int]: How many permutations to perform for the random background model used in empirical P-value computation. Defaults to 1000.
- tf_names [list]: List of strings representing TF names. Should be a subset of the gene names contained in expression_data. Defaults to None. If no list is given, all genes are treated as potential TFs.
- apply_bh_correction [bool]: Whether or not to additionally return Benjamini-Hochberg adjusted P-values.
- input_grn [pd.DataFrame]: Reference GRN to use for FDR control. Needs to possess columns 'TF', 'target', 'importance'. Should only be used when it is clear that this GRN was inferred using the same method indicated in inference_mode. Defaults to None. If no reference GRN is given, a new one is inferred at the beginning.
- target_subset [list]: Subset of target genes to consider for FDR control. Only compatible with "all_genes" FDR mode.
- num_target_clusters [int]: Number of target gene clusters. If set to -1, no target gene clustering will be applied. Defaults to -1.
- num_tf_clusters [int]: Experimental feature. Number of desired TF clusters. If set to -1, no TF clustering will be applied. Defaults to -1.
- target_cluster_mode [str]: Experimental feature. Indicates which clustering to use for target gene clustering. Defaults to "wasserstein".
- tf_cluster_mode [str]: Experimental feature. Indicates which clustering mode to use for TF clustering. Defaults to "correlation".
- scale_for_tf_sampling [bool]: Experimental feature. Whether or not to keep track of occurrences of edges in permuted GRNs. Defaults to False.
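As an illustration of what per-gene z-score normalization (normalize_gene_expression) means, here is a minimal sketch on a simulated matrix. It is not SignifiKANTE's internal code; the simulated data and the choice of the population standard deviation (ddof=0) are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Simulated expression matrix: 100 samples (rows) x 4 genes (columns).
rng = np.random.default_rng(0)
expression_df = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(100, 4)),
    columns=[f"Gene{i}" for i in range(4)],
)

# Z-score each gene column: subtract the column mean, divide by the column
# standard deviation. ddof=0 is an assumption made for this sketch.
col_mean = expression_df.mean(axis=0)
col_std = expression_df.std(axis=0, ddof=0)
normalized_df = (expression_df - col_mean) / col_std

# After normalization, each gene column has mean ~0 and standard deviation ~1.
print(normalized_df.mean(axis=0).round(6))
print(normalized_df.std(axis=0, ddof=0).round(6))
```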
Further, more technical parameters:
- client [str,Dask.Client]: Whether to perform computation on given input Dask Cluster object, or to create a new local one ("local"). Defaults to "local".
- early_stop_window_length [int]: Window length to use for early stopping. Defaults to 25.
- seed [int]: Random seed for regressor models. Defaults to None.
- verbose [bool]: Whether or not to print detailed additional information. Defaults to False.
- output_dir [str]: Where to save additional intermediate data to. Defaults to None, i.e. saves no intermediate results.
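To make the role of num_permutations more concrete, the following sketch shows one common way to turn a permutation background into an empirical P-value for a single edge. The function name empirical_pvalue, the simulated background scores, and the +1 pseudocount are assumptions for illustration; the exact formula used inside SignifiKANTE may differ.

```python
import numpy as np

def empirical_pvalue(observed, permuted_scores):
    """Share of permuted scores at least as large as the observed score.

    A +1 pseudocount is added to numerator and denominator so that the
    P-value is never exactly zero (an assumption of this sketch).
    """
    permuted_scores = np.asarray(permuted_scores, dtype=float)
    exceed = np.count_nonzero(permuted_scores >= observed)
    return (exceed + 1) / (permuted_scores.size + 1)

# Hypothetical importances of one edge across 1000 permuted datasets.
rng = np.random.default_rng(0)
background = rng.random(1000)

print(empirical_pvalue(0.999, background))  # observed score near the top of the background
print(empirical_pvalue(0.5, background))    # observed score in the middle of the background
```

With this formula, the smallest attainable P-value is 1 / (num_permutations + 1), so more permutations give a finer resolution at the cost of more model fits.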
The function returns a pandas dataframe representing the reference GRN with columns 'TF', 'target', and 'importance'. The column 'pvalue' stores empirical P-values per edge. If apply_bh_correction=True, an additional column 'pvalue_bh' is returned.
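A typical next step is to keep only the edges that pass a significance threshold. The sketch below works on a small hand-made DataFrame with the column names described above; the values, the 0.05 cutoff, and the helper benjamini_hochberg are hypothetical and only illustrate the standard Benjamini-Hochberg procedure, not necessarily SignifiKANTE's own implementation.

```python
import numpy as np
import pandas as pd

def benjamini_hochberg(pvalues):
    """Standard Benjamini-Hochberg adjustment of a 1D array of P-values."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Hypothetical result table with the columns described above.
grn = pd.DataFrame({
    "TF": ["Gene0", "Gene1", "Gene2", "Gene0"],
    "target": ["Gene5", "Gene6", "Gene7", "Gene8"],
    "importance": [3.2, 1.1, 0.4, 0.9],
    "pvalue": [0.01, 0.04, 0.03, 0.5],
})
grn["pvalue_bh"] = benjamini_hochberg(grn["pvalue"])

# Keep only edges whose adjusted P-value passes the (arbitrary) 0.05 cutoff.
significant = grn[grn["pvalue_bh"] <= 0.05]
print(significant)
```

When apply_bh_correction=True, the 'pvalue_bh' column is already part of the returned DataFrame, so only the final filtering line is needed.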
In order to integrate new regression-based GRN inference methods into SignifiKANTE, simply use the following steps, which exemplify the integration of lasso regression as implemented in the GRENADINE package:
- Give your regression-based method an abbreviated string-based name (regressor_type) and name the variable storing its model-specific parameters (regressor_args), then add those to the existing accepted values of the inference_mode parameter within the function signifikante_fdr in the file algo.py, directly below the indicated line stating UPDATE FOR NEW GRN METHOD. In the case of lasso regression, we simply added the regressor type "LASSO" and the regressor parameters stored in LASSO_KWARGS in the respective code block:
# UPDATE FOR NEW GRN METHOD
if inference_mode == "grnboost2":
    regressor_type = "GBM"
    regressor_args = SGBM_KWARGS
# other existing methods...
elif inference_mode == "lasso":
    regressor_type = "LASSO"
    regressor_args = LASSO_KWARGS
Since the actual parameters of LASSO_KWARGS are defined in another file, you need to import the variable into algo.py. To achieve this, add your new regressor's arguments variable to the import at the top of algo.py, directly below the indicated line stating UPDATE FOR NEW GRN METHOD, just like this:
# UPDATE FOR NEW GRN METHOD
from signifikante.core import (
    create_graph, SGBM_KWARGS, RF_KWARGS, EARLY_STOP_WINDOW_LENGTH, ET_KWARGS, XGB_KWARGS, LASSO_KWARGS
)
- Now we switch to the file core.py. At the top of the file, add any import statements your regression needs to work (e.g. imports from sklearn). Below the import statements, create a dictionary named exactly like the regressor's arguments variable you imported in algo.py. You can include it directly below the line stating # UPDATE FOR NEW GRN METHOD, analogously to how we did it for the lasso regression:
from sklearn.linear_model import Lasso
# ... other code in between
LASSO_KWARGS = {
    'alpha': 0.01
}
The actual logic of your new regression-based inference method will be implemented in the function fit_model. There, you should implement a new local function that contains the logic of your new model, given a tf_matrix and a target_gene_expression vector, as we did for lasso regression:
def do_lasso_regression():
    regressor = Lasso(**regressor_kwargs, random_state=seed)
    regressor.fit(tf_matrix, target_gene_expression)
    return regressor
Directly below, add another case distinction for your regressor_type which calls your locally defined function. The exact position is indicated by the line stating # UPDATE FOR NEW GRN METHOD:
# UPDATE FOR NEW GRN METHOD
if is_sklearn_regressor(regressor_type):
    return do_sklearn_regression()
# other methods...
elif is_lasso_regressor(regressor_type):
    return do_lasso_regression()
Finally, in the function to_feature_importances, you have to implement the extraction of feature importances or model coefficients from your trained_regressor, which are supposed to represent edge weights in the inferred GRN. To accomplish that, add another case for your new regressor in the case distinction below the line stating # UPDATE FOR NEW GRN METHOD. For lasso regression this looks like:
# UPDATE FOR NEW GRN METHOD
if is_oob_heuristic_supported(regressor_type, regressor_kwargs):
    # other code...
elif regressor_type.upper() == "LASSO":
    scores = np.abs(trained_regressor.coef_)
    return scores
Done! You have successfully added your new regression method for GRN inference.
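To see the lasso steps in isolation, the following standalone sketch fits a lasso model for a single target gene and converts the absolute coefficients into edge weights. It does not call SignifiKANTE; the simulated data, the target name "Gene3", and the fixed random_state are assumptions for illustration, while alpha=0.01 mirrors the LASSO_KWARGS value shown above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Simulated TF expression: 100 samples x 3 TFs (hypothetical data).
rng = np.random.default_rng(42)
tf_names = ["Gene0", "Gene1", "Gene2"]
tf_matrix = rng.standard_normal((100, 3))

# Hypothetical target gene driven mainly by Gene0, plus a little noise.
target_gene_expression = 2.0 * tf_matrix[:, 0] + 0.1 * rng.standard_normal(100)

# Fit the regressor, analogous to the local function do_lasso_regression.
regressor = Lasso(alpha=0.01, random_state=0)
regressor.fit(tf_matrix, target_gene_expression)

# Absolute coefficients serve as edge weights, as in to_feature_importances.
scores = np.abs(regressor.coef_)

# Assemble the edges for this one target in the 'TF', 'target', 'importance' layout.
edges = pd.DataFrame({"TF": tf_names, "target": "Gene3", "importance": scores})
print(edges.sort_values("importance", ascending=False))
```

Taking absolute values means that activating and repressing links are ranked on the same scale; the sign of the coefficient is discarded.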
Unit tests for arboreto-based functionalities, as well as additional tests for SignifiKANTE's FDR control functionality and a comparison of our efficiently parallelized Wasserstein-distance computation against SciPy, can be found under tests/. The tests are based on Python's unittest module and can be run all together from the repository's root directory with
python -m unittest discover -s tests -v