MutTF is a multi-omics analysis framework that combines gene expression data with mutational signatures, using non-negative matrix factorization (NMF) and correlation analysis. MutTF can discover candidate transcription factors (TFs) whose regulation of target gene (TG) expression is associated with mutational signatures.
- Variant files (.vcf)
  VCF files from whole-genome sequencing are required for this analysis. You can obtain VCF files by aligning the reads to a reference genome, marking duplicates, performing local realignment and base quality recalibration, and calling variants with variant-calling software.
- Expression file (.tsv)
  An expression file from RNA sequencing is required for this analysis. You can obtain it by aligning the reads to a reference genome or transcriptome and quantifying the read counts per gene or transcript.
- (Optional) TF-TG geneset file (.txt)
  We provide a TF-TG geneset file obtained from hTFTarget. If you wish to use a custom TF-TG geneset, format the file like this:

  | name | description | ... |
  |------|-------------|-----|
  | TF_0 | TG_0 | |
  | TF_0 | TG_1 | |
  | ... | TG_2 | |

  The "name" column contains the TF, and the "description" column contains the TGs regulated by the corresponding TF.

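As a rough illustration of the format above, the following sketch groups TGs by TF. It assumes a tab-delimited file with a header row and one TF-TG pair per line; the delimiter, the helper name `load_tf_targets`, and the example content are assumptions made for illustration and are not taken from the MutTF code.

```python
import csv
import io
from collections import defaultdict


def load_tf_targets(handle, delimiter="\t"):
    """Group target genes by TF from a table with 'name' and 'description' columns."""
    tf_to_targets = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        tf, tg = row["name"], row["description"]
        if tg not in tf_to_targets[tf]:  # skip duplicate TF-TG pairs
            tf_to_targets[tf].append(tg)
    return dict(tf_to_targets)


# Hypothetical example content; a real file would be opened with open(path).
example = io.StringIO(
    "name\tdescription\n"
    "TF_0\tTG_0\n"
    "TF_0\tTG_1\n"
    "TF_1\tTG_2\n"
)
tf_targets = load_tf_targets(example)
print(tf_targets)
```

If your file uses a different delimiter, pass it through the `delimiter` argument.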
Since VCF files used in this project require a dbGaP access request, only the RNA expression data are provided.
- Clone repository.
git clone https://github.com/BML-cbnu/MutTF
cd MutTF
- Install the Python requirements.
pip install -r requirements.txt
- Install the required R packages (tested with R 4.3.2 and Bioconductor 3.18). In the R console:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# CRAN packages
install.packages(c("data.table", "dplyr", "optparse", "ggplot2"))
# Bioconductor packages
BiocManager::install(c("GSVA", "AnnotationDbi", "org.Hs.eg.db",
                       "edgeR", "limma", "maftools", "viper"))
# Specific versions used in our analysis:
# GSVA 1.44.5
# AnnotationDbi 1.60.2
# org.Hs.eg.db 3.16.0
# data.table 1.14.8
# dplyr 1.1.1
# optparse 1.7.3
# ggplot2 3.4.1
# edgeR 3.40.2
# limma 3.54.2
# maftools 2.14.0
# viper 1.32.0

In the command line, please run the following:
- Input:
  VCF file per sample (.vcf)
- Variable:
  - [reference genome] → Reference genome to analyze (e.g. GRCh37).
  - [minimum] → Minimum number of signatures to extract.
  - [maximum] → Maximum number of signatures to extract.
  - [input directory] → Directory where the VCF files are located (e.g. input_data, or './sample_data/demo_vcfs/' for the demo files).
  - [output directory] → Directory where the output data should be stored.
  - [threads] → Number of threads to use in signature extraction.
- Description:
  SigProfilerMatrixGenerator converts the VCF files into a count matrix, and SigProfilerExtractor extracts signatures from that count matrix. The optimal number of signatures is selected and used for further analysis (refer to './[output directory]/SBS96/SBS96_selection_plot.pdf' for the best number of signatures). In this project, SBS96-based signatures (96 mutation types of single base substitutions) are used in further analysis. Refer to COSMIC Tools.
- Output:
  Directory containing the signature extraction results (./[output directory])
  The results are as shown in the tables below:

  Exposure Matrix
  (./[output directory]/SBS96/Suggested_Solution/SBS96_De-Novo_Solution/Activities/SBS96_De-Novo_Activities_refit.txt)

  | Samples | SBS96A | SBS96B | ... |
  |----------|--------|--------|-----|
  | Sample 1 | 22 | 40 | |
  | Sample 2 | 35 | 13 | |
  | ... | 16 | 32 | |

  Process Matrix
  (./[output directory]/SBS96/Suggested_Solution/SBS96_De-Novo_Solution/Signatures/SBS96_De-Novo_Signatures.txt)

  | MutationType | SBS96A | SBS96B | ... |
  |--------------|--------|--------|-----|
  | A[C>A]A | 0.024 | 0.014 | |
  | A[C>A]C | 0.012 | 0.052 | |
  | ... | 0.081 | 0.068 | |

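To make the SBS96 classification concrete, here is a minimal sketch that enumerates the 96 mutation types in the same notation as the MutationType column (for example 'A[C>A]A'): six pyrimidine-centred substitutions, each with four possible 5' bases and four possible 3' bases. The helper name `sbs96_labels` and the ordering of the labels are assumptions for illustration; the row order in the SigProfiler output files may differ.

```python
from itertools import product

BASES = "ACGT"
# The six pyrimidine-centred substitution classes used by the SBS96 scheme.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs96_labels():
    """Return the 96 labels in 5'[REF>ALT]3' notation, e.g. 'A[C>A]A'."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


labels = sbs96_labels()
print(len(labels), labels[:2])
```
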
$ python Signature_extraction.py --ref_genome=[reference genome] --minimum=[minimum] --maximum=[maximum] --input_dir=[input directory] --output_dir=[output directory] --threads=[threads]

- Input:
  VCF file per sample (.vcf)
- Variable:
  - [reference genome] → Reference genome to analyze (e.g. GRCh37).
  - [input directory] → Directory where the VCF files are located (e.g. input_data).
  - [output directory] → Directory where the output data should be stored.
  - [threads] → Number of threads to use in multiprocessing.
- Description:
  Before the contribution of each signature is calculated, gene-specific mutation counts are computed using the annotation file of the reference genome.
- Output:
  Gene count matrix per sample (./[output directory]/*_cnt.csv)
  The results are as shown in the table below:

  | | Gene 1 | Gene 2 | ... |
  |-------|--------|--------|-----|
  | ACA>A | 2 | 0 | |
  | ACC>A | 0 | 1 | |
  | ... | 1 | 1 | |

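The idea of gene-specific mutation counts can be sketched as follows. This is only an illustration under stated assumptions, not the logic of Gene_count.py: the gene coordinates, the variant tuples, and the helper name `count_by_gene` are all hypothetical, and a real annotation would be read from the reference genome's annotation file.

```python
from collections import Counter, defaultdict

# Hypothetical gene annotation: gene -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "GENE_A": ("1", 100, 200),
    "GENE_B": ("1", 300, 400),
}

# Hypothetical variants: (chromosome, position, mutation type label).
VARIANTS = [
    ("1", 150, "ACA>A"),
    ("1", 160, "ACA>A"),
    ("1", 350, "ACC>A"),
    ("1", 500, "ACC>A"),  # falls outside every gene, so it is not counted
]


def count_by_gene(variants, genes):
    """Return {gene: {mutation_type: count}} for variants located inside each gene."""
    counts = defaultdict(Counter)
    for chrom, pos, mut_type in variants:
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[gene][mut_type] += 1
    return {gene: dict(c) for gene, c in counts.items()}


gene_counts = count_by_gene(VARIANTS, GENES)
print(gene_counts)
```

The nested loop is fine for a toy example; for whole-genome data an interval index would likely be needed to keep the lookup fast.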
$ python Gene_count.py --ref_genome=[reference genome] --input_dir=[input directory] --output_dir=[output directory] --threads=[threads]

- Input:
  TF-TG geneset file (.txt), Expression file (.tsv)
- Variable:
  - [TF-TG geneset file] → TF-TG geneset file (e.g. ./hTFTarget/colon_TF-Target-information.txt)
  - [Expression file] → File name of the gene expression file
  - [GSVA output file] → File name of the GSVA output results
- Description:
  Separate the TGs of each TF into positively and negatively regulated groups based on the correlation coefficient between each TG's expression and the corresponding TF's expression. GSVA is then performed on these groups.
- Output:
  GSVA output file (./[GSVA output file].tsv)
  The results are as shown in the table below:

  | Genesets | Sample 1 | Sample 2 | ... |
  |----------|----------|----------|-----|
  | TF1_0 | 0.4 | 0.3 | |
  | TF1_1 | -0.9 | -0.1 | |
  | ... | 0.1 | -0.6 | |

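The grouping step described above can be sketched as follows. The sketch assumes a Pearson correlation and a split at zero; the README only says "correlation coefficient", so both choices, along with the helper names and the expression values, are assumptions for illustration rather than the behavior of GSVA.py.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant expression as uncorrelated
    return cov / sqrt(var_x * var_y)


def split_targets(tf_expr, tg_exprs):
    """Split TGs into positively and negatively correlated groups."""
    positive, negative = [], []
    for gene, expr in tg_exprs.items():
        r = pearson(tf_expr, expr)
        if r > 0:
            positive.append(gene)
        elif r < 0:
            negative.append(gene)
    return positive, negative


# Hypothetical expression values across four samples.
tf_expression = [1.0, 2.0, 3.0, 4.0]
target_expression = {
    "TG_0": [2.0, 4.1, 6.2, 7.9],  # rises with the TF
    "TG_1": [9.0, 7.5, 5.0, 1.0],  # falls as the TF rises
}
pos_group, neg_group = split_targets(tf_expression, target_expression)
print(pos_group, neg_group)
```

Each resulting group would then serve as one gene set for GSVA.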
$ python GSVA.py -g [TF-TG geneset file] -e [Expression file] -o [GSVA output file]

- Input:
  Signature extraction results (dir), Gene count matrix per sample (.csv), TF-TG geneset file (.txt), GSVA results (.tsv)
- Variable:
  - [Signature extraction directory] → Directory of signature extraction results (output from Signature_extraction.py)
  - [Gene count directory] → Directory with gene-wise mutation count files (output from Gene_count.py)
  - [TF-TG geneset file] → TF-TG geneset file used in GSVA.py (e.g. ./hTFTarget/colon_TF-Target-information.txt)
  - [GSVA output file] → File name of the GSVA results (output from GSVA.py)
  - [Correlation output directory] → Directory for the correlation results between signature-induced mutation counts and GSVA scores
- Description:
  Calculate each signature's contribution per sample, then analyze the correlation between the gene-specific mutation counts attributed to each signature and the GSVA scores.
- Output:
  Correlation result matrix (All and Filtered) (./[Correlation output directory]/[All/Filt]_Result_[pos/neg].csv)
  The results are as shown in the table below:

  | No. | Gene | sig | r | p |
  |-----|------|-----|---|---|
  | 0 | Gene id | signature id | correlation coefficient | p-value |

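One common way to attribute mutation counts to signatures is to weight each mutation type by the product of its probability under a signature (process matrix) and that signature's exposure in the sample, then normalize across signatures. The sketch below shows this idea under that assumption; it is not confirmed to be how MutTF.py computes contributions, and the helper names, the numbers, and the assumption that gene counts and process matrix share the same mutation type labels are all hypothetical.

```python
def signature_shares(process_row, exposures):
    """Share of one mutation type attributed to each signature in one sample.

    process_row: {signature: probability of this mutation type under the signature}
    exposures:   {signature: number of mutations assigned to the signature}
    """
    weights = {sig: process_row[sig] * exposures[sig] for sig in exposures}
    total = sum(weights.values())
    if total == 0:
        return {sig: 0.0 for sig in exposures}
    return {sig: w / total for sig, w in weights.items()}


def gene_counts_by_signature(gene_counts, process, exposures):
    """Distribute one gene's mutation counts over the signatures."""
    result = {sig: 0.0 for sig in exposures}
    for mut_type, count in gene_counts.items():
        shares = signature_shares(process[mut_type], exposures)
        for sig, share in shares.items():
            result[sig] += count * share
    return result


# Hypothetical values for one sample and one gene.
process = {
    "A[C>A]A": {"SBS96A": 0.02, "SBS96B": 0.01},
    "A[C>A]C": {"SBS96A": 0.01, "SBS96B": 0.05},
}
exposures = {"SBS96A": 100, "SBS96B": 100}
gene_counts = {"A[C>A]A": 3, "A[C>A]C": 6}

by_signature = gene_counts_by_signature(gene_counts, process, exposures)
print(by_signature)
```

Repeating this for every sample yields, per gene and signature, a vector of counts across samples that can be correlated with the GSVA scores of the same samples.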
$ python MutTF.py --ext_dir=[Signature extraction directory] --count_dir=[Gene count directory] --tf_file=[TF-TG geneset file] --gsva_folder=[GSVA output file] --corr_dir=[Correlation output directory]

Node_classification
- Input:
  Correlation results, GSVA results
- Variable:
  - [Correlation results] → File name of the correlation results (output from MutTF.py)
  - [direction] → Group on which to perform node classification (pos or neg)
  - [GSVA results] → File name of the GSVA results (output from GSVA.py)
  - [Number of signatures] → The optimal number of signatures used for analysis
  - [Output directory] → Directory for the node classification results
- Output:
  Files containing the node classification results and their visualization.
  The visualized graph figure is saved as '[Output directory]/node_figure_XXX.png'
$ python Node_classification.py --corr_dir=[Correlation results] --pos_neg=[direction] --gsva=[GSVA results] --sig_num=[Number of signatures] --out_dir=[Output directory]

Denovo_cosine
- Input:
  Process matrix (matrix P)
- Variable:
  - [Signature extraction directory] → Directory of signature extraction results
  - [reference genome] → Reference genome to analyze (e.g. GRCh37).
  - [version] → Version of the COSMIC signatures to compare against (e.g. 3.3.1)
- Output:
  Image showing cosine similarity
  A heat map shows how similar the optimal de novo signatures extracted by NMF.py are to the COSMIC signatures.
- We referred to COSMIC Signatures.

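The comparison behind the heat map is a cosine similarity between each de novo signature and each reference signature. The following minimal sketch shows that calculation on toy vectors; the signature values and the reference names `REF_1` and `REF_2` are made up for illustration and are not real COSMIC profiles, which have 96 entries each.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical signatures over three mutation types.
de_novo = {"SBS96A": [0.7, 0.2, 0.1]}
reference = {"REF_1": [0.7, 0.2, 0.1], "REF_2": [0.1, 0.2, 0.7]}

# One similarity value per (de novo, reference) pair, i.e. one heat map cell each.
similarity = {
    (d, r): cosine_similarity(dv, rv)
    for d, dv in de_novo.items()
    for r, rv in reference.items()
}
best = max(reference, key=lambda r: similarity[("SBS96A", r)])
print(best)
```

A value close to 1 means the de novo signature has nearly the same profile as the reference signature.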
$ python Denovo_cosine.py --ext_dir=[Signature extraction directory] --ref_genome=[reference genome] --version=[version]

- Jiwon You
- M.S.
- Department of Computer Engineering, Chungbuk National University, Republic of Korea.
- 71one.you@gmail.com
- Yooeun Kim
- M.S.
- Interdisciplinary Program in Bioinformatics, Seoul National University, Republic of Korea.
- ys910111@snu.ac.kr

