Mutational Signature Analysis tool

MutTF is a multi-omics analysis framework that combines gene expression data with mutational signatures, using non-negative matrix factorization (NMF) and correlation analysis. MutTF can discover candidate transcription factors (TFs) whose regulation of target gene (TG) expression is associated with mutational signatures.

Input Data

Variant files (.vcf)
VCF files from whole genome sequencing are required for this analysis. They are obtained by aligning the reads to a reference genome, marking duplicates, performing local realignment and base quality recalibration, and calling variants with variant-calling software.
Expression file (.tsv)
An expression file from RNA sequencing is required for this analysis. It is obtained by aligning the reads to a reference genome or transcriptome and quantifying the read counts per gene or transcript.
(Optional) TF-TG geneset file (.txt)
We provide a TF-TG geneset file obtained from hTFTarget. If you wish to use your own TF-TG geneset, the file should be formatted like this:

name    description    ...
TF_0    TG_0
TF_0    TG_1
...     TG_2

The "name" column contains the TF, and the "description" column contains the TGs regulated by the corresponding TF, one per row.

Since VCF files used in this project require a dbGaP access request, only the RNA expression data are provided.
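The TF-TG geneset format described above could be read as in the following minimal sketch. It assumes tab-separated columns with a header row (the delimiter is an assumption), and `load_tf_targets` is a hypothetical helper for illustration, not part of MutTF.

```python
import csv
import io
from collections import defaultdict


def load_tf_targets(handle):
    """Group target genes (TGs) by transcription factor (TF).

    Assumes a tab-separated file whose first two columns are
    "name" (TF) and "description" (TG). This layout is an assumption.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    targets = defaultdict(list)
    for row in reader:
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # ignore blank or incomplete rows
        targets[row[0]].append(row[1])
    return dict(targets)


# Hypothetical file content in the format described above.
example = "name\tdescription\nTF_0\tTG_0\nTF_0\tTG_1\nTF_1\tTG_2\n"
tf_targets = load_tf_targets(io.StringIO(example))
print(tf_targets)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.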

Installation

Clone repository.

git clone https://github.com/BML-cbnu/MutTF
cd MutTF

Install the Python requirements.

pip install -r requirements.txt

Install the required R packages (tested with R 4.3.2 and Bioconductor 3.18). In the R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# CRAN packages
install.packages(c("data.table", "dplyr", "optparse", "ggplot2"))

# Bioconductor packages
BiocManager::install(c("GSVA", "AnnotationDbi", "org.Hs.eg.db", 
                       "edgeR", "limma", "maftools", "viper"))

# Specific versions used in our analysis:
# GSVA          1.44.5
# AnnotationDbi 1.60.2
# org.Hs.eg.db  3.16.0
# data.table    1.14.8
# dplyr         1.1.1
# optparse      1.7.3
# ggplot2       3.4.1
# edgeR         3.40.2
# limma         3.54.2
# maftools      2.14.0
# viper         1.32.0

How to execute code

In the command line, please run the following:

Step1. Mutational signature extraction

Input:
VCF file per sample (.vcf)
Variable:
- [reference genome] → Enter the reference genome you want to analyze (e.g. GRCh37).
- [minimum] → Minimum number of signatures to extract
- [maximum] → Maximum number of signatures to extract
- [input directory] → Directory where vcf files are located (e.g. input_data, or './sample_data/demo_vcfs/' for demo files).
- [output directory] → Directory where the output data should be stored.
- [threads] → Number of threads to use in signature extraction
Description:
SigProfilerMatrixGenerator converts the VCF files into a count matrix, and SigProfilerExtractor extracts signatures from that count matrix. The optimal number of signatures is selected and used for further analysis (refer to './[output directory]/SBS96/SBS96_selection_plot.pdf' for the selected number of signatures). This project uses SBS96-based signatures (96 mutation types of single base substitution) in further analysis. Refer to COSMIC Tools.
Output:
Directory including signature extraction results (./[output directory])
The results are as shown in the tables below:

Exposure Matrix
(./[output directory]/SBS96/Suggested_Solution/SBS96_De-Novo_Solution/Activities/SBS96_De-Novo_Activities_refit.txt)

Samples     SBS96A  SBS96B  ...
Sample 1    22      40
Sample 2    35      13
...         16      32

Process Matrix
(./[output directory]/SBS96/Suggested_Solution/SBS96_De-Novo_Solution/Signatures/SBS96_De-Novo_Signatures.txt)

MutationType  SBS96A  SBS96B  ...
A[C>A]A       0.024   0.014
A[C>A]C       0.012   0.052
...           0.081   0.068

$ python Signature_extraction.py --ref_genome=[reference genome] --minimum=[minimum] --maximum=[maximum] --input_dir=[input directory] --output_dir=[output directory] --threads=[threads]
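To make the SBS96 classification concrete, the sketch below enumerates the 96 mutation types in the notation used by the Process Matrix (for example A[C>A]A): six pyrimidine-centred substitution classes, each with four possible 5' bases and four possible 3' bases. The function name `sbs96_types` is hypothetical, and the ordering it produces is not guaranteed to match the row order of the SigProfiler output files.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, written with the pyrimidine as reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs96_types():
    """Return the 96 mutation types in 5'[ref>alt]3' notation."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


mutation_types = sbs96_types()
print(len(mutation_types), mutation_types[:2])
```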

Step2. Gene_count

Input:
VCF file per sample (.vcf)
Variable:
- [reference genome] → Enter the reference genome you want to analyze (e.g. GRCh37).
- [input directory] → Directory where vcf files are located (e.g. input_data).
- [output directory] → Directory where the output data should be stored.
- [threads] → Number of threads to use in multiprocessing
Description:
Before the contribution of signatures is calculated, gene-specific mutation counts are computed using the annotation file of the reference genome.
Output:
Gene count matrix per sample (./[output directory]/*_cnt.csv)
The results are as shown in the table below:

         Gene 1  Gene 2  ...
ACA>A    2       0
ACC>A    0       1
...      1       1

$ python Gene_count.py --ref_genome=[reference genome] --input_dir=[input directory] --output_dir=[output directory] --threads=[threads]
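The shape of the gene count matrix can be illustrated with a small sketch. It reads row keys such as "ACA>A" as a trinucleotide context followed by the alternate base, which is inferred from the example table, and it assumes variants have already been annotated with a gene name. The tuple input format and the helper `count_by_gene` are hypothetical; Gene_count.py itself works from VCF files and a genome annotation.

```python
from collections import Counter, defaultdict


def count_by_gene(variants):
    """Tally mutations per context key (e.g. "ACA>A") and per gene.

    `variants` is an iterable of (gene, trinucleotide, alt) tuples.
    This pre-annotated input format is an assumption for illustration.
    """
    counts = defaultdict(Counter)
    for gene, trinucleotide, alt in variants:
        counts[f"{trinucleotide}>{alt}"][gene] += 1
    return counts


# Hypothetical annotated variants for one sample.
variants = [
    ("Gene 1", "ACA", "A"),
    ("Gene 1", "ACA", "A"),
    ("Gene 2", "ACC", "A"),
]
gene_counts = count_by_gene(variants)
for key, per_gene in gene_counts.items():
    print(key, dict(per_gene))
```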

Step3. GSVA

Input:
TF-TG geneset file (.txt), Expression file (.tsv)
Variable:
- [TF-TG geneset file] → TF-TG geneset file (e.g. ./hTFTarget/colon_TF-Target-information.txt)
- [Expression file] → File name of gene expression file
- [GSVA output file] → File name of GSVA output results
Description:
Separate the TGs into positively and negatively regulated groups based on the correlation coefficient between each TG and the expression value of its corresponding TF. GSVA is then performed on these groups.
Output:
GSVA output file (./[GSVA output file].tsv)
The results are as shown in the table below:

Genesets  Sample 1  Sample 2  ...
TF1_0     0.4       0.3
TF1_1     -0.9      -0.1
...       0.1       -0.6

$ python GSVA.py -g [TF-TG geneset file] -e [Expression file] -o [GSVA output file]
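The grouping step described for this stage can be sketched as follows. This is an illustration under stated assumptions, not the code in GSVA.py: it uses the Pearson coefficient and splits at zero, whereas the actual script may use a different correlation measure or cutoff. The function names and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def split_targets(tf_expr, tg_expr):
    """Split TGs by the sign of their correlation with the TF (assumed cutoff: 0)."""
    positive, negative = [], []
    for gene, values in tg_expr.items():
        r = pearson(tf_expr, values)
        if r > 0:
            positive.append(gene)
        elif r < 0:
            negative.append(gene)
    return positive, negative


# Hypothetical expression values across four samples.
tf_expr = [1.0, 2.0, 3.0, 4.0]
tg_expr = {
    "TG_0": [2.0, 4.1, 6.2, 7.9],  # rises with the TF
    "TG_1": [9.0, 7.0, 4.0, 2.0],  # falls as the TF rises
}
positive, negative = split_targets(tf_expr, tg_expr)
print(positive, negative)
```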

Step4. MutTF

Input:
Signature extraction results (dir), Gene count matrix per sample (.csv), TF-TG geneset file (.txt), GSVA results (.tsv)
Variable:
- [Signature extraction directory] → Directory of signature extraction results (output from Signature_extraction.py)
- [Gene count directory] → Directory with gene-wise mutation count files (output from Gene_count.py)
- [TF-TG geneset file] → TF-TG geneset file used in GSVA.py (e.g. ./hTFTarget/colon_TF-Target-information.txt)
- [GSVA output file] → File name of GSVA results (output from GSVA.py)
- [Correlation output directory] → Directory of correlation results between signature-induced mutation count and GSVA
Description:
Calculate the signature's contribution (by sample). Analyze the correlation between gene-specific counts by signature and the GSVA score.
Output:
Correlation result matrix (All and Filtered) (./[Correlation output directory]/[All/Filt]_Result_[pos/neg].csv)
The results are as shown in the table below:

No.  Gene     sig           r                        p
0    Gene id  signature id  correlation coefficient  p-value

$ python MutTF.py --ext_dir=[Signature extraction directory] --count_dir=[Gene count directory] --tf_file=[TF-TG geneset file] --gsva_folder=[GSVA output file] --corr_dir=[Correlation output directory]
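One plausible way to attribute gene-specific counts to signatures is sketched below. For a given mutation type in a given sample, each signature's share is taken as its process probability times its exposure, normalised over all signatures. This attribution rule is an assumption for illustration and MutTF.py may compute the contribution differently. The numbers reuse the A[C>A]A row and Sample 1 row of the example tables in Step 1, and `signature_shares` is a hypothetical helper.

```python
def signature_shares(process_row, exposures):
    """Share of one mutation type attributable to each signature in one sample.

    process_row: probability of this mutation type under each signature.
    exposures:   number of mutations assigned to each signature in the sample.
    The normalised-product rule used here is an assumption.
    """
    weights = [p * e for p, e in zip(process_row, exposures)]
    total = sum(weights)
    if total == 0:
        return [0.0] * len(weights)
    return [w / total for w in weights]


process_row = [0.024, 0.014]  # A[C>A]A under SBS96A, SBS96B
exposures = [22, 40]          # Sample 1 exposure to SBS96A, SBS96B
shares = signature_shares(process_row, exposures)

gene_count = 2  # mutations of this type observed in one gene (illustrative)
per_signature = [gene_count * share for share in shares]
print([round(value, 3) for value in per_signature])
```

The per-signature counts obtained this way for each gene would then be correlated across samples with the GSVA scores of the corresponding TF geneset.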

Optional Code

Node_classification

Input:
Correlation results, GSVA results
Variable:
- [Correlation results] → File name of correlation results (output from MutTF.py)
- [direction] → Enter the group on which you want to run node classification (pos or neg)
- [GSVA results] → File name of GSVA results (output from GSVA.py)
- [Number of signatures] → The optimal number of signatures used for analysis
- [Output directory] → Directory of node classification results
Output:
Files including result of node classification and visualization.
The visualized graph figure is saved as '[Output_directory]/node_figure_XXX.png'

$ python Node_classification.py --corr_dir=[Correlation results] --pos_neg=[direction] --gsva=[GSVA results] --sig_num=[Number of signatures] --out_dir=[Output directory]

Denovo_cosine

Input:
Matrix P
Variable:
- [Signature extraction directory] → Directory of signature extraction results
- [reference genome] → Enter the reference genome you want to analyze (e.g. GRCh37).
- [version] → Enter the version of the COSMIC signatures you want to compare against (e.g. 3.3.1)
Output:
Image showing cosine similarity
A heat map shows how similar the optimal de novo signatures extracted by NMF.py are to the COSMIC signatures.
The reference signatures are taken from COSMIC Signatures.
Examples are as follows:

$ python Denovo_cosine.py --ext_dir=[Signature extraction directory] --ref_genome=[reference genome] --version=[version]
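The similarity measure behind the heat map is cosine similarity between two signature vectors. A minimal sketch follows, with vectors truncated to three mutation types for brevity (real SBS96 signatures have 96 entries). The reference values are hypothetical and the function is illustrative, not the code in Denovo_cosine.py.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector
    return dot / (norm_a * norm_b)


de_novo = [0.024, 0.012, 0.081]    # truncated de novo signature
reference = [0.048, 0.024, 0.162]  # hypothetical reference, proportional to de_novo
similarity = cosine_similarity(de_novo, reference)
print(round(similarity, 6))
```

Because cosine similarity ignores vector length, two signatures with the same shape score 1 even when their scales differ.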

Contributors

Jiwon You
- M.S.
- Department of Computer Engineering, Chungbuk National University, Republic of Korea.
- 71one.you@gmail.com
Yooeun Kim
- M.S.
- Interdisciplinary Program in Bioinformatics, Seoul National University, Republic of Korea.
- ys910111@snu.ac.kr
