This repository provides code and processed data for the paper:
"An Interpretable Multi-Omics Integration Framework for Differential Expression Prediction in Cancer".
Tested Environment
OS: Ubuntu 24.04.1 LTS
Kernel: 6.8.0-51-generic
conda: 24.4.0
To recreate the virtual environment, run:
git clone https://github.com/hmdlab/DExOmics.git
cd DExOmics
conda env create -f dependencies_all.yml
conda activate dexomics
All data downloading scripts are provided under the data_download/ directory.
- Create a directory:
mkdir data
- Download the archive pancan_data.tar.gz and place it inside the data/ directory.
- Run the following command to download TCGA omics data for LIHC and CESC:
Rscript load_*.R [cancer_type]
Output files will be stored under data/TCGAdata/.
- Transcription factor (TF) and RNA-binding protein (RBP) binding peak files are defined in .txt.gz files under data_download/. To download them, run:
bash load_regulator.sh
Metadata must be downloaded manually from the first lines of these files and placed in the corresponding directories.
- Additionally, download human.txt.gz from POSTAR3 and place it in the data/ directory. You can extract HeLa RBP-binding BED files using:
Rscript split_HeLa.R
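Before moving on, it can help to confirm that the downloads landed where the later steps expect them. The following is a minimal sketch, not part of the repository: the function name `missing_inputs` and the path list are hypothetical, and `data/pancan_data` assumes that extracting pancan_data.tar.gz produces a directory of that name (as the GTF path used in the next step suggests).

```python
from pathlib import Path

# Paths mentioned in the download steps, relative to the repository root.
# This list is an illustrative assumption; adjust it to your own setup.
EXPECTED_PATHS = (
    "data/pancan_data",   # assumed result of extracting pancan_data.tar.gz
    "data/TCGAdata",      # output directory of the load_*.R scripts
    "data/human.txt.gz",  # POSTAR3 download
)


def missing_inputs(repo_root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when everything is present.
    for rel in missing_inputs("."):
        print(f"missing: {rel}")
```

An empty result only means the paths exist; it does not validate their contents.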
Convert BED-format genomic interactions to transcript-relative coordinates and sparse matrices:
mkdir data/promoter_features
Rscript 01_bed_to_RNA_coord.R -b "../data/HepG2_bed_rna" -n 100 -g "../data/pancan_data/references_v8_gencode.v26.GRCh38.genes.gtf" -t "promoter" -o "../data/promoter_features/encode_hepg2_promoter" -s "ENCODE"
python 02_to_sparse.py ../data/promoter_features/encode_hepg2_promoter.txt
mkdir data/rna_features
Rscript 01_bed_to_RNA_coord.R -b "../data/HepG2_bed_rna" -n 100 -g "../data/pancan_data/references_v8_gencode.v26.GRCh38.genes.gtf" -t "rna" -o "../data/rna_features/encode_hepg2_rna" -s "ENCODE"
python 02_to_sparse.py ../data/rna_features/encode_hepg2_rna.txt
Arguments:
-b: BED file directory
-n: Bin size (genomic resolution)
-g: Path to the GTF annotation
-t: Type ("promoter" or "rna")
-o: Output path
-s: Data source name
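To illustrate what a bin size such as -n 100 means in practice, the sketch below bins binding intervals along a transcript and stores the counts sparsely. It is not the implementation in 01_bed_to_RNA_coord.R or 02_to_sparse.py: the function names and the toy peaks are hypothetical, coordinates are assumed to be transcript-relative, 0-based, and half-open (the BED convention), and a plain dictionary stands in for a sparse matrix.

```python
from collections import defaultdict


def bins_overlapped(start, end, bin_size):
    """Indices of fixed-width bins touched by the half-open interval [start, end)."""
    if end <= start:
        return range(0)
    return range(start // bin_size, (end - 1) // bin_size + 1)


def to_sparse_counts(peaks, bin_size=100):
    """Map (gene, regulator, start, end) peaks to {(gene, regulator, bin): count}.

    Only non-empty bins are stored, which is the point of a sparse layout.
    """
    counts = defaultdict(int)
    for gene, regulator, start, end in peaks:
        for b in bins_overlapped(start, end, bin_size):
            counts[(gene, regulator, b)] += 1
    return dict(counts)


# Toy peaks in transcript-relative coordinates (made up for illustration).
peaks = [
    ("geneA", "TF1", 50, 250),    # touches bins 0, 1 and 2
    ("geneA", "TF1", 120, 180),   # falls inside bin 1
    ("geneB", "RBP1", 300, 400),  # covers exactly bin 3
]
sparse = to_sparse_counts(peaks, bin_size=100)
```

A smaller bin size gives finer positional resolution at the cost of more features per gene.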
You can also download preprocessed files:
First, download the Methylation Array Gene Annotation File, place it in data/TCGAdata/, and unzip it. Then, from scripts/cancer_specific/, run:
Rscript data_observe.R LIHC
Rscript dea.R LIHC hepg2
Rscript data_merge.R LIHC hepg2 TRUE # TRUE to merge with ENCODE expression data
python get_HepG2_genes.py LIHC hepg2
Replace the arguments with the desired TCGA project and its related cell line. The processed data is also available as TCGAprocessed.tar.gz.
Train and evaluate the model under scripts/pan_cancer/:
python pretrain.py ../../pancanatlas_model/ -p pancanatlas -bs 50 -n 100 -lr 0.001 -step 30 -reg 0.001
python eval.py ../../pancanatlas_model/ -p pancanatlas -n 100 -reg 0.001
Rscript calc_performance.R pancanatlas
Interpret results using DeepLIFT:
python compute_shap.py ../../shap/DeepLIFT_pancanatlas/ -p pancanatlas
Rscript summarize_SHAP.R pancanatlas ../../shap/DeepLIFT_pancanatlas/
Rscript shap_plot.R pancanatlas ../../shap/DeepLIFT_pancanatlas/ ../../plots_pancanatlas/
Under scripts/cancer_specific/, train and evaluate the model:
python pretrain.py LIHC hepg2 ../../model_LIHC/concat/ -bs 50 -n 100 -lr 0.001 -step 30 -reg 0.0001
python eval.py LIHC hepg2 ../../model_LIHC/concat/ -n 100 -reg 0.0001
Interpret using ExpectedGrad:
python compute_shap.py LIHC hepg2 ../../shap/ExpectedGrad_LIHC/
Rscript summarize_SHAP.R LIHC ../../shap/ExpectedGrad_LIHC/
Rscript shap_plot.R ../../shap/ExpectedGrad_LIHC/ ../../plots_LIHC/global/
You can substitute LIHC/hepg2 with other projects and cell lines (e.g., CESC/hela). Pretrained models are available here, and the resulting metrics are recorded here.
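When switching to another project and cell line, the same five commands can be generated rather than edited by hand. The helper below is a hypothetical sketch, not a script shipped with the repository: it assumes the output directories follow the same naming pattern as the LIHC example (model_<project>, ExpectedGrad_<project>, plots_<project>) and that the LIHC hyperparameters are reused unchanged, which may not be appropriate for every project.

```python
import shlex


def cancer_specific_commands(project, cell_line, root="../.."):
    """Build the cancer-specific command lines for one project/cell line pair.

    Directory names mirror the LIHC example; this pattern is an assumption.
    """
    model_dir = f"{root}/model_{project}/concat/"
    shap_dir = f"{root}/shap/ExpectedGrad_{project}/"
    plot_dir = f"{root}/plots_{project}/global/"
    return [
        ["python", "pretrain.py", project, cell_line, model_dir,
         "-bs", "50", "-n", "100", "-lr", "0.001", "-step", "30", "-reg", "0.0001"],
        ["python", "eval.py", project, cell_line, model_dir,
         "-n", "100", "-reg", "0.0001"],
        ["python", "compute_shap.py", project, cell_line, shap_dir],
        ["Rscript", "summarize_SHAP.R", project, shap_dir],
        ["Rscript", "shap_plot.R", shap_dir, plot_dir],
    ]


if __name__ == "__main__":
    # Print the commands for review; run them from scripts/cancer_specific/.
    for argv in cancer_specific_commands("CESC", "hela"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch side-effect free; each list can be passed to `subprocess.run(argv, check=True)` once the paths have been checked.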
If you find this project helpful, please cite: