Interpreting differential gene expression in cancer through AI-guided analysis of multi-omics data

hmdlab/DExOmics

DExOmics

This repository provides code and processed data for the paper:
"An Interpretable Multi-Omics Integration Framework for Differential Expression Prediction in Cancer".

1. Dependencies

Tested Environment
OS: Ubuntu 24.04.1 LTS
Kernel: 6.8.0-51-generic
conda: 24.4.0

To recreate the virtual environment, run:

git clone https://github.com/hmdlab/DExOmics.git
cd DExOmics
conda env create -f dependencies_all.yml
conda activate dexomics

2. Data Sources

All data downloading scripts are provided under the data_download/ directory.

Pan-cancer study

  • Create a directory:
    mkdir data
  • Download the archive pancan_data.tar.gz and place it inside the data directory.

Cancer-specific study

  • Run the following command to download TCGA omics data for LIHC and CESC:

    Rscript load_*.R [cancer_type]

    Output files will be stored under data/TCGAdata/.

  • Transcription factor (TF) and RNA-binding protein (RBP) binding peak files are defined in .txt.gz files under data_download/. To download them, run:

    bash load_regulator.sh

Metadata must be downloaded manually from the first line of each of these files and placed in the corresponding directories.

  • Additionally, download human.txt.gz from POSTAR3 and place it in the data/ directory. You can extract HeLa RBP-binding BED files using:
    Rscript split_HeLa.R

3. Preprocessing and Integration

Mapping BED Features to RNA Coordinates

Convert BED-format genomic interactions to transcript-relative coordinates and sparse matrices:

mkdir data/promoter_features
Rscript 01_bed_to_RNA_coord.R -b "../data/HepG2_bed_rna" -n 100 -g "../data/pancan_data/references_v8_gencode.v26.GRCh38.genes.gtf" -t "promoter" -o "../data/promoter_features/encode_hepg2_promoter" -s "ENCODE"
python 02_to_sparse.py ../data/promoter_features/encode_hepg2_promoter.txt

mkdir data/rna_features
Rscript 01_bed_to_RNA_coord.R -b "../data/HepG2_bed_rna" -n 100 -g "../data/pancan_data/references_v8_gencode.v26.GRCh38.genes.gtf" -t "rna" -o "../data/rna_features/encode_hepg2_rna" -s "ENCODE"
python 02_to_sparse.py ../data/rna_features/encode_hepg2_rna.txt

Arguments:
-b: BED file directory
-n: Bin size (genomic resolution)
-g: Path to the GTF annotation
-t: Type ("promoter" or "rna")
-o: Output path
-s: Data source name
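
To illustrate what the bin size (-n) and the sparse output mean in practice, here is a minimal Python sketch. It is not the logic of 01_bed_to_RNA_coord.R or 02_to_sparse.py; the function names overlapping_bins and sparse_counts, the input layout, and the choice of counting overlaps per bin are all assumptions made for illustration. It maps half-open BED intervals to the fixed-width bins they overlap within a region, then stores only the non-zero (gene, bin) counts.

```python
from collections import Counter


def overlapping_bins(start, end, region_start, bin_size=100):
    """Return indices of the fixed-width bins overlapped by [start, end).

    Coordinates are 0-based and half-open, as in BED. Bin 0 begins at
    region_start (a hypothetical region origin, e.g. a promoter start).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    if end <= start:
        raise ValueError("interval must have positive length")
    rel_start = max(start - region_start, 0)  # clip to the region origin
    rel_end = end - region_start
    if rel_end <= 0:
        return []  # interval lies entirely upstream of the region
    first = rel_start // bin_size
    last = (rel_end - 1) // bin_size  # end is exclusive
    return list(range(first, last + 1))


def sparse_counts(peaks, region_starts, bin_size=100):
    """Count peak overlaps per (gene_id, bin); zero entries are never stored.

    peaks: iterable of (gene_id, start, end) tuples (hypothetical layout).
    region_starts: dict mapping gene_id to the region origin.
    """
    counts = Counter()
    for gene_id, start, end in peaks:
        for b in overlapping_bins(start, end, region_starts[gene_id], bin_size):
            counts[(gene_id, b)] += 1
    return counts


if __name__ == "__main__":
    demo_peaks = [("geneA", 1050, 1260), ("geneA", 1100, 1200)]
    print(sparse_counts(demo_peaks, {"geneA": 1000}))
```

With a bin size of 100, a peak spanning 210 bases touches at most four bins, so most (gene, bin) entries stay empty. That sparsity is why a sparse representation is a natural fit for this kind of feature matrix.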

You can also download preprocessed files:

TCGA Data Preprocessing and Integration

First, download the Methylation Array Gene Annotation File, place it in data/TCGAdata/, and unzip it. Then, from scripts/cancer_specific/, run:

Rscript data_observe.R LIHC
Rscript dea.R LIHC hepg2
Rscript data_merge.R LIHC hepg2 TRUE    # TRUE to merge with ENCODE expression data
python get_HepG2_genes.py LIHC hepg2

Replace arguments with the desired TCGA project and related cell line. The processed data is also available as TCGAprocessed.tar.gz.

4. Analysis

Pan-cancer study

Train and evaluate the model under scripts/pan_cancer/:

python pretrain.py ../../pancanatlas_model/ -p pancanatlas -bs 50 -n 100 -lr 0.001 -step 30 -reg 0.001
python eval.py ../../pancanatlas_model/ -p pancanatlas -n 100 -reg 0.001
Rscript calc_performance.R pancanatlas

Interpret results using DeepLIFT:

python compute_shap.py ../../shap/DeepLIFT_pancanatlas/ -p pancanatlas
Rscript summarize_SHAP.R pancanatlas ../../shap/DeepLIFT_pancanatlas/
Rscript shap_plot.R pancanatlas ../../shap/DeepLIFT_pancanatlas/ ../../plots_pancanatlas/

Cancer-specific study

Under scripts/cancer_specific/, train and evaluate the model:

python pretrain.py LIHC hepg2 ../../model_LIHC/concat/ -bs 50 -n 100 -lr 0.001 -step 30 -reg 0.0001
python eval.py LIHC hepg2 ../../model_LIHC/concat/ -n 100 -reg 0.0001

Interpret results using Expected Gradients (ExpectedGrad):

python compute_shap.py LIHC hepg2 ../../shap/ExpectedGrad_LIHC/
Rscript summarize_SHAP.R LIHC ../../shap/ExpectedGrad_LIHC/
Rscript shap_plot.R ../../shap/ExpectedGrad_LIHC/ ../../plots_LIHC/global/

You can substitute LIHC/hepg2 with other projects and cell lines (e.g., CESC/hela). Pretrained models are available here, and the resulting metrics are recorded here.
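
Because the cancer-specific commands differ only in the project and cell-line arguments, the substitution can be scripted. The sketch below only rebuilds the argument lists shown above for a given pair. The helper name cancer_specific_commands is hypothetical, and the sketch assumes that the directory naming pattern of the LIHC example (model_<PROJECT>, ExpectedGrad_<PROJECT>, plots_<PROJECT>) and its hyperparameters carry over to other projects, which the repository does not state.

```python
import shlex


def cancer_specific_commands(project, cell_line):
    """Build the train/eval/interpret commands for one project and cell line.

    Paths follow the LIHC example in this README; reusing the same pattern
    and hyperparameters for other projects is an assumption.
    """
    model_dir = f"../../model_{project}/concat/"
    shap_dir = f"../../shap/ExpectedGrad_{project}/"
    plot_dir = f"../../plots_{project}/global/"
    return [
        ["python", "pretrain.py", project, cell_line, model_dir,
         "-bs", "50", "-n", "100", "-lr", "0.001", "-step", "30", "-reg", "0.0001"],
        ["python", "eval.py", project, cell_line, model_dir,
         "-n", "100", "-reg", "0.0001"],
        ["python", "compute_shap.py", project, cell_line, shap_dir],
        ["Rscript", "summarize_SHAP.R", project, shap_dir],
        # As in the LIHC example, shap_plot.R takes no project argument here.
        ["Rscript", "shap_plot.R", shap_dir, plot_dir],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in cancer_specific_commands("CESC", "hela"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them against the examples above. To execute them, each list could be passed to subprocess.run(cmd, check=True) from within scripts/cancer_specific/.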

5. Citation

If you find this project helpful, please cite:
