DExOmics

This repository provides code and processed data for the paper:
"An Interpretable Multi-Omics Integration Framework for Differential Expression Prediction in Cancer".

1. Dependencies

Tested Environment
OS: Ubuntu 24.04.1 LTS
Kernel: 6.8.0-51-generic
conda: 24.4.0

To recreate the conda environment, run:

git clone https://github.com/hmdlab/DExOmics.git
cd DExOmics
conda env create -f dependencies_all.yml
conda activate dexomics

2. Data Sources

All data download scripts are in the data_download/ directory.

Pan-cancer study

Create a directory:
```
mkdir data
```
Download the archive pancan_data.tar.gz and place it inside the data directory.
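
Later commands reference files under data/pancan_data/, so the archive presumably has to be unpacked inside data/. A minimal sketch of that step is shown below. It assumes the archive was saved as data/pancan_data.tar.gz and expands into a pancan_data/ folder; the helper name `extract_archive` is hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        # Refuse members that would be written outside dest_dir.
        for name in names:
            target = (dest / name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {name}")
        tar.extractall(dest)
    return names
```

From the repository root, this would be called as `extract_archive("data/pancan_data.tar.gz", "data")`.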

Cancer-specific study

Run the following command to download TCGA omics data, replacing [cancer_type] with a TCGA project code (LIHC or CESC):
```
Rscript load_*.R [cancer_type]
```
Output files will be stored under data/TCGAdata/.
Transcription factor (TF) and RNA-binding protein (RBP) binding peak files are listed in the .txt.gz files under data_download/. To download them, run:
```
bash load_regulator.sh
```

Metadata must be downloaded manually using the first line of each of these files and placed in the corresponding directories.

Additionally, download human.txt.gz from POSTAR3 and place it in the data/ directory. You can extract HeLa RBP-binding BED files using:
```
Rscript split_HeLa.R
```
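
For readers who want to see what such a split involves, here is a rough Python sketch of the idea: read the gzipped table, keep the rows for one cell line, and group the intervals by RBP. It is not a port of split_HeLa.R. The function name and the column positions (`rbp_col`, `cell_col`) are assumptions made for illustration and should be checked against the actual POSTAR3 file.

```python
import gzip
from collections import defaultdict


def split_by_rbp(table_path, cell_line="HeLa", rbp_col=5, cell_col=7):
    """Group BED-style records for one cell line by RBP name.

    Column positions are assumptions about the table layout, not taken
    from the repository; adjust them to match the real file.
    """
    records = defaultdict(list)
    with gzip.open(table_path, mode="rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= max(rbp_col, cell_col):
                continue  # skip short or malformed rows
            if fields[cell_col] != cell_line:
                continue
            # Keep chrom, start, end as a minimal BED3 record.
            records[fields[rbp_col]].append(tuple(fields[:3]))
    return dict(records)
```

Each key of the returned dictionary is an RBP name, and each value is the list of intervals that could be written to one BED file.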

3. Preprocessing and Integration

Mapping BED Features to RNA Coordinates

Convert BED-format genomic interactions to transcript-relative coordinates and sparse matrices:

mkdir data/promoter_features
Rscript 01_bed_to_RNA_coord.R -b "../data/HepG2_bed_rna" -n 100 -g "../data/pancan_data/references_v8_gencode.v26.GRCh38.genes.gtf" -t "promoter" -o "../data/promoter_features/encode_hepg2_promoter" -s "ENCODE"
python 02_to_sparse.py ../data/promoter_features/encode_hepg2_promoter.txt

mkdir data/rna_features
Rscript 01_bed_to_RNA_coord.R -b "../data/HepG2_bed_rna" -n 100 -g "../data/pancan_data/references_v8_gencode.v26.GRCh38.genes.gtf" -t "rna" -o "../data/rna_features/encode_hepg2_rna" -s "ENCODE"
python 02_to_sparse.py ../data/rna_features/encode_hepg2_rna.txt

Arguments:
-b: BED file directory
-n: Bin size (genomic resolution)
-g: Path to the GTF annotation
-t: Type ("promoter" or "rna")
-o: Output path
-s: Data source name
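
To illustrate what a bin size of 100 means in practice, the sketch below maps a binding peak onto fixed-width bins counted from the 5' end of a region (a promoter window or a transcript). This is only an illustration of the general binning idea. How 01_bed_to_RNA_coord.R actually defines regions and bins is not shown in this README, so the function `overlapping_bins` and its conventions are assumptions.

```python
def overlapping_bins(region_start, region_end, peak_start, peak_end,
                     bin_size=100, strand="+"):
    """Return indices of fixed-width bins in a region that a peak overlaps.

    Coordinates are half-open [start, end) as in BED. Bin 0 sits at the
    5' end of the region: its start on "+", its end on "-".
    """
    # Clip the peak to the region; no overlap yields an empty list.
    start = max(region_start, peak_start)
    end = min(region_end, peak_end)
    if start >= end:
        return []
    # Express the clipped peak as offsets from the region's 5' end.
    if strand == "-":
        rel_start, rel_end = region_end - end, region_end - start
    else:
        rel_start, rel_end = start - region_start, end - region_start
    first = rel_start // bin_size
    last = (rel_end - 1) // bin_size
    return list(range(first, last + 1))
```

For example, with a region spanning 1000 to 2000 and a peak spanning 1150 to 1320, the peak falls in bins 1 to 3 on the "+" strand and bins 6 to 8 on the "-" strand.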

You can also download the preprocessed files.

TCGA Data Preprocessing and Integration

First, download the Methylation Array Gene Annotation File, place it in data/TCGAdata/, and unzip it. Then, from scripts/cancer_specific/, run:

Rscript data_observe.R LIHC
Rscript dea.R LIHC hepg2
Rscript data_merge.R LIHC hepg2 TRUE    # TRUE to merge with ENCODE expression data
python get_HepG2_genes.py LIHC hepg2

Replace the arguments with the desired TCGA project and its related cell line. The processed data is also available as TCGAprocessed.tar.gz.

4. Analysis

Pan-cancer study

From scripts/pan_cancer/, train and evaluate the model:

python pretrain.py ../../pancanatlas_model/ -p pancanatlas -bs 50 -n 100 -lr 0.001 -step 30 -reg 0.001
python eval.py ../../pancanatlas_model/ -p pancanatlas -n 100 -reg 0.001
Rscript calc_performance.R pancanatlas

Interpret results using DeepLIFT:

python compute_shap.py ../../shap/DeepLIFT_pancanatlas/ -p pancanatlas
Rscript summarize_SHAP.R pancanatlas ../../shap/DeepLIFT_pancanatlas/
Rscript shap_plot.R pancanatlas ../../shap/DeepLIFT_pancanatlas/ ../../plots_pancanatlas/

Cancer-specific study

Under scripts/cancer_specific/, train and evaluate the model:

python pretrain.py LIHC hepg2 ../../model_LIHC/concat/ -bs 50 -n 100 -lr 0.001 -step 30 -reg 0.0001
python eval.py LIHC hepg2 ../../model_LIHC/concat/ -n 100 -reg 0.0001

Interpret the results using ExpectedGrad:

python compute_shap.py LIHC hepg2 ../../shap/ExpectedGrad_LIHC/
Rscript summarize_SHAP.R LIHC ../../shap/ExpectedGrad_LIHC/
Rscript shap_plot.R ../../shap/ExpectedGrad_LIHC/ ../../plots_LIHC/global/

You can replace LIHC/hepg2 with other projects and cell lines (e.g., CESC/hela). Pretrained models are available here, and the resulting metrics are recorded here.

5. Citation

If you find this project helpful, please cite:
