# MCIENet: Multi-scale CNN-based Information Extraction from DNA Sequences for 3D Chromatin Interactions Prediction
```
MCIENet/
├── MCIENet/
│   ├── model/                   # Model implementation
│   │   ├── classifier.py        # Classifier implementation
│   │   ├── data_extractor.py    # Data extractor
│   │   ├── layers.py            # Custom neural network layers
│   │   └── utils.py             # Model-related utility functions
│   │
│   ├── utils/                   # Utility functions
│   │
│   ├── dataset.py               # Dataset handling
│   ├── loop_model.py            # Model training loop
│   └── trainer.py               # Trainer implementation
│
├── conf/                        # Configuration files
├── data/                        # Dataset directory
├── docker/                      # Docker-related files
├── figures/                     # Figures
├── notebook/                    # Jupyter notebooks
├── output/                      # Training outputs and logs
├── scripts/                     # Utility scripts organized by workflow stage
│   ├── 1_get_neg-pos_data/      # Scripts for generating pos/neg pairs
│   ├── 2_generate_traindata/    # Scripts for preparing training data
│   ├── 3_train/                 # Scripts for model training
│   ├── 4_XAI/                   # Scripts for explainable AI analysis
│   ├── helper_scripts/          # Helper utilities
│   └── set_env/                 # Environment setup scripts
│
├── data_helper.py               # Data processing utilities
├── get_attr.py                  # Attribution (XAI) entry point
└── train.py                     # Training entry point
```
```bash
git clone https://github.com/aaron-ho/MCIENet.git
```

Note: you may need to install `docker` and `docker-compose` first.
Create and enter the container:

```bash
# Build and start the container (in the background)
docker-compose -f docker/docker-compose.yml up -d

# Enter the container
docker-compose -f docker/docker-compose.yml exec mcienet /bin/bash
# You can now use MCIENet from the command line ...
```

Exit and remove the container:

```bash
# To exit the container
exit

# To stop and remove the container
docker-compose -f docker/docker-compose.yml down
```

Note: the image takes about 16.4 GB of disk space. If you don't have enough disk space, use option 2.
Setup scripts are located under `scripts/set_env`; you can use them to set up the environment:

- `set-env_conda`: set up a conda environment for MCIENet
- `set-env_venv`: set up a venv environment for MCIENet
- `set-env_dnabert`: set up a conda environment for DNABERT

Note: these scripts are for reference only; you need to customize the environment paths in each script.
This project includes pre-processed data for two example datasets, GM12878 CTCF and HeLa-S3 CTCF, located in the `data/proc/` directory. The pre-processing steps have already been completed for these example datasets. If you plan to use them, you can skip the 2.1 Generate Pos-Neg Pairs step and proceed directly to 2.2 Generate Training Data.
- Raw data: `data/raw/`
  - Contains the original input files (e.g., BED, BAM, FASTA files)
  - Important: you need to download the hg19 reference genome (`hg19.fa`) from UCSC and place it in this directory before running the scripts.
- Processed data: `data/proc/`
  - Contains pre-processed data ready for training
- Training data: `data/train/`
  - Will contain the final training data generated from the processed data
The `scripts/1_get_neg-pos_data/` directory contains the data processing pipeline that transforms raw interaction data into a training-ready format. This step is crucial for preparing both positive and negative samples for model training.
- `preprocess/`: scripts for initial data processing
  - `pipe.sh`: main pipeline script that orchestrates the preprocessing steps
  - `process_pos.sh`: processes positive interaction samples
  - `generate_*.py`: Python scripts for generating and processing sample pairs
- `gm12878_ctcf.sh` and `helas3_ctcf.sh`: example scripts demonstrating how to run the pipeline
- BEDTools (includes `mergeBed` and `pairToBed`):
  - Ubuntu/Debian: `sudo apt-get install bedtools`
  - For other systems, see the BEDTools documentation
1. Place your raw data in the `data/raw/` directory:
   - Interaction files in BEDPE format
   - DNase/open chromatin regions in BED format
   - Transcription factor peaks in BED format

2. Run the preprocessing pipeline:

   ```bash
   # Example command structure
   ./scripts/1_get_neg-pos_data/preprocess/pipe.sh \
       <interactions.bedpe> \
       <dnase.bed> \
       <tf_peaks.bed> \
       <sample_name> \
       <output_directory>
   ```
Note: the preprocessing scripts in `scripts/1_get_neg-pos_data/preprocess/` are adapted from the ChINN repository.
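Conceptually, the pipeline pairs each positive interaction with negative anchor pairs drawn from open-chromatin regions, and a common strategy is to match negatives to positives by genomic distance so the classifier cannot separate the classes on distance alone. The sketch below only illustrates that idea; the function name, its arguments, and the exact sampling strategy are assumptions for illustration, not the actual pipeline code:

```python
import random

def distance_matched_negatives(positives, open_regions, n_per_pos=5, seed=0):
    """For each positive anchor pair, sample negative pairs of open-chromatin
    regions whose genomic distance matches the positive pair's distance.

    positives:    list of (chrom, mid1, mid2) anchor midpoints, mid2 > mid1
    open_regions: dict chrom -> list of region midpoints (e.g. DNase peaks)
    """
    rng = random.Random(seed)
    negatives = []
    for chrom, a1, a2 in positives:
        dist = a2 - a1                      # distance to preserve
        mids = open_regions.get(chrom, [])
        for _ in range(n_per_pos):
            left = rng.choice(mids)         # random open-chromatin anchor
            negatives.append((chrom, left, left + dist))
    return negatives

# Toy example: one positive pair 5 kb apart, two matched negatives
pos = [("chr1", 1000, 6000)]
regions = {"chr1": [500, 2000, 9000]}
neg = distance_matched_negatives(pos, regions, n_per_pos=2)
print(neg)
```

Every sampled negative preserves the 5 kb separation of its positive, so the label cannot be predicted from anchor distance.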
For this tutorial, we'll use the pre-processed data in `data/proc/gm12878_ctcf` to generate the training data in `data/train/gm12878_ctcf`. You must download the DNA sequence data (`hg19.fa`) from UCSC.
Run:

```bash
scripts/2_generate_traindata/gm12878_ctcf/Linux/1000bp.onehot.sh
```

After the script finishes, you will find the training data in `data/train/gm12878_ctcf/1000bp.50ms.onehot/data.h5`.

Note: this process needs at least 4 GB of memory. The example script uses a 1000 bp anchor size; using 2000 or 3000 bp requires more memory.
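The `onehot` in the output name refers to one-hot encoding of the anchor DNA sequences. A minimal sketch of what such an encoding looks like; the `(4, length)` channel layout and the all-zero handling of `N` bases are illustrative assumptions, and the real encoding is implemented by the scripts in `scripts/2_generate_traindata/`:

```python
import numpy as np

def one_hot_dna(seq, length=1000):
    """One-hot encode a DNA string into a (4, length) float32 array,
    with rows A, C, G, T. Sequences longer than `length` are truncated,
    shorter ones are zero-padded; N and other characters stay all-zero."""
    row = {"A": 0, "C": 1, "G": 2, "T": 3}
    arr = np.zeros((4, length), dtype=np.float32)
    for i, base in enumerate(seq.upper()[:length]):
        j = row.get(base)
        if j is not None:
            arr[j, i] = 1.0
    return arr

x = one_hot_dna("ACGTN")
print(x[:, :5])  # first five columns: identity-like pattern, then a zero column for N
```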
Train the BaseCNN model:

```bash
scripts/3_train/example/BaseCNN.sh
```

Train the MCIENet model:

```bash
scripts/3_train/example/MCIENet.sh
```

Perform model interpretation using various XAI methods. Here we demonstrate DeepLift; you can also use other methods such as LIME or SHAP. We use the models already trained under `output/best` as the model paths. Detailed argument descriptions can be found in `get_attr.py`.
For BaseCNN:

```bash
python get_attr.py \
    --model_folder "output/best/BaseCNN-gm12878.ctcf-1kb" \
    --output_folder "output/XAI/BaseCNN-gm12878.ctcf-1kb" \
    --data_folder "data/train/gm12878_ctcf/1000bp.50ms.onehot" \
    --phases train val test \
    --batch_size 500 \
    --method "DeepLift" \
    --crop_center 500 \
    --crop_size 1000 \
    --use_cuda True
```

For MCIENet:

```bash
python get_attr.py \
    --model_folder "output/best/MCIENet-gm12878.ctcf-1kb" \
    --output_folder "output/XAI/MCIENet-gm12878.ctcf-1kb" \
    --data_folder "data/train/gm12878_ctcf/1000bp.50ms.onehot" \
    --phases train val test \
    --batch_size 500 \
    --method "DeepLift" \
    --crop_center 500 \
    --crop_size 1000 \
    --use_cuda True
```

Note: more example scripts can be found in `scripts/4_XAI/Linux`.
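DeepLift itself is normally run through an attribution library (Captum, for instance, provides an implementation). As a self-contained illustration of what an attribution method returns per input position, here is gradient × input for a toy linear scorer; this is a simplified stand-in for demonstration, not the DeepLift algorithm and not the code in `get_attr.py`:

```python
import numpy as np

def input_x_gradient(w, x):
    """Gradient x input attribution for a linear scorer f(x) = w . x.
    For a linear model the gradient at x is just w, so position i gets
    attribution w[i] * x[i], and the attributions sum exactly to the score
    (the 'completeness' property DeepLift also satisfies w.r.t. a baseline)."""
    grad = w            # d f / d x for a linear model
    return grad * x

rng = np.random.default_rng(0)
w = rng.normal(size=8)      # toy model weights
x = rng.normal(size=8)      # toy input
attr = input_x_gradient(w, x)
print(attr.sum(), w @ x)    # completeness: the two values match
```

Large positive attributions mark positions pushing the score up; negative ones push it down, which is how per-base importance maps over DNA sequences are read.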
- Cao, Fan, et al. "Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences." Genome Biology 22 (2021): 1-25. https://doi.org/10.1186/s13059-021-02453-5
  - GitHub: https://github.com/mjflab/chinn
- Zhou, Zhihan, et al. "DNABERT-2: Efficient foundation model and benchmark for multi-species genome." arXiv preprint arXiv:2306.15006 (2023). https://doi.org/10.48550/arXiv.2306.15006
  - GitHub: https://github.com/MAGICS-LAB/DNABERT_2
  - Pretrained model: https://huggingface.co/zhihan1996/DNABERT-2-117M
This implementation is for learning purposes only. For research, please refer to and cite the following paper:

```
@inproceedings{MCIENet,
  author    = "Yen-Nan Ho and Jia-Ming Chang",
  title     = "MCIENet: Multi-scale CNN-based Information Extraction from DNA Sequences for 3D Chromatin Interactions Prediction",
  booktitle = "",
  pages     = "",
  year      = "2025",
}
```
