HiCGen

A hierarchical and cell-type-specific genome organization generator.

HiCGen is a deep learning framework for predicting multiscale 3D genome organization (1 kb to 128 kb resolution) using DNA sequences and genomic features. Built on Swin-Transformer, HiCGen enables cross-cell-type predictions and in silico perturbation analysis to study structural consequences of genetic/epigenetic changes.

Paper: Formal Publication | bioRxiv Preprint | Demo Data: Data link

HiCGen Overview

Key Features

  • Multiscale Prediction: Generate hierarchical contact maps (1 kb to 128 kb resolutions) from sequence and epigenetic signals.
  • Cross-Cell Generalization: Predict chromatin architecture for unseen cell types using cell-specific ATAC-seq/ChIP-seq profiles.
  • Perturbation Analysis: Simulate structural changes caused by enhancer/promoter activation/silencing or CTCF boundary editing.

Installation

Dependencies

Setup

  1. Clone this repository:
    git clone https://github.com/JWei2015/HiCGen.git
    cd HiCGen
  2. Install dependencies via conda:
    conda create -n hicgen python=3.9
    conda activate hicgen
    conda env update -f requirements.txt
    

Usage

Data Preparation

  1. Input formats:
    • DNA sequence: genomic sequences from the GRCh38/hg38 reference genome, as a FASTA file (hg38.fa).
    • Epigenetic signals: preprocessed ATAC-seq/ChIP-seq tracks in .bw (BigWig) format.
    • Hi-C data: normalized and zoomified contact matrices in .mcool format.
  2. Data preprocessing: see Paper: Link
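
Before training, it can help to check that each input has the file type listed above. The snippet below is a minimal sketch and is not part of HiCGen: the function name `check_inputs` and the `{kind: path}` mapping are assumptions made for illustration, and only the three file suffixes come from this README.

```python
from pathlib import Path

# Expected suffix per input kind, taken from the formats listed above.
# The keys ("sequence", "epigenetic", "hic") are hypothetical labels.
EXPECTED_SUFFIX = {
    "sequence": ".fa",
    "epigenetic": ".bw",
    "hic": ".mcool",
}


def check_inputs(paths):
    """Return a list of problems found in a {kind: path} mapping."""
    problems = []
    for kind, suffix in EXPECTED_SUFFIX.items():
        if kind not in paths:
            problems.append(f"missing entry: {kind}")
            continue
        path = Path(paths[kind])
        if path.suffix != suffix:
            problems.append(f"{kind}: expected a {suffix} file, got {path.name}")
        elif not path.is_file():
            problems.append(f"{kind}: file not found: {path}")
    return problems
```

An empty list means every input is present and has the expected suffix. The check does not open the files, so it cannot confirm that a .mcool file is actually normalized or zoomified.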

Training process

  • HiCGen supports a command-line interface for training and inference. To train on a new cell type, run the following command in a terminal:
    python train.py --celltype IMR90 --fold fold1 --pred-mode SwinT4M
  • Here, the --celltype parameter specifies the filename that contains the genomic features and contact maps of the training cell type. The --fold parameter selects the training/validation/test sets defined in the fold.txt file. Two values of --pred-mode are currently supported: SwinT4M and SwinT32M. SwinT32M must be trained starting from the checkpoints of a pre-trained SwinT4M model.
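
When training over several folds, the command line can be assembled programmatically. The following is a minimal sketch using only the three flags shown above. The helper name `build_train_command` is hypothetical, and any fold name other than `fold1` is an assumption about what fold.txt contains. How a SwinT4M checkpoint is passed to SwinT32M training is not documented here, so the sketch does not attempt it.

```python
import shlex

# The two prediction modes named in this README.
PRED_MODES = ("SwinT4M", "SwinT32M")


def build_train_command(celltype, fold, pred_mode):
    """Build the argument list for one train.py run (hypothetical helper)."""
    if pred_mode not in PRED_MODES:
        raise ValueError(f"unsupported --pred-mode: {pred_mode!r}")
    return [
        "python", "train.py",
        "--celltype", celltype,
        "--fold", fold,
        "--pred-mode", pred_mode,
    ]


if __name__ == "__main__":
    # "fold2" is an assumed fold name used only for illustration.
    for fold in ("fold1", "fold2"):
        print(shlex.join(build_train_command("IMR90", fold, "SwinT4M")))
```

The argument list can be passed directly to `subprocess.run(cmd, check=True)` to launch each run in turn.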

Prediction

  • To run a prediction, execute the following command in a terminal:
    python prediction.py --celltype IMR90 --chr chr15 --pos 59100000 --res 1024 --model checkpoints/models/tmp.ckpt
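
To predict at several loci with the same checkpoint, the command above can be generated once per position. This is a sketch under stated assumptions: `predict_commands` is a hypothetical helper, the second position in the usage example is made up, and the flag values are passed through unchanged because this README does not define the units of --pos or --res.

```python
import shlex


def predict_commands(celltype, chrom, positions, res, model):
    """Return one prediction.py argument list per position (hypothetical helper)."""
    commands = []
    for pos in positions:
        if pos < 0:
            raise ValueError(f"position must be non-negative, got {pos}")
        commands.append([
            "python", "prediction.py",
            "--celltype", celltype,
            "--chr", chrom,
            "--pos", str(pos),
            "--res", str(res),
            "--model", model,
        ])
    return commands


if __name__ == "__main__":
    # 59100000 comes from the example above; 61100000 is an arbitrary second locus.
    for cmd in predict_commands("IMR90", "chr15", [59100000, 61100000], 1024,
                                "checkpoints/models/tmp.ckpt"):
        print(shlex.join(cmd))
```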
