5. Demos
Caution: All demos with checkpoints have been uploaded to https://drive.google.com/drive/folders/1aQRU69PzXm36bPfIy-0W0HKS8TNlbrja.
The evolution, evolvability and engineering of gene regulatory DNA
In the work of Vaishnav et al., all input sequences are 110 bp long, with "N" used for padding at the beginning. The original researchers applied their model to Random_test_tpu_model.csv and obtained relatively good performance in Figure 1b, with a Pearson coefficient of 0.97.
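For reference, the fixed 110 bp input format can be reproduced with a small preprocessing step. The sketch below uses plain numpy rather than the gpro API; `pad_and_one_hot` is a hypothetical helper, and mapping "N" to an all-zero column is an assumption, one common convention for padded bases.

```python
import numpy as np

# Hypothetical helper (not part of the gpro API): left-pad with "N" to a fixed
# length and one-hot encode A/C/G/T; "N" is mapped to an all-zero column here.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pad_and_one_hot(seq: str, length: int = 110) -> np.ndarray:
    seq = ("N" * max(0, length - len(seq)) + seq.upper())[:length]
    mat = np.zeros((4, length), dtype=np.float32)
    for i, base in enumerate(seq):
        if base in BASES:
            mat[BASES[base], i] = 1.0
    return mat

x = pad_and_one_hot("ACGT" * 20)  # an 80 bp sequence padded to 110 bp
print(x.shape)                    # (4, 110)
```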
Here, we replicated their work using the AttnBiLSTM class and followed the same data processing, keeping the learning rate, loss function, and other settings as consistent as possible. Because the original dataset is very large, we randomly selected 100,000 sequences for training and validation. Finally, we applied our model to Random_test_tpu_model.csv. Even with this reduced training set, our predictions still reach a correlation of 0.8 with the original results, indicating that the replication was successful. Datasets are obtained from https://codeocean.com/capsule/8020974/tree/v1.
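The 0.8 reported above is an ordinary Pearson coefficient between our predictions and the original ones. A minimal evaluation sketch follows; the column names are placeholders, not the actual schema of Random_test_tpu_model.csv.

```python
import pandas as pd
from scipy.stats import pearsonr

# Column names below are placeholders; the real schema of the csv may differ.
df = pd.read_csv("Random_test_tpu_model.csv")
r, p = pearsonr(df["original_prediction"], df["our_prediction"])
print(f"Pearson r = {r:.3f} (p = {p:.2e})")
```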
After using the gpro package to quickly replicate the predictor, we optimized the central 80 bp of the promoter sequence following the procedure specified in the paper. Here, we show the replication results of the SSWM algorithm, which are consistent with Figure 2f of the original paper.
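Conceptually, SSWM is a hill climb that applies one random point mutation at a time and accepts it only if the predicted expression improves. The sketch below is a generic illustration under that assumption: `predict` is a placeholder for any single-sequence expression predictor, and the [15, 95) window assumes the 80 bp core sits in the middle of a 110 bp sequence.

```python
import random

BASES = "ACGT"

def sswm_optimize(seq, predict, start=15, end=95, n_steps=1000):
    """Strong-selection weak-mutation (SSWM) hill climbing on the central 80 bp.

    `predict` is a placeholder for any single-sequence expression predictor,
    e.g. a trained AttnBiLSTM wrapped to take a string and return a float.
    """
    best, best_score = seq, predict(seq)
    for _ in range(n_steps):
        pos = random.randrange(start, end)                        # mutate only the core window
        new_base = random.choice([b for b in BASES if b != best[pos]])
        cand = best[:pos] + new_base + best[pos + 1:]
        score = predict(cand)
        if score > best_score:                                     # keep only improving mutations
            best, best_score = cand, score
    return best, best_score
```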
All code and results needed for replication can be found in ./demo/demo1/.
Synthetic promoter design in Escherichia coli based on a deep generative network
In the work of Wang et al., all input sequences are 50 bp long. Considering the instability of the WGAN model, the researchers selected the generator with the best performance, i.e., the one with the best k-mer similarity to the original dataset. We have reproduced Figure S3 (k-mer estimation) of the original paper.
All code needed is encapsulated in the previously mentioned WGAN model. Here we select the relatively good results from the models obtained over 12 epochs. An acceptable 6-mer frequency result is shown in the figure below, consistent with the original distribution. Datasets are obtained from sequence_data.txt (all 14,000+ sequences are used for WGAN generation), available at https://github.com/HaochenW/Deep_promoter/tree/master.
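The k-mer comparison itself can be sketched with a simple frequency count. The snippet below is a generic illustration rather than the gpro implementation, and "wgan_samples.txt" is a hypothetical file holding the generator outputs.

```python
from collections import Counter
from itertools import product

import numpy as np
from scipy.stats import pearsonr

def kmer_freq(seqs, k=6):
    """Normalized k-mer frequency vector over a list of DNA sequences."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for s in seqs:
        s = s.strip().upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values()) or 1
    return np.array([counts[m] / total for m in kmers])

# sequence_data.txt holds the natural promoters; "wgan_samples.txt" is a
# hypothetical file of WGAN-generated sequences.
natural = [l.strip() for l in open("sequence_data.txt") if l.strip()]
generated = [l.strip() for l in open("wgan_samples.txt") if l.strip()]
r, _ = pearsonr(kmer_freq(natural), kmer_freq(generated))
print(f"6-mer frequency Pearson r = {r:.3f}")
```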
All code needed for replication can be found in ./demo/demo2/script.py.
Controlling gene expression with deep generative design of regulatory DNA
In this work, we reproduced the gradient optimization method of the paper, using gradient descent to search the candidate sequence space until the predicted expression converges. In place of the lengthy training and prediction stages of the original paper, we used the built-in model of the gpro package for one-stop generation.
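As an illustration of the idea, the sketch below performs gradient ascent on a relaxed (softmax) sequence profile and discretizes the result at the end. `predictor` is a placeholder for any differentiable expression model taking a (batch, 4, L) tensor; the actual demo relies on the built-in gpro model rather than this generic sketch.

```python
import torch
import torch.nn.functional as F

def gradient_design(predictor, seq_onehot, steps=200, lr=0.1):
    """Gradient ascent in a relaxed (continuous) sequence space.

    `predictor` stands in for any differentiable expression model mapping a
    (batch, 4, L) tensor to a predicted expression value.
    """
    logits = torch.log(seq_onehot.clamp(min=1e-6)).clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        probs = F.softmax(logits, dim=0)           # relaxed one-hot profile, shape (4, L)
        pred = predictor(probs.unsqueeze(0)).mean()
        opt.zero_grad()
        (-pred).backward()                         # maximize predicted expression
        opt.step()
    with torch.no_grad():
        final = F.softmax(logits, dim=0).argmax(dim=0)
    return F.one_hot(final, num_classes=4).T.float()   # discretized (4, L) one-hot
```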
Datasets are obtained from the scerevisiae.rsd1.lmbda_22.1000.npz file at https://zenodo.org/record/6811226. We have reproduced the gradient optimization with our model and reproduced Figure 3b of the original paper.
We did not obtain the TPM data, so we take log2 rather than log10, given the 10-15-fold difference between the average predicted value and TPM.
The results are shown in the following figure:
All code needed for replication can be found in ./demo/demo3/script.py.
Feedback GAN for DNA optimizes protein functions
The design logic of Feedback GAN is to train generators that increasingly meet the model's expectations by continuously replacing the real data in the training set with generated data preferred by the model. We have reproduced the trend of the Levenshtein distance to natural data shown in Figure 4b (normalized edit distance).
Here, we reproduce the GRUClassifier of the original paper and use it as the classifier for computing the BCE loss. We also use the Diffusion model as an alternative to the GAN model. Datasets are obtained from https://github.com/av1659/fbgan/tree/master/data.
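The feedback step itself is simple to sketch: score a batch of generated sequences with the classifier, splice the highest-scoring ones into the training pool before the next update, and track the normalized edit distance to natural data. The snippet below is a schematic of that loop; `generator.sample` and `classifier.score` are placeholder interfaces, not the gpro or fbgan APIs.

```python
import torch
import Levenshtein  # pip install python-Levenshtein

def feedback_step(generator, classifier, pool, n_samples=256, n_keep=64):
    """One feedback iteration: generated sequences scored highest by the
    classifier replace part of the training pool.

    `generator.sample` and `classifier.score` are placeholder interfaces.
    """
    samples = generator.sample(n_samples)                    # list of sequences
    scores = torch.sigmoid(classifier.score(samples))        # logits -> probabilities
    ranked = sorted(zip(scores.tolist(), samples), reverse=True)
    keep = [s for _, s in ranked[:n_keep]]
    return keep + pool[:-n_keep]        # drop the last n_keep entries so the pool size stays constant

def normalized_edit_distance(generated, natural):
    """Mean Levenshtein distance to the nearest natural sequence, length-normalized."""
    dists = [min(Levenshtein.distance(g, n) for n in natural) / max(len(g), 1)
             for g in generated]
    return sum(dists) / len(dists)
```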
There are some differences in the results, but the trend is consistent:
All code needed for replication can be found in ./demo/demo4/script.py.
Deep flanking sequence engineering for efficient promoter design
This work has been accepted by Nature Communications but has not yet been published. We have reproduced Figure 2c by integrating the CGAN into the gpro package, allowing a quick functional reproduction.
The 4-mer frequency similarity between promoters generated by our replicated conditional model and the original data is 0.946, which is very close to the original results.
All code needed for replication can be found in ./demo/demo5/script.py.
Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo
This work has been published in Nature. We have reproduced Extended Data Fig. 1b and Extended Data Fig. 3b, integrating a regression model and a binary classifier for functional tissue-specific enhancer design. Because of the large scale of the original dataset (https://zenodo.org/records/8011697), we selected the epidermis tissue in fold9 for validation. We provide the processed epidermis data on Google Drive (https://drive.google.com/drive/folders/1yxt_4WNrK1WslvzFs05qll7djcGzKPYO) and all pretrained checkpoints in demo6.
Our Pearson coefficient for DNA accessibility is 0.69961, and the final PR-AUC curve for classification is shown below. This is close to the original results; in addition, it provides a practical PyTorch version of the DeepSTARR model.
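Both metrics are standard: a Pearson coefficient for the accessibility regression and the PR-AUC for the binary classifier. A minimal evaluation sketch follows; the .npy file names are hypothetical placeholders for the fold9 epidermis outputs.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical .npy files holding the held-out (fold9, epidermis) outputs of
# the regression head and the binary classifier.
acc_true = np.load("accessibility_true.npy")   # measured DNA accessibility
acc_pred = np.load("accessibility_pred.npy")   # regression predictions
cls_true = np.load("labels.npy")               # binary enhancer labels
cls_score = np.load("scores.npy")              # classifier probabilities

r, _ = pearsonr(acc_true, acc_pred)
precision, recall, _ = precision_recall_curve(cls_true, cls_score)
print(f"Pearson r = {r:.5f}, PR-AUC = {auc(recall, precision):.4f}")
```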
All code needed for replication can be found in ./demo/demo6/script.py.
[1] Vaishnav E D, de Boer C G, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA[J]. Nature, 2022, 603(7901): 455-463.
[2] Wang Y, et al. Synthetic promoter design in Escherichia coli based on a deep generative network[J]. Nucleic Acids Research, 2020, 48(12): 6403-6412. https://doi.org/10.1093/nar/gkaa325
[3] Zrimec J, Fu X, Muhammad A S, et al. Controlling gene expression with deep generative design of regulatory DNA[J]. Nature communications, 2022, 13(1): 5099.
[4] Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions[J]. Nature Machine Intelligence, 2019, 1(2): 105-111.
[5] Zhang P, Wang H, Xu H, et al. Deep flanking sequence engineering for efficient promoter design[J]. bioRxiv, 2023: 2023.04.14.536502.
[6] de Almeida B P, Reiter F, Pagani M, et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers[J]. Nature Genetics, 2022, 54(5): 613-624.
[7] de Almeida B P, Schaub C, Pagani M, et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo[J]. Nature, 2023: 1-2.