5. Demos
Caution: All demos with checkpoints have been uploaded to https://drive.google.com/drive/folders/1aQRU69PzXm36bPfIy-0W0HKS8TNlbrja.
The evolution, evolvability and engineering of gene regulatory DNA
In the work of Vaishnav et al., all input sequences are 110 bp long, with "N" used for padding at the beginning. The original researchers applied their model to Random_test_tpu_model.csv and obtained relatively good performance in Figure 1b, with a Pearson coefficient of 0.97.
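For reference, the fixed 110 bp input format can be reproduced with a small preprocessing step. The sketch below uses plain numpy rather than the gpro API; `pad_and_one_hot` is a hypothetical helper, and mapping "N" to an all-zero column is an assumption, one common convention for padded bases.

```python
import numpy as np

# Hypothetical helper (not part of the gpro API): left-pad with "N" to a fixed
# length and one-hot encode A/C/G/T; "N" is mapped to an all-zero column here.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pad_and_one_hot(seq: str, length: int = 110) -> np.ndarray:
    seq = ("N" * max(0, length - len(seq)) + seq.upper())[:length]
    mat = np.zeros((4, length), dtype=np.float32)
    for i, base in enumerate(seq):
        if base in BASES:
            mat[BASES[base], i] = 1.0
    return mat

x = pad_and_one_hot("ACGT" * 20)  # an 80 bp sequence padded to 110 bp
print(x.shape)                    # (4, 110)
```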
Here, we replicated their work using the AttnBiLSTM class and followed the same data processing, keeping the learning rate, loss function, and other settings as consistent as possible. Because the original dataset is very large, we randomly selected 100,000 sequences for training and validation. Finally, we applied our model to Random_test_tpu_model.csv. Even with this reduced training set, our predictions still reach a correlation of 0.8 with the original results, indicating that the replication was successful. Datasets are obtained from https://codeocean.com/capsule/8020974/tree/v1.
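The 0.8 reported above is an ordinary Pearson coefficient between our predictions and the original ones. A minimal evaluation sketch follows; the column names are placeholders, not the actual schema of Random_test_tpu_model.csv.

```python
import pandas as pd
from scipy.stats import pearsonr

# Column names below are placeholders; the real schema of the csv may differ.
df = pd.read_csv("Random_test_tpu_model.csv")
r, p = pearsonr(df["original_prediction"], df["our_prediction"])
print(f"Pearson r = {r:.3f} (p = {p:.2e})")
```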
After using the gpro package to quickly replicate the predictor, we optimized the central 80 bp of the promoter sequence following the procedure specified in the paper. Here, we show the replication results of the SSWM algorithm, which are consistent with Figure 2f of the original paper.
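Conceptually, SSWM is a hill climb that applies one random point mutation at a time and accepts it only if the predicted expression improves. The sketch below is a generic illustration under that assumption: `predict` is a placeholder for any single-sequence expression predictor, and the [15, 95) window assumes the 80 bp core sits in the middle of a 110 bp sequence.

```python
import random

BASES = "ACGT"

def sswm_optimize(seq, predict, start=15, end=95, n_steps=1000):
    """Strong-selection weak-mutation (SSWM) hill climbing on the central 80 bp.

    `predict` is a placeholder for any single-sequence expression predictor,
    e.g. a trained AttnBiLSTM wrapped to take a string and return a float.
    """
    best, best_score = seq, predict(seq)
    for _ in range(n_steps):
        pos = random.randrange(start, end)                        # mutate only the core window
        new_base = random.choice([b for b in BASES if b != best[pos]])
        cand = best[:pos] + new_base + best[pos + 1:]
        score = predict(cand)
        if score > best_score:                                     # keep only improving mutations
            best, best_score = cand, score
    return best, best_score
```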
All code and results needed for replication can be found in ./demo/demo1/.
Synthetic promoter design in Escherichia coli based on a deep generative network
In the work of Wang et al., all input sequences are 50 bp long. Considering the instability of the WGAN model, the researchers selected the generator with the best performance, i.e., the one with the best k-mer similarity to the original dataset. We have reproduced Figure S3 (k-mer estimation) of the original paper.
All code needed is encapsulated in the previously mentioned WGAN model. Here we select the relatively good results from the models obtained over 12 epochs. An acceptable 6-mer frequency result is shown in the figure below, consistent with the original distribution. Datasets are obtained from sequence_data.txt (all 14,000+ sequences are used for WGAN generation), available at https://github.com/HaochenW/Deep_promoter/tree/master.
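The k-mer comparison itself can be sketched with a simple frequency count. The snippet below is a generic illustration rather than the gpro implementation, and "wgan_samples.txt" is a hypothetical file holding the generator outputs.

```python
from collections import Counter
from itertools import product

import numpy as np
from scipy.stats import pearsonr

def kmer_freq(seqs, k=6):
    """Normalized k-mer frequency vector over a list of DNA sequences."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for s in seqs:
        s = s.strip().upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values()) or 1
    return np.array([counts[m] / total for m in kmers])

# sequence_data.txt holds the natural promoters; "wgan_samples.txt" is a
# hypothetical file of WGAN-generated sequences.
natural = [l.strip() for l in open("sequence_data.txt") if l.strip()]
generated = [l.strip() for l in open("wgan_samples.txt") if l.strip()]
r, _ = pearsonr(kmer_freq(natural), kmer_freq(generated))
print(f"6-mer frequency Pearson r = {r:.3f}")
```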
All code needed for replication can be found in ./demo/demo2/script.py.
Controlling gene expression with deep generative design of regulatory DNA
In this work, we reproduced the gradient optimization method of the paper, using gradient descent to search the candidate sequence space until the predicted expression converges. In place of the lengthy training and prediction stages of the original paper, we used the built-in model of the gpro package for one-stop generation.
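As an illustration of the idea, the sketch below performs gradient ascent on a relaxed (softmax) sequence profile and discretizes the result at the end. `predictor` is a placeholder for any differentiable expression model taking a (batch, 4, L) tensor; the actual demo relies on the built-in gpro model rather than this generic sketch.

```python
import torch
import torch.nn.functional as F

def gradient_design(predictor, seq_onehot, steps=200, lr=0.1):
    """Gradient ascent in a relaxed (continuous) sequence space.

    `predictor` stands in for any differentiable expression model mapping a
    (batch, 4, L) tensor to a predicted expression value.
    """
    logits = torch.log(seq_onehot.clamp(min=1e-6)).clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        probs = F.softmax(logits, dim=0)           # relaxed one-hot profile, shape (4, L)
        pred = predictor(probs.unsqueeze(0)).mean()
        opt.zero_grad()
        (-pred).backward()                         # maximize predicted expression
        opt.step()
    with torch.no_grad():
        final = F.softmax(logits, dim=0).argmax(dim=0)
    return F.one_hot(final, num_classes=4).T.float()   # discretized (4, L) one-hot
```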
Datasets are obtained from the scerevisiae.rsd1.lmbda_22.1000.npz file at https://zenodo.org/record/6811226. We have reproduced the gradient optimization with our model and reproduced Figure 3b of the original paper.
We did not obtain the TPM data, so we take log2 rather than log10, given the 10-15-fold difference between the average predicted value and TPM.
The results are shown in the following figure:
All code needed for replication can be found in ./demo/demo3/script.py.
Feedback GAN for DNA optimizes protein functions
The design logic of Feedback GAN is to train generators that increasingly meet the model's expectations by continuously replacing the real data in the training set with generated data preferred by the model. We have reproduced the trend of the Levenshtein distance to natural data shown in Figure 4b (normalized edit distance).
Here, we reproduce the GRUClassifier of the original paper and use it as the classifier for computing the BCE loss. We also use the Diffusion model as an alternative to the GAN model. Datasets are obtained from https://github.com/av1659/fbgan/tree/master/data.
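The feedback step itself is simple to sketch: score a batch of generated sequences with the classifier, splice the highest-scoring ones into the training pool before the next update, and track the normalized edit distance to natural data. The snippet below is a schematic of that loop; `generator.sample` and `classifier.score` are placeholder interfaces, not the gpro or fbgan APIs.

```python
import torch
import Levenshtein  # pip install python-Levenshtein

def feedback_step(generator, classifier, pool, n_samples=256, n_keep=64):
    """One feedback iteration: generated sequences scored highest by the
    classifier replace part of the training pool.

    `generator.sample` and `classifier.score` are placeholder interfaces.
    """
    samples = generator.sample(n_samples)                    # list of sequences
    scores = torch.sigmoid(classifier.score(samples))        # logits -> probabilities
    ranked = sorted(zip(scores.tolist(), samples), reverse=True)
    keep = [s for _, s in ranked[:n_keep]]
    return keep + pool[:-n_keep]        # drop the last n_keep entries so the pool size stays constant

def normalized_edit_distance(generated, natural):
    """Mean Levenshtein distance to the nearest natural sequence, length-normalized."""
    dists = [min(Levenshtein.distance(g, n) for n in natural) / max(len(g), 1)
             for g in generated]
    return sum(dists) / len(dists)
```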
There are some differences in the results, but the trend is consistent:
All code needed for replication can be found in ./demo/demo4/script.py.
Deep flanking sequence engineering for efficient promoter design
This work has been accepted by Nature Communications but has not yet been published. We have reproduced Figure 2c by integrating the CGAN into the gpro package, allowing a quick functional reproduction.
The 4-mer frequency similarity between promoters generated by our replicated conditional model and the original data is 0.946, which is very close to the original results.
All code needed for replication can be found in ./demo/demo5/script.py.
Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo
This work has been published in Nature. We have reproduced Extended Data Fig. 1b and Extended Data Fig. 3b, integrating a regression model and a binary classifier for functional tissue-specific enhancer design. Because of the large scale of the original dataset (https://zenodo.org/records/8011697), we selected the epidermis tissue in fold9 for validation. We provide the processed epidermis data on Google Drive (https://drive.google.com/drive/folders/1yxt_4WNrK1WslvzFs05qll7djcGzKPYO) and all pretrained checkpoints in demo6.
Our Pearson coefficient for DNA accessibility is 0.69961, and the final PR-AUC curve for classification is shown below. This is close to the original results; in addition, it provides a practical PyTorch version of the DeepSTARR model.
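Both metrics are standard: a Pearson coefficient for the accessibility regression and the PR-AUC for the binary classifier. A minimal evaluation sketch follows; the .npy file names are hypothetical placeholders for the fold9 epidermis outputs.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical .npy files holding the held-out (fold9, epidermis) outputs of
# the regression head and the binary classifier.
acc_true = np.load("accessibility_true.npy")   # measured DNA accessibility
acc_pred = np.load("accessibility_pred.npy")   # regression predictions
cls_true = np.load("labels.npy")               # binary enhancer labels
cls_score = np.load("scores.npy")              # classifier probabilities

r, _ = pearsonr(acc_true, acc_pred)
precision, recall, _ = precision_recall_curve(cls_true, cls_score)
print(f"Pearson r = {r:.5f}, PR-AUC = {auc(recall, precision):.4f}")
```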
All code needed for replication can be found in ./demo/demo6/script.py.
[1] Vaishnav E D, de Boer C G, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA[J]. Nature, 2022, 603(7901): 455-463.
[2] Wang Y, et al. Synthetic promoter design in Escherichia coli based on a deep generative network[J]. Nucleic Acids Research, 2020, 48(12): 6403-6412. https://doi.org/10.1093/nar/gkaa325
[3] Zrimec J, Fu X, Muhammad A S, et al. Controlling gene expression with deep generative design of regulatory DNA[J]. Nature communications, 2022, 13(1): 5099.
[4] Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions[J]. Nature Machine Intelligence, 2019, 1(2): 105-111.
[5] Zhang P, Wang H, Xu H, et al. Deep flanking sequence engineering for efficient promoter design[J]. bioRxiv, 2023: 2023.04.14.536502.
[6] de Almeida B P, Reiter F, Pagani M, et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers[J]. Nature Genetics, 2022, 54(5): 613-624.
[7] de Almeida B P, Schaub C, Pagani M, et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo[J]. Nature, 2023: 1-2.