Understanding complex interactions directly from data is crucial across many disciplines. Many-body interactions shape physics, biology, neuroscience, and social systems, playing a key role in emergence, regulation, and coordination. Although generative models excel at capturing high-order correlations, extracting interpretable insights from them remains challenging. Here, we tackle this problem for generic categorical energy-based generative models and introduce an efficient algorithm that extracts effective higher-order couplings from Restricted Boltzmann Machines (RBMs) at an affordable computational cost.
Fig. 1. Pipeline of rbmDCA. After training the neural network (e.g., an RBM) (b) on data (e.g., an MSA) (a), we map the trained model (b) onto a Potts-like model (c). The parameters of (c) can be used to predict epistatic contacts in the tertiary structure of the protein. Panel (d) shows the contact prediction obtained for the Response Regulator Receiver Domain family (Pfam entry: PF00072), where light-gray dots are the contacts of the protein, red dots are true positives, and green dots are false positives. The prediction obtained with our RBM-based inference (rbmDCA) is shown in the upper-left part of the matrix, and the prediction obtained with the well-established pseudo-likelihood inference (plmDCA) is shown in the lower-right part. This repository shows how to go from a trained model (b) to the contact prediction (d).
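As an illustration of the last step of the pipeline, (c) to (d), the sketch below turns a tensor of pairwise Potts couplings into contact scores using the Frobenius norm in the zero-sum gauge followed by the average product correction (APC), the post-processing commonly used in DCA [3]. It is a minimal sketch and not necessarily the scoring implemented in this repository: the function names `contact_scores` and `top_pairs`, the tensor layout `(L, L, q, q)`, and the choice to keep all `q` states (including any gap state) in the norm are assumptions made for the example.

```python
import numpy as np

def contact_scores(J):
    """Return an (L, L) matrix of APC-corrected Frobenius scores.

    J is assumed to have shape (L, L, q, q), with J[i, j, a, b] the coupling
    between state a at site i and state b at site j.
    """
    L = J.shape[0]
    # Move to the zero-sum gauge by double-centering over the state indices.
    J0 = (J
          - J.mean(axis=2, keepdims=True)
          - J.mean(axis=3, keepdims=True)
          + J.mean(axis=(2, 3), keepdims=True))
    # Frobenius norm of each (q, q) coupling block.
    F = np.sqrt((J0 ** 2).sum(axis=(2, 3)))
    diag = np.arange(L)
    F[diag, diag] = 0.0
    # Average product correction: subtract the expected background signal.
    row_mean = F.sum(axis=1) / (L - 1)
    total_mean = F.sum() / (L * (L - 1))
    S = F - np.outer(row_mean, row_mean) / total_mean
    S[diag, diag] = 0.0
    return S

def top_pairs(S, k, min_sep=5):
    """Return the k highest-scoring pairs (i, j) with j - i >= min_sep."""
    i, j = np.triu_indices(S.shape[0], k=min_sep)
    order = np.argsort(S[i, j])[::-1][:k]
    return [(int(i[n]), int(j[n])) for n in order]
```

Comparing the pairs returned by `top_pairs` with the contacts in the structural data gives the true and false positives of the kind shown in panel (d).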
- couplings_inference: Detailed presentation and implementation of the Python functions to compute effective couplings. This implementation requires the PyTorch library.
-
- PF00072.fasta: Multiple Sequence Alignment (MSA) data of the Response Regulator Receiver domain (PF00072) [2]. The original dataset can be found here.
- PF00072_struct.dat: Structural data for the Response Regulator Receiver domain (PF00072) [2]. The original dataset can be found here.
- PF00072_train=0.6.fasta, PF00072_test=0.4.fasta: training and test datasets used for RBM training. Both were derived from the original PF00072.fasta dataset.
- plmDCA_score_PF00072_train=0.6.txt: Contact prediction scores obtained with plmDCA [3,4], used as a baseline for comparison with our results. These scores were computed using this repository.
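For readers who want to build a comparable 60/40 split from their own alignment, the sketch below reads a FASTA file and splits its records at random. This is only an assumption about how such a split could be made: the procedure actually used to produce the train and test files above is not described here, and the helper names `read_fasta`, `split_records`, and `write_fasta` are hypothetical.

```python
import random

def read_fasta(path):
    """Read a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_records(records, train_fraction=0.6, seed=0):
    """Shuffle the records reproducibly and split them into train and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

def write_fasta(path, records):
    """Write (header, sequence) tuples to a FASTA file, one line per sequence."""
    with open(path, "w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n{sequence}\n")
```

A uniform random split ignores sequence similarity between the two sets. Closely related homologs can therefore end up on both sides of the split, which may matter when the test set is used to assess generalization.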
Datasets of the inverse Blume-Capel problem used to benchmark our RBM training in [1]:
- 1D_Blume_nsamples=100000_L=51_beta=0.2_J3=1.0_J2=1.0_h=0.0.h5,
- 1D_Blume_nsamples=100000_L=51_beta=0.2_J3=2.0_J2=1.0_h=0.0.h5,
- 1D_Blume_nsamples=100000_L=51_beta=0.2_J3=3.0_J2=1.0_h=0.0.h5.
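The file names above encode the simulation parameters as `key=value` fields separated by underscores. A small helper like the one below can recover them when looping over several datasets. The function name `parse_blume_filename` is hypothetical, and the sketch only parses the file name; it makes no assumption about the internal layout of the `.h5` files.

```python
import re

def parse_blume_filename(name):
    """Extract the key=value fields of a dataset file name into a dict.

    Integer-looking values become int; all other values become float.
    """
    stem = name[:-3] if name.endswith(".h5") else name
    params = {}
    for key, value in re.findall(r"([A-Za-z][A-Za-z0-9]*)=([^_]+)", stem):
        params[key] = int(value) if value.isdigit() else float(value)
    return params

params = parse_blume_filename(
    "1D_Blume_nsamples=100000_L=51_beta=0.2_J3=2.0_J2=1.0_h=0.0.h5"
)
print(params)
```

The three files listed differ only in the `J3` field, so the parsed dictionary can be used to label results by that value.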
- models: Trained RBM models used as examples.
1. Decelle, A., Navas Gómez, A. J., & Seoane, B. (2025). Inferring Higher-Order Couplings with Neural Networks. Physical Review Letters, 135, 207301.
2. Trinquier, J., Uguzzoni, G., Pagnani, A., Zamponi, F., & Weigt, M. (2021). Efficient generative modeling of protein sequences using simple autoregressive models. Nature Communications, 12(1), 5800.
3. Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E, 87(1), 012707.
4. Ekeberg, M., Hartonen, T., & Aurell, E. (2014). Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences. Journal of Computational Physics, 276, 341-356.
