This repository systematically evaluates CFP-GEN (Combinatorial Functional Protein Generation via Diffusion Language Model) on long protein sequence generation (>1,000 amino acids).
While CFP-GEN achieves state-of-the-art results on standard protein generation benchmarks, its ability to generalize to long multi-domain proteins remains unexplored. This project investigates:
- Does multi-constraint conditioning help long-sequence generation?
- Does performance degrade as sequence length increases?
- How does CFP-GEN compare to its unconditional backbone (DPLM)?
This work is built directly on top of the official CFP-GEN implementation.
- 📄 CFP-GEN Paper: https://arxiv.org/pdf/2505.22869
- 💻 CFP-GEN Repository: https://github.com/yinjunbo/cfpgen
- 📝 My term paper: ./asset/term_paper.pdf
Protein language models (PLMs) have achieved remarkable progress in discriminative and generative protein modeling, especially under biologically meaningful constraints such as InterPro (IPR) domains and Gene Ontology (GO) terms. Multi-constraint diffusion-based frameworks such as CFP-GEN demonstrate state-of-the-art performance on standard benchmarks. However, their ability to generalize to long protein sequence generation (>1,000 residues) remains largely unexplored, despite the biological relevance of large multi-domain proteins. In this work, I systematically evaluate CFP-GEN on long-sequence conditional generation using a curated Swiss-Prot dataset (1,000–2,000 residues). I compare it against its unconditional backbone (DPLM) under controlled IPR constraints.
Evaluation using Mean Reciprocal Rank (MRR) and Maximum Mean Discrepancy (MMD) shows that CFP-GEN underperforms the unconditional baseline in distributional alignment for long sequences (see Table 1). Moreover, both models degrade as sequence length increases (see Figure 1). These findings point to long-context generation as an open challenge for language models more broadly.
| Model | MRR ↑ | MMD ↓ |
|---|---|---|
| DPLM (unconditional) | 0.007 | 0.324 |
| CFP-GEN | 0.007 | 0.466 |
Table 1: MRR and MMD for the baseline and CFP-GEN model.
Figure 1: Scatter plot distribution of CFP-GEN and DPLM across sequence lengths.
```bash
# clone project
git clone --recursive https://github.com/KhondokerIslam/ml4lp.git

# create conda virtual environment
env_name=cfpgen
conda create -n ${env_name} python=3.9 pip
conda activate ${env_name}

# alternative: install dependencies directly
pip install -r requirements.txt
```

CFP-GEN General Dataset (Paper Dataset): The processed dataset `cfpgen_general_dataset` can be downloaded from Google Drive and placed in the directory `dataset/general_dataset_go_ipr/cfpgen_general_dataset/`. It contains 103,939 proteins annotated with 375 GO terms and 1,154 IPR domains.
GO/IPR Mapping: CFP-GEN only supports generation conditioned on ground-truth labels from natural proteins (e.g., from Swiss-Prot).
The GO/IPR mapping info can be downloaded here. Please place it at `dataset/ipr_mapping.json`.
Long Sequence Dataset (This Work): We evaluate on proteins between 1,000 and 2,000 amino acids. You can download long Swiss-Prot sequences from UniProt. Place the downloaded `.fasta` and `.tsv` files inside `dataset/`. For reproducibility, the dataset used in this study is already included: `dataset/uniprotkb_AND_reviewed_true_AND_model_o_2026_02_18.fasta` and `dataset/uniprotkb_AND_reviewed_true_AND_model_o_2026_02_18.tsv`.
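The length filter behind this curation step can be sketched as follows. This is an illustrative, self-contained helper, not the repository's actual script: the function names and the minimal FASTA parser are assumptions; only the 1,000–2,000 residue bounds come from the text above.

```python
# Minimal sketch of curating long sequences (illustrative, not the repo's code).

def parse_fasta(text):
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def filter_by_length(records, min_len=1000, max_len=2000):
    """Keep sequences whose length falls within [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]

if __name__ == "__main__":
    demo = ">sp|P1|SHORT\n" + "A" * 50 + "\n>sp|P2|LONG\n" + "M" * 1500 + "\n"
    kept = filter_by_length(parse_fasta(demo))
    print([h for h, _ in kept])  # only the 1,500-residue record survives
```

In practice the repository reads the included `.fasta`/`.tsv` pair; this sketch only shows the length criterion applied during curation.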
Generate Final Evaluation Set: Run `generate_test_suit.py`. This will generate `data-bin/uniprotKB/cfpgen_general_dataset/experiment.pkl`, which is required to run this experiment.
- `cfpgen-650m`: Supports conditioning on GO terms, IPR domains, and sequence motifs (e.g., 10–30 residue fragments) defined by the general protein dataset. This model can be readily used for functional protein generation.
- `dplm-650m`: The base pretrained model from DPLM; it must be placed under `cfpgen/pretrained/`.
Users can modify the necessary parameters (e.g., `ckpt_path=<path_to_cfpgen-650m>`) in the config file `configs/test_cfpgen.yaml`, then run the following command to start generation:

```bash
python cfpgen_generate.py
```

The results will be saved in `./generation-results`.
To facilitate comparison, we also provide proteins generated by DPLM without any conditional input, as a reference baseline:

```bash
python dplm_generate.py
```

The resulting FASTA files will be saved in `generation-results/dplm_650m/`.
We evaluate only at the sequence level (i.e., distributional statistics), focusing in particular on IPR domains.
The following command computes Maximum Mean Discrepancy (MMD) and Mean Reciprocal Rank (MRR) between the generated and real sequences:

```bash
python eval_mmd.py <ipr> <fasta_filename> <gt_data>
```

Here, `<fasta_filename>` is the output FASTA file obtained by the previous generation commands, and `<gt_data>` refers to the ground-truth data file (e.g., `data-bin/uniprotKB/cfpgen_general_dataset/test.pkl`).
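For intuition, the two metrics can be sketched on toy inputs. This is not the repository's `eval_mmd.py` (which operates on IPR-domain features extracted from sequences); the function names, the RBF kernel choice, and `gamma` are illustrative assumptions.

```python
# Illustrative sketches of the two evaluation metrics (not the repo's eval_mmd.py).
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between two samples of feature vectors."""
    k = lambda A, B: sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return k(X, X) + k(Y, Y) - 2.0 * k(X, Y)

def mrr(ranks):
    """Mean reciprocal rank from 1-based ranks of the first relevant hit."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

Identical samples give an MMD² of exactly 0, and it grows as the generated and real feature distributions diverge; MRR is simply the average of 1/rank, so lower ranks (better retrieval of the target IPR label) push it toward 1.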
To be added soon.
