This repository systematically evaluates CFP-GEN (Combinatorial Functional Protein Generation via Diffusion Language Model) on long protein sequence generation (>1,000 amino acids).
While CFP-GEN achieves state-of-the-art results on standard protein generation benchmarks, its ability to generalize to long multi-domain proteins remains unexplored. This project investigates:
- Does multi-constraint conditioning help long-sequence generation?
- Does performance degrade as sequence length increases?
- How does CFP-GEN compare to its unconditional backbone (DPLM)?
This work is built directly on top of the official CFP-GEN implementation.
- 📄 CFP-GEN Paper: https://arxiv.org/pdf/2505.22869
- 💻 CFP-GEN Repository: https://github.com/yinjunbo/cfpgen
- 📝 My term paper: ./asset/term_paper.pdf
Protein language models (PLMs) have achieved remarkable progress in discriminative and generative protein modeling, especially under biologically meaningful constraints such as InterPro (IPR) domains and Gene Ontology (GO) terms. Multi-constraint diffusion-based frameworks such as CFP-GEN demonstrate state-of-the-art performance on standard benchmarks. However, their ability to generalize to long protein sequence generation (>1,000 residues) remains largely unexplored, despite the biological relevance of large multi-domain proteins. In this work, I systematically evaluate CFP-GEN on long-sequence conditional generation using a curated Swiss-Prot dataset (1,000–2,000 residues). I compare it against its unconditional backbone (DPLM) under controlled IPR constraints.
Evaluation using Mean Reciprocal Rank (MRR) and Maximum Mean Discrepancy (MMD) shows that CFP-GEN underperforms the unconditional baseline in distributional alignment for long sequences (see Table 1). Moreover, both models degrade as sequence length increases (see Figure 1). These findings point to long-context generation as an open challenge for language models more broadly.
| Model | MRR ↑ | MMD ↓ |
|---|---|---|
| DPLM (unconditional) | 0.007 | 0.324 |
| CFP-GEN | 0.007 | 0.466 |
Table 1: MRR and MMD for the baseline and CFP-GEN model.
Figure 1: Scatter plot distribution of CFP-GEN and DPLM across sequence lengths.
```bash
# clone project
git clone --recursive https://github.com/KhondokerIslam/ml4lp.git

# create conda virtual environment
env_name=cfpgen
conda create -n ${env_name} python=3.9 pip
conda activate ${env_name}

# alternative: install dependencies directly
pip install -r requirements.txt
```

CFP-GEN General Dataset (Paper Dataset): The processed dataset `cfpgen_general_dataset` can be downloaded from Google Drive and placed in the directory `dataset/general_dataset_go_ipr/cfpgen_general_dataset/`. It contains 103,939 proteins annotated with 375 GO terms and 1,154 IPR domains.
GO/IPR Mapping: CFP-GEN only supports generation conditioned on ground-truth labels from natural proteins (e.g., from Swiss-Prot).
The GO/IPR mapping info can be downloaded here. Please place it at `dataset/ipr_mapping.json`.
Long Sequence Dataset (This Work): We evaluate on proteins between 1,000 and 2,000 amino acids. You can download long Swiss-Prot sequences from UniProt. Place the downloaded `.fasta` and `.tsv` files inside `dataset/`. For reproducibility, the dataset used in this study is already included: `dataset/uniprotkb_AND_reviewed_true_AND_model_o_2026_02_18.fasta` and `dataset/uniprotkb_AND_reviewed_true_AND_model_o_2026_02_18.tsv`.
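The length filter behind this curation step can be sketched as follows. This is an illustrative, self-contained helper, not the repository's actual script: the function names and the minimal FASTA parser are assumptions; only the 1,000–2,000 residue bounds come from the text above.

```python
# Minimal sketch of curating long sequences (illustrative, not the repo's code).

def parse_fasta(text):
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def filter_by_length(records, min_len=1000, max_len=2000):
    """Keep sequences whose length falls within [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]

if __name__ == "__main__":
    demo = ">sp|P1|SHORT\n" + "A" * 50 + "\n>sp|P2|LONG\n" + "M" * 1500 + "\n"
    kept = filter_by_length(parse_fasta(demo))
    print([h for h, _ in kept])  # only the 1,500-residue record survives
```

In practice the repository reads the included `.fasta`/`.tsv` pair; this sketch only shows the length criterion applied during curation.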
Generate Final Evaluation Set: Run `generate_test_suit.py`. This will generate `data-bin/uniprotKB/cfpgen_general_dataset/experiment.pkl`, which is required to run this experiment.
- `cfpgen-650m`: Supports conditioning on GO terms, IPR domains, and sequence motifs (e.g., 10–30 residue fragments) defined by the general protein dataset. This model can be readily used for functional protein generation.
- `dplm-650m`: The base pretrained model from DPLM; it must be placed under `cfpgen/pretrained/`.
Users can modify the necessary parameters (e.g., `ckpt_path=<path_to_cfpgen-650m>`) in the config file `configs/test_cfpgen.yaml`, then run the following command to start generation:

```bash
python cfpgen_generate.py
```

The results will be saved in `./generation-results`.
To facilitate comparison, we also provide proteins generated by DPLM without any conditional input, as a reference baseline:

```bash
python dplm_generate.py
```

The resulting FASTA files will be saved in `generation-results/dplm_650m/`.
We evaluate only at the sequence level (i.e., distributional statistics), focusing in particular on IPR domains.
The following command computes Maximum Mean Discrepancy (MMD) and Mean Reciprocal Rank (MRR) between the generated and real sequences:

```bash
python eval_mmd.py <ipr> <fasta_filename> <gt_data>
```

Here, `<fasta_filename>` is the output FASTA file obtained by the previous generation commands, and `<gt_data>` refers to the ground-truth data file (e.g., `data-bin/uniprotKB/cfpgen_general_dataset/test.pkl`).
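For intuition, the two metrics can be sketched on toy inputs. This is not the repository's `eval_mmd.py` (which operates on IPR-domain features extracted from sequences); the function names, the RBF kernel choice, and `gamma` are illustrative assumptions.

```python
# Illustrative sketches of the two evaluation metrics (not the repo's eval_mmd.py).
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between two samples of feature vectors."""
    k = lambda A, B: sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return k(X, X) + k(Y, Y) - 2.0 * k(X, Y)

def mrr(ranks):
    """Mean reciprocal rank from 1-based ranks of the first relevant hit."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

Identical samples give an MMD² of exactly 0, and it grows as the generated and real feature distributions diverge; MRR is simply the average of 1/rank, so lower ranks (better retrieval of the target IPR label) push it toward 1.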
To be added soon.
