Hi! Thank you for developing and maintaining this fantastic benchmark! I have been working on reproducing the ProteinGym zero-shot substitution benchmark and noticed a couple of things I'd appreciate clarification on.
1. ESM-1b Scoring Strategy
The paper (Section A.2, p32) states:
"We predict fitness for ESM models with the masked-marginal approach introduced in Meier et al. [2021]"
This implies all ESM models (ESM-1b, ESM-1v, ESM-2) use masked-marginals. However:
| Script | Line | scoring_strategy |
|---|---|---|
| scoring_ESM1b_substitutions.sh | 13 | wt-marginals |
| scoring_ESM1v_substitutions.sh | 19 | masked-marginals |
| scoring_ESM2_substitutions.sh | 30 | masked-marginals |
Question: Was the ESM-1b benchmark result computed with wt-marginals (per the script) or masked-marginals (per the paper)?
Additionally, the masked-marginals implementation in compute_fitness.py (lines 486-514) masks each position independently across L forward passes to build a full per-position probability table:
for i in range(batch_tokens.size(1)):  # line 489: loop over ALL positions
    batch_tokens_masked[0, i] = alphabet.mask_idx  # line 491: mask one at a time
This differs from Meier et al.'s Strategy (a) (Eq. 1, p4), which masks all mutated positions simultaneously in a single forward pass. For single substitutions the two approaches are mathematically identical, but for multi-mutants they produce different scores (independent masking gives each position more context, since the other mutated sites remain visible to the model). Is this the intended behavior for multi-mutant scoring?
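To make the distinction concrete, here is a minimal sketch of the two masking strategies. It does not use ESM or the repository code: `toy_log_probs` is a hypothetical stand-in for a masked language model over a made-up 4-letter alphabet, and the function names and the 0-based `(position, wt, mt)` mutation format are my own assumptions for illustration.

```python
import math

ALPHABET = "ACDE"  # toy alphabet for illustration, not the real ESM vocabulary
MASK = "#"         # toy mask symbol


def toy_log_probs(tokens):
    """Hypothetical stand-in for a masked language model.

    Returns one dict of log-probabilities over ALPHABET per position.
    Each position's distribution depends on its unmasked neighbours,
    so masking additional positions changes the predictions.
    """
    out = []
    for i in range(len(tokens)):
        logits = {}
        for a in ALPHABET:
            score = 0.0
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens) and tokens[j] != MASK:
                    # Arbitrary deterministic coupling with a visible neighbour.
                    score += ((ord(a) * 7 + ord(tokens[j]) * 3 + j) % 5) / 2.0
            logits[a] = score
        log_z = math.log(sum(math.exp(v) for v in logits.values()))
        out.append({a: v - log_z for a, v in logits.items()})
    return out


def score_joint_mask(wt_seq, mutations):
    """Mask all mutated positions at once and use a single forward pass."""
    tokens = list(wt_seq)
    for pos, _, _ in mutations:
        tokens[pos] = MASK
    lp = toy_log_probs(tokens)
    return sum(lp[pos][mt] - lp[pos][wt] for pos, wt, mt in mutations)


def score_independent_mask(wt_seq, mutations):
    """Mask one position per forward pass, then sum the log-odds.

    Only the mutated positions are evaluated here; looping over every
    position to build a full table would give the same values for them.
    """
    total = 0.0
    for pos, wt, mt in mutations:
        tokens = list(wt_seq)
        tokens[pos] = MASK
        lp = toy_log_probs(tokens)
        total += lp[pos][mt] - lp[pos][wt]
    return total


if __name__ == "__main__":
    wt = "ACDEA"
    single = [(1, "C", "D")]
    double = [(1, "C", "D"), (2, "D", "A")]
    print(score_joint_mask(wt, single), score_independent_mask(wt, single))
    print(score_joint_mask(wt, double), score_independent_mask(wt, double))
```

With this toy model the single mutant gets the same score under both strategies, while the adjacent double mutant does not, which is the behavior I am asking about.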
2. Score Changes Between v1.0 and v1.1
Between v1.0 (commit 4d3d391) and v1.1 (commit 8f1ce5f), scores changed for every model in:
benchmarks/DMS_zero_shot/substitutions/Spearman/Summary_performance_DMS_substitutions_Spearman.csv
The v1.1 release note says "Updates to reference file" but doesn't explain why existing model scores changed.
| Model (v1.0 row → v1.1 row) | Column | v1.0 (4d3d391) | v1.1 (8f1ce5f) | Delta |
|---|---|---|---|---|
| ESM-1v single (27→44) | Average_Spearman | 0.385 | 0.374 | −0.011 |
| ESM-1v single (27→44) | Function_Expression | 0.431 | 0.405 | −0.026 |
| ESM-1v single (27→44) | Function_Stability | 0.476 | 0.437 | −0.039 |
| ESM-1v ensemble (17→29) | Average_Spearman | 0.416 | 0.406 | −0.010 |
| ESM-1v ensemble (17→29) | Function_Expression | 0.456 | 0.429 | −0.027 |
| ESM-1b (22→35) | Average_Spearman | 0.399 | 0.394 | −0.005 |
| ESM-1b (22→35) | Function_Expression | 0.427 | 0.406 | −0.021 |
| ESM2 650M (16→27) | Function_Expression | 0.439 | 0.415 | −0.024 |
| CARP 640M (33→47) | Function_Expression | 0.419 | 0.397 | −0.022 |
| Wavenet (48→45) | Average_Spearman | 0.215 | 0.373 | +0.158 |
Note: Rows shifted between versions because new models were added, changing the Model_rank ordering.
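Because the row positions shift, the comparison has to be keyed on the model name rather than the row index. Below is a rough sketch of how such a comparison can be done; the `Model_name` column name and the model labels in the inline excerpts are placeholders I chose for illustration (the real CSV headers may differ), and only the numeric values come from the two summary files.

```python
import csv
import io


def load_scores(csv_text, key_col, value_cols):
    """Parse CSV text into {model: {column: float}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_col]: {c: float(row[c]) for c in value_cols} for row in reader}


def score_deltas(old_text, new_text, key_col, value_cols):
    """Return new-minus-old scores for models present in both versions."""
    old = load_scores(old_text, key_col, value_cols)
    new = load_scores(new_text, key_col, value_cols)
    deltas = {}
    for model in sorted(old.keys() & new.keys()):
        deltas[model] = {
            c: round(new[model][c] - old[model][c], 3) for c in value_cols
        }
    return deltas


# Tiny excerpts; "Model_name" and the labels are hypothetical placeholders.
V1_0 = """Model_name,Average_Spearman
ESM-1b,0.399
ESM-1v (single),0.385
Wavenet,0.215
"""

V1_1 = """Model_name,Average_Spearman
Wavenet,0.373
ESM-1v (single),0.374
ESM-1b,0.394
"""

if __name__ == "__main__":
    for model, cols in score_deltas(V1_0, V1_1, "Model_name", ["Average_Spearman"]).items():
        print(model, cols)
```

The two CSV texts can be pulled from each commit with `git show <commit>:<path>`, so no checkout is needed.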
Diffing reference_files/DMS_substitutions.csv between the two commits shows only cosmetic changes:
- 3 assay ID renames (e.g., PSAE_SYNP2_Tsuboyama_2023_1PSE → PSAE_PICP2_Tsuboyama_2023_1PSE)
- Formatting of binarization cutoffs (-1.0 → -1)
- A new pdb_range column
Question: Were the underlying DMS data CSV files updated between v1.0 and v1.1? I would love to know why this change occurred, in case I am overlooking something or if these are new numbers I should aim for in my benchmark replication.