ESM-1b scoring strategy & v1.0→v1.1 score changes across all models

Hi! Thank you for developing and maintaining this fantastic benchmark! I have been working on reproducing the ProteinGym zero-shot substitution benchmark and noticed a couple of things I'd appreciate clarification on.

---

### 1. ESM-1b Scoring Strategy 

The paper (Section A.2, p32) states:

> "We predict fitness for ESM models with the masked-marginal approach introduced in Meier et al. [2021]"

This implies that all ESM models (ESM-1b, ESM-1v, ESM-2) use masked-marginals. However, the scoring scripts set different strategies:

| Script | Line | `scoring_strategy` |
|---|---|---|
| `scoring_ESM1b_substitutions.sh` | 13 | `wt-marginals` |
| `scoring_ESM1v_substitutions.sh` | 19 | `masked-marginals` |
| `scoring_ESM2_substitutions.sh` | 30 | `masked-marginals` |

**Question:** Was the ESM-1b benchmark result computed with `wt-marginals` (per the script) or `masked-marginals` (per the paper)?

Additionally, the masked-marginals implementation in `compute_fitness.py` (lines 486-514) masks each position independently, across L forward passes (one per position), to build a full per-position probability table:

```
for i in range(batch_tokens.size(1)):          # line 489: loop over ALL positions
    batch_tokens_masked[0, i] = alphabet.mask_idx  # line 491: mask one at a time
```

This differs from Meier et al.'s Strategy (a) (Eq. 1, p4), which masks all mutated positions simultaneously in a single forward pass. For single substitutions the two approaches are mathematically identical, but for multi-mutants they produce different scores: with independent masking, each mutated position is predicted with the wild-type residues at the other mutated positions still visible, so it gets more context. Is this the intended behavior for multi-mutant scoring?
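
To make the distinction concrete, here is a minimal sketch of the two strategies. It does not use ESM or any ProteinGym code: `toy_log_probs` is a hypothetical stand-in for a masked language model that only looks at the two neighbouring tokens, and the `C2D:D3C` mutant string format is an assumption for illustration.

```python
import math

ALPHABET = "ACDE"   # toy alphabet, not the ESM vocabulary
MASK = "#"

def toy_log_probs(tokens, pos):
    """Hypothetical masked LM: log-probs at `pos` given its two neighbours.
    Masked neighbours match no residue, so they contribute no context."""
    neighbours = [tokens[j] for j in (pos - 1, pos + 1) if 0 <= j < len(tokens)]
    logits = [sum(1.0 for n in neighbours if n == aa) for aa in ALPHABET]
    log_z = math.log(sum(math.exp(x) for x in logits))
    return {aa: x - log_z for aa, x in zip(ALPHABET, logits)}

def parse(mutant):
    """'C2D:D3C' -> [(1, 'C', 'D'), (2, 'D', 'C')] with 0-based positions."""
    return [(int(m[1:-1]) - 1, m[0], m[-1]) for m in mutant.split(":")]

def score_independent(wt, mutant):
    """Mask one mutated position per forward pass, then sum the log-ratios."""
    total = 0.0
    for pos, wt_aa, mt_aa in parse(mutant):
        assert wt[pos] == wt_aa, "mutant does not match wild type"
        masked = list(wt)
        masked[pos] = MASK
        lp = toy_log_probs(masked, pos)
        total += lp[mt_aa] - lp[wt_aa]
    return total

def score_joint(wt, mutant):
    """Mask all mutated positions at once and score them from one pass."""
    muts = parse(mutant)
    masked = list(wt)
    for pos, wt_aa, _ in muts:
        assert wt[pos] == wt_aa, "mutant does not match wild type"
        masked[pos] = MASK
    total = 0.0
    for pos, wt_aa, mt_aa in muts:
        lp = toy_log_probs(masked, pos)
        total += lp[mt_aa] - lp[wt_aa]
    return total

wt = "ACDA"
single = (score_independent(wt, "C2D"), score_joint(wt, "C2D"))
double = (score_independent(wt, "C2D:D3C"), score_joint(wt, "C2D:D3C"))
```

In this toy setup the two entries of `single` agree, while the two entries of `double` differ, because under joint masking the adjacent mutated positions can no longer see each other's wild-type residue.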

---

### 2. Score Changes Between v1.0 and v1.1

Between v1.0 (commit `4d3d391`) and v1.1 (commit `8f1ce5f`), scores changed for every model in:
```
benchmarks/DMS_zero_shot/substitutions/Spearman/Summary_performance_DMS_substitutions_Spearman.csv
```

The v1.1 release note says *"Updates to reference file"* but doesn't explain why existing model scores changed. 

| Model (v1.0 row → v1.1 row) | Column | v1.0 (`4d3d391`) | v1.1 (`8f1ce5f`) | Delta |
|---|---|---|---|---|
| ESM-1v single (27→44) | `Average_Spearman` | 0.385 | 0.374 | −0.011 |
| ESM-1v single (27→44) | `Function_Expression` | 0.431 | 0.405 | −0.026 |
| ESM-1v single (27→44) | `Function_Stability` | 0.476 | 0.437 | −0.039 |
| ESM-1v ensemble (17→29) | `Average_Spearman` | 0.416 | 0.406 | −0.010 |
| ESM-1v ensemble (17→29) | `Function_Expression` | 0.456 | 0.429 | −0.027 |
| ESM-1b (22→35) | `Average_Spearman` | 0.399 | 0.394 | −0.005 |
| ESM-1b (22→35) | `Function_Expression` | 0.427 | 0.406 | −0.021 |
| ESM2 650M (16→27) | `Function_Expression` | 0.439 | 0.415 | −0.024 |
| CARP 640M (33→47) | `Function_Expression` | 0.419 | 0.397 | −0.022 |
| Wavenet (48→45) | `Average_Spearman` | 0.215 | 0.373 | **+0.158** |

> **Note:** Rows shifted between versions because new models were added, changing the `Model_rank` ordering.
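
For reference, the deltas above can be recomputed by joining the two file versions on model name instead of row position. This is a minimal sketch: the key column name `Model_name` is my assumption about the summary CSV, and the two inline strings are tiny stand-ins that reuse values from the table rather than the real files.

```python
import csv
import io

def load_scores(csv_text, key="Model_name"):
    """Index the rows of a summary CSV by model name (key column is assumed)."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}

def score_deltas(old_text, new_text, column, key="Model_name"):
    """Return {model: new - old} for models present in both versions."""
    old = load_scores(old_text, key)
    new = load_scores(new_text, key)
    return {
        name: round(float(new[name][column]) - float(old[name][column]), 3)
        for name in old.keys() & new.keys()
    }

# Stand-ins for the v1.0 and v1.1 files, with values copied from the table.
v1_0 = "Model_name,Average_Spearman\nESM-1b,0.399\nWavenet,0.215\n"
v1_1 = "Model_name,Average_Spearman\nESM-1b,0.394\nWavenet,0.373\n"
deltas = score_deltas(v1_0, v1_1, "Average_Spearman")
```

The real inputs can be obtained with `git show 4d3d391:<path>` and `git show 8f1ce5f:<path>`, using the summary CSV path given above.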

Diffing `reference_files/DMS_substitutions.csv` between the two commits shows only cosmetic changes:
- 3 assay ID renames (e.g., `PSAE_SYNP2_Tsuboyama_2023_1PSE` → `PSAE_PICP2_Tsuboyama_2023_1PSE`)
- Formatting of binarization cutoffs (`-1.0` → `-1`)
- A new `pdb_range` column
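
The comparison I ran is roughly equivalent to the sketch below, which ignores purely numeric formatting differences and reports only added columns, added or removed IDs, and changed cells. The column names `DMS_id` and `DMS_binarization_cutoff` are assumptions here, and the second row and the `pdb_range` values in the inline data are made up for illustration.

```python
import csv
import io

def _norm(value):
    """Treat numerically equal cells ('-1.0' vs '-1') as identical."""
    try:
        return float(value)
    except ValueError:
        return value.strip()

def substantive_diff(old_text, new_text, key="DMS_id"):
    """Compare two reference CSVs, ignoring cosmetic number formatting."""
    old_rows = list(csv.DictReader(io.StringIO(old_text)))
    new_rows = list(csv.DictReader(io.StringIO(new_text)))
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    old_cols = set(old_rows[0]) if old_rows else set()
    new_cols = set(new_rows[0]) if new_rows else set()
    shared_cols = sorted(old_cols & new_cols)
    changed = [
        (dms_id, col, old[dms_id][col], new[dms_id][col])
        for dms_id in sorted(old.keys() & new.keys())
        for col in shared_cols
        if _norm(old[dms_id][col]) != _norm(new[dms_id][col])
    ]
    return {
        "added_columns": sorted(new_cols - old_cols),
        "removed_ids": sorted(old.keys() - new.keys()),
        "added_ids": sorted(new.keys() - old.keys()),
        "changed_cells": changed,
    }

# Illustrative stand-ins; "A_TEST" and the pdb_range values are invented.
old_ref = (
    "DMS_id,DMS_binarization_cutoff\n"
    "PSAE_SYNP2_Tsuboyama_2023_1PSE,-1.0\n"
    "A_TEST,-1.0\n"
)
new_ref = (
    "DMS_id,DMS_binarization_cutoff,pdb_range\n"
    "PSAE_PICP2_Tsuboyama_2023_1PSE,-1,1-69\n"
    "A_TEST,-1,1-10\n"
)
report = substantive_diff(old_ref, new_ref)
```

A renamed assay shows up as one removed and one added ID, so renames still need to be paired by hand.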

**Question:** Were the underlying DMS data CSV files updated between v1.0 and v1.1? I would like to understand what caused these changes, in case I am overlooking something, and whether the v1.1 numbers are the ones I should target in my benchmark replication.
