Paper: Explicit representation of germline and non-germline residues improves antibody language modeling
"Are the baseline comparisons still favorable after tightly matching model size, training data, and antibody-specific supervision? This would better isolate the contribution of the proposed factorization."
Figure R1. Pseudo-perplexity comparison between PRISM and a tightly matched PRISM-less baseline (same ESM2-35M architecture, same OAS training data, same 2-stage training protocol, but without GL/NGL factorization). (Top) Mean PPL stratified by chain, region, and GL/NGL status (OAS test set, 22,591 sequences). PRISM-less exhibits substantially higher PPL across all categories, with CDR3 NGL positions showing the most dramatic gap (Heavy: 8.70 vs 56.53; Light: 5.19 vs 72.27). (Bottom) Outlier statistics confirm that PRISM-less struggles disproportionately at somatic hypermutation sites, demonstrating that the GL/NGL factorization is critical for modeling non-germline residues.
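For reference, the pseudo-perplexity reported in Figure R1 follows the standard masked-marginal recipe: each residue is masked in turn, the log-probability of the true token is recorded, and the negative mean is exponentiated. A minimal sketch with an off-the-shelf ESM2-35M checkpoint (illustrative only; the actual evaluation pipeline, batching, and stratification by chain/region/GL status are omitted):

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tok = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t12_35M_UR50D").eval()

@torch.no_grad()
def pseudo_perplexity(seq: str) -> float:
    """Mask one residue at a time and average the log-probability of the true token."""
    enc = tok(seq, return_tensors="pt")
    ids = enc["input_ids"]
    log_probs = []
    for i in range(1, ids.shape[1] - 1):                # skip BOS/EOS special tokens
        masked = ids.clone()
        masked[0, i] = tok.mask_token_id
        logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
        log_probs.append(torch.log_softmax(logits[0, i], dim=-1)[ids[0, i]].item())
    return float(torch.exp(-torch.tensor(log_probs).mean()))

print(pseudo_perplexity("EVQLVESGGGLVQPGGSLRLSCAAS"))   # toy heavy-chain fragment
```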
Figure R2. GL/NGL residue-level discrimination via linear probes trained on frozen embeddings (5,000 test sequences, ~1.15M residues). Despite training on the same antibody data, PRISM-less (F1 = 0.784, PR-AUC = 0.924) underperforms PRISM (F1 = 0.896, PR-AUC = 0.980), confirming that the explicit origin head and GL/NGL token factorization produce embeddings with stronger somatic mutation signal than can be learned implicitly through standard MLM training.
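The linear-probe protocol behind Figure R2 reduces to fitting a logistic-regression classifier on frozen per-residue embeddings and reporting F1 and PR-AUC. A sketch with placeholder arrays standing in for the real embeddings and GL/NGL labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, average_precision_score
from sklearn.model_selection import train_test_split

# X: (n_residues, d) frozen embeddings; y: 1 = NGL, 0 = GL  (placeholder data here)
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 64))
y = rng.integers(0, 2, size=20_000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

scores = probe.predict_proba(X_te)[:, 1]
print("F1    :", f1_score(y_te, scores > 0.5))
print("PR-AUC:", average_precision_score(y_te, scores))
```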
Figure R3. Zero-shot binding affinity prediction comparing PRISM to the matched PRISM-less baseline. (A) 3DMS benchmark (3 proteins, 106K variants): PRISM achieves positive Spearman rho across all datasets, while PRISM-less produces negative correlations on G6.31 (rho = -0.115) and CR9114 (rho = -0.409), indicating that standard masked LLR systematically mispredicts mutation effects by favoring germline reversion. (B) FLAb2 binding (62 proteins): PRISM mean rho = +0.066 vs PRISM-less = -0.044. The sign reversal demonstrates that GL/NGL factorization is essential for correct binding affinity prediction — without it, the model conflates evolutionary conservation with functional fitness.
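The baseline LLR scores in Figure R3 follow the usual masked-marginal mutation scoring: mask each mutated position and sum log p(mutant) − log p(wild type) over the mutated positions. A hedged sketch with a vanilla ESM2 checkpoint (PRISM's own scoring additionally uses its GL/NGL-aware heads, which is not reproduced here):

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tok = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t12_35M_UR50D").eval()

@torch.no_grad()
def llr_score(wt_seq: str, mutations: list[tuple[int, str]]) -> float:
    """Sum of log p(mutant) - log p(wild type) at masked positions (0-based positions)."""
    enc = tok(wt_seq, return_tensors="pt")
    ids = enc["input_ids"]
    score = 0.0
    for pos, mut_aa in mutations:
        tok_pos = pos + 1                       # offset for the BOS token
        masked = ids.clone()
        masked[0, tok_pos] = tok.mask_token_id
        logits = model(input_ids=masked).logits
        lp = torch.log_softmax(logits[0, tok_pos], dim=-1)
        score += (lp[tok.convert_tokens_to_ids(mut_aa)] - lp[ids[0, tok_pos]]).item()
    return score

print(llr_score("EVQLVESGGGLVQPGGSLRLSCAAS", [(5, "A")]))   # toy single mutant
```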
Figure R4. Zero-shot developability prediction comparing PRISM to the matched PRISM-less baseline. (A) Ginkgo benchmark (6 properties, 193-242 antibodies): PRISM achieves consistently positive correlations, while PRISM-less shows negative rho for AC-SINS (-0.126), HIC (-0.036), and Expression/Titer (-0.190), indicating that vanilla pseudo-perplexity fails to capture the directionality of biophysical properties. Immunogenicity (ADA) is the exception where both models perform comparably (0.310 vs 0.326), consistent with ADA being directly related to sequence novelty. (B) FLAb2 developability (5 properties): PRISM-less produces strongly negative rho for thermostability (-0.276) and expression (-0.530), whereas PRISM maintains positive correlations. These results demonstrate that the GL/NGL factorization enables PRISM to disentangle evolutionary conservation from biophysical fitness — a distinction that standard language models cannot capture.
"Do the reported affinity and developability gains remain consistent across a broader set of benchmarks or per-assay statistical tests? This would clarify robustness."
Figure R5. Summary of zero-shot binding affinity prediction on the FLAb2 benchmark (45 proteins, 17,351 variants). (A) Fraction of proteins where each model achieves a positive Spearman correlation between predicted and measured binding fitness. PRISM correctly predicts binding direction for 60% of proteins, compared to 38--51% for baselines. (B) Mean Spearman rho across all 45 proteins. PRISM is the only model with a positive mean rho (+0.066), winning 57% of pairwise comparisons against baselines. Baselines evaluated using log-likelihood ratios (LLR).
Figure R6. Per-protein Spearman rho for binding affinity prediction across 45 FLAb2 DMS datasets, comparing PRISM against ESM2-35M, ESM2-650M, AbLang2, AntiBERTy, and Sapiens. Rows are sorted by PRISM performance (ascending). Red = positive correlation (correct direction), blue = negative. PRISM shows the most consistently positive correlations across proteins.
Figure R7. Summary of zero-shot developability prediction robustness across 30 FLAb2 assay types. (A) Fraction of assays where each model achieves a positive directed Spearman rho (higher score = better developability). PRISM predicts the correct direction for 74% of assays, compared to 32--48% for baselines. (B) Mean directed rho across all 30 assays. PRISM is the only model with a positive mean (+0.076), winning 73% of pairwise comparisons. These results demonstrate that PRISM's developability prediction advantage is consistent across a broad set of independent assays, not limited to any single benchmark.
Figure R8. Directed Spearman rho between model scores and biophysical fitness across 30 assay types spanning 5 developability property categories (aggregation/self-interaction, thermostability, immunogenicity, polyreactivity, expression). Direction-corrected so that positive values always indicate favorable prediction (higher model score = better developability). PRISM uses property-specific optimal signals; baselines use pseudo-log-likelihood (PLL). Antibody counts per assay shown in parentheses. PRISM shows broadly positive correlations across assays, while baselines are inconsistent.
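For clarity, the directed Spearman ρ used in Figures R7–R8 is the ordinary Spearman coefficient with the sign flipped for assays in which a lower measured value indicates better developability, so that a positive value always means the model ranks in the favorable direction. A minimal sketch (the `lower_is_better` flag is supplied per assay):

```python
from scipy.stats import spearmanr

def directed_spearman(model_scores, assay_values, lower_is_better: bool) -> float:
    """Spearman rho, sign-flipped so positive = model ranks toward better developability."""
    rho, _ = spearmanr(model_scores, assay_values)
    return -rho if lower_is_better else rho

# Example: an assay where a lower measured value means better developability
print(directed_spearman([0.1, 0.4, 0.8], [12.0, 6.0, 1.0], lower_is_better=True))
```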
"How much overlap exists between NGL residues and CDRs? As many antibody generation methods that design CDR conditioned on fixed FR, what would happen if all CDR residues were marked as NGL while all FR residues were masked as GL?"
Figure R9. Comparison of PRISM's per-residue origin labels against a region-level heuristic (CDR=NGL, FR=GL). (a) GL/NGL distribution across FR and CDR positions in the 63M unpaired OAS training sequences (7.2B residue positions). NGL residues constitute only 8.8% of FR and 14.5% of CDR positions — even in CDRs, 85.5% of residues remain germline-conserved. The CDR/FR enrichment ratio is only 1.64×, indicating that NGL mutations are not concentrated in CDRs as the region heuristic assumes. (b–c) Pseudo-perplexity on the OAS test set (22,591 paired sequences), stratified by FR/CDR and true GL/NGL status for heavy and light chains separately. Per-residue origin labels (blue) are compared against region-forced labels (red). At CDR-GL positions, region forcing causes 8.7× (heavy) and 6.9× (light) PPL degradation by incorrectly applying the NGL head to germline-conserved residues. At FR-NGL positions, the two approaches perform comparably (~1.1×). Both chains exhibit the same pattern, demonstrating that PRISM's origin head captures sub-region granularity — distinguishing GL from NGL within CDRs and FRs — that a simple region heuristic cannot.
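The CDR/FR enrichment ratio in Figure R9(a) is simply the NGL fraction among CDR positions divided by the NGL fraction among FR positions; with the reported corpus-level fractions, 0.145 / 0.088 ≈ 1.65, matching the quoted 1.64× up to rounding. A small sketch of how such per-region fractions are tallied from per-residue region and origin labels (toy arrays shown):

```python
import numpy as np

# Per-residue annotations for a toy batch: region in {"FR", "CDR"}, origin 0 = GL, 1 = NGL
regions = np.array(["FR", "FR", "CDR", "CDR", "CDR", "FR"])
origins = np.array([0, 1, 0, 1, 1, 0])

ngl_frac_fr = origins[regions == "FR"].mean()
ngl_frac_cdr = origins[regions == "CDR"].mean()
print(f"NGL fraction FR = {ngl_frac_fr:.3f}, CDR = {ngl_frac_cdr:.3f}, "
      f"enrichment = {ngl_frac_cdr / ngl_frac_fr:.2f}x")

# Corpus-level fractions reported in Figure R9(a)
print(0.145 / 0.088)
```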
"It would be helpful to compare PRISM with IgLM, an autoregressive generative model for antibody modeling and design, which is already cited in the introduction."
Figure R10. IgLM pseudo-perplexity distributions across antibody regions. Pseudo-perplexity is approximated by masking one position at a time and using the infill_range parameter when calculating log-likelihoods. (Left) Overall pseudo-perplexity for heavy and light chains. (Middle) Framework & germline (FR & GL) regions. (Right) CDR3 & non-germline (NGL) regions, where pseudo-perplexity increases substantially for both chains, reflecting higher sequence diversity and model uncertainty.
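For reproducibility, the approximation described in the Figure R10 caption can be sketched with the public iglm package, scoring each position as a length-1 infill region. The IgLM.log_likelihood signature, the chain/species tokens, and the convention that the returned value is the log-likelihood of the infilled span are assumptions based on the IgLM repository and should be verified against its documentation:

```python
import math
from iglm import IgLM   # pip install iglm

model = IgLM()                                   # assumes default pretrained weights
sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"           # toy heavy-chain fragment
chain_token, species_token = "[HEAVY]", "[HUMAN]"

# Score each position as a length-1 infill region and average the log-likelihoods.
lls = [
    model.log_likelihood(sequence, chain_token, species_token, infill_range=(i, i + 1))
    for i in range(len(sequence))
]
pseudo_ppl = math.exp(-sum(lls) / len(lls))
print(f"Approximate IgLM pseudo-perplexity: {pseudo_ppl:.2f}")
```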
Figure R11. Summary of IgLM zero-shot performance. (1st row) IgLM Spearman correlations for developability properties. We scored antibody sequences for thermostability (tm2 nanodsf avg), polyreactivity (polyreactivity prscore cho avg), hydrophobicity (hic rt avg), expression (titer avg), and immunogenicity (anti-drug antibody values) using the same datasets used in the primary manuscript. (2nd row) IgLM Spearman correlations for binding affinity of CR9114, G6.31, and Trastuzumab sequence variants.
"In the unnormalized log-odds calculation (Equation 6), the numerical dynamics of linearly combining logits and log-probabilities require a more in-depth discussion. Does α_i serve as a confidence decay factor in a Bayesian update context, or is it merely an empirical scaling temperature? The manuscript would benefit from including distribution plots of the Alpha-gating values across the early, middle, and late stages of training."
| Region | Origin Accuracy | NGL Sensitivity | NGL Rate | Median α |
|---|---|---|---|---|
| FR4 | 0.944 | 0.005 | 0.039 | 0.962 |
| CDR3 | 0.892 | 0.069 | 0.068 | 0.991 |
| FR1 | 0.852 | 0.109 | 0.047 | 0.897 |
| FR2 | 0.810 | 0.191 | 0.079 | 0.848 |
| FR3 | 0.790 | 0.158 | 0.112 | 0.882 |
| CDR1 | 0.628 | 0.384 | 0.218 | 0.937 |
| CDR2 | 0.527 | 0.474 | 0.278 | 0.958 |
CDR2 has the lowest origin accuracy (0.527) yet the second-highest α (0.958). Across positions, the Pearson correlation between per-position α and origin accuracy is r=0.10 (p=0.22, not significant); instead, α correlates with the NGL mutation rate (r=0.22, p=0.009), supporting the interpretation that α acts as a task-relevance gate rather than a confidence decay factor.
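The correlations quoted above are plain Pearson coefficients over positions; a minimal sketch (the arrays below are region-level placeholders taken from the table, whereas the reported r and p values come from the full per-IMGT-position statistics):

```python
import numpy as np
from scipy.stats import pearsonr

# Region-level values from the table above, used here only as placeholders.
alpha           = np.array([0.962, 0.991, 0.897, 0.848, 0.882, 0.937, 0.958])
origin_accuracy = np.array([0.944, 0.892, 0.852, 0.810, 0.790, 0.628, 0.527])
ngl_rate        = np.array([0.039, 0.068, 0.047, 0.079, 0.112, 0.218, 0.278])

r_acc, p_acc = pearsonr(alpha, origin_accuracy)
r_ngl, p_ngl = pearsonr(alpha, ngl_rate)
print(f"alpha vs origin accuracy: r={r_acc:+.2f} (p={p_acc:.2f})")
print(f"alpha vs NGL rate:        r={r_ngl:+.2f} (p={p_ngl:.2f})")
```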
Figure R12. Per-IMGT-position α boxplots at the best checkpoint for both heavy and light chains, illustrating the region-dependent gating structure. CDR regions (red) consistently show higher α than framework regions (blue), indicating that GL/NGL origin information is most utilized where somatic hypermutation is concentrated. The dashed line at α=0.5 marks the initialization value.
Figure R13. Per-IMGT-position α profiles (median with IQR ribbon) across four training stages for both heavy and light chains (n=300 paired sequences). The dashed line at α=0.5 indicates the initialization value. Three phases are evident: (1) Pretraining — α is moderate overall (median 0.684) but already elevated at CDR3 (0.958), indicating that the model discovers the high task-relevance of origin information at CDR3 purely from the MLM objective; (2) Early finetuning (best checkpoint) — α rises sharply across all regions (median 0.922), reflecting the increased utility of GL/NGL discrimination on paired antibody data; (3) Late finetuning (overfitting) — α collapses back toward (and in FR regions below) the initialization value, as the AA head memorizes training sequences and origin information becomes redundant. CDR3 remains the most resistant to this collapse (0.878 vs. 0.567 overall).
| Training Stage | FR1 | FR2 | CDR2 | CDR3 | Overall |
|---|---|---|---|---|---|
| Pretrain (ep. 4) | 0.683 | 0.607 | 0.720 | 0.958 | 0.684 |
| Finetune ep. 23 (best) | 0.894 | 0.848 | 0.960 | 0.991 | 0.922 |
| Finetune ep. 53 | 0.521 | 0.510 | 0.743 | 0.865 | 0.567 |
| Finetune ep. 74 | 0.544 | 0.537 | 0.749 | 0.878 | 0.608 |
"The authors utilize a Stop-Gradient operator during conditional injection to maintain the 'evolutionary purity' of the Origin Head. However, it is worth investigating whether disconnecting the gradient flow subjects the AA Head to volatile conditional inputs (covariate shift) during the early stages of training. Have any instabilities been observed during initial training phases?"
Figure R14. Controlled ablation comparing fixed α=1.0 (origin signal injected at full strength, no learned modulation) versus learned α over 500 optimizer steps of pretraining (identical hyperparameters, data, and seed). (A1) Training loss curves are visually indistinguishable, confirming no instability from the stop-gradient operator. (A2) Origin head loss converges rapidly (0.089→0.033), reaching near-final values before the LR warmup period ends (dashed line), which minimizes the window of volatile conditional inputs. (B1–B2) Rolling standard deviation (window=20) of total and final loss; mean volatility is identical between conditions (0.671 vs. 0.672). (C1) Learned α trajectory showing emergence of region-specific gating from initialization (0.5), with CDR reaching 0.98 and FR plateauing at 0.93 by step 500. (C2) AA Head NGL perplexity diverges substantially: learned α achieves 2.35 vs. 4.20 for fixed, indicating that learned α enables the AA Head to develop stronger intrinsic NGL modeling rather than delegating to the origin signal.
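To make the stop-gradient point concrete, below is a schematic PyTorch sketch of one plausible form of the conditional injection: the origin head's output is detached (stop-gradient) before being scaled by a learned per-position gate α and added to the features feeding the AA head. This is an illustrative reconstruction under assumed shapes and layer names, not the exact PRISM architecture or Equation 6:

```python
import torch
import torch.nn as nn

class GatedOriginInjection(nn.Module):
    """Schematic: inject detached origin-head output into the AA head, gated by alpha."""

    def __init__(self, d_model: int, n_positions: int, n_aa: int = 20):
        super().__init__()
        self.origin_head = nn.Linear(d_model, 2)     # GL vs NGL
        self.origin_proj = nn.Linear(2, d_model)     # map origin signal into feature space
        self.aa_head = nn.Linear(d_model, n_aa)
        # One learnable gate per position, initialized at 0.5 as in Figures R12-R13.
        self.alpha = nn.Parameter(torch.full((n_positions,), 0.5))

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, positions, d_model)
        origin_logits = self.origin_head(hidden)
        # Stop-gradient: the AA loss must not reshape the origin head's representation.
        origin_signal = self.origin_proj(origin_logits.log_softmax(-1).detach())
        gate = self.alpha.clamp(0.0, 1.0).view(1, -1, 1)
        aa_logits = self.aa_head(hidden + gate * origin_signal)
        return aa_logits, origin_logits

# Toy usage
module = GatedOriginInjection(d_model=32, n_positions=10)
aa_logits, origin_logits = module(torch.randn(2, 10, 32))
print(aa_logits.shape, origin_logits.shape)   # (2, 10, 20) and (2, 10, 2)
```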
"While PRISM successfully shifts the Spearman correlation for G6.31 from negative to positive, the absolute value remains very weak (ρ≈0.085). The text should more objectively discuss the actual 'enrichment efficiency' such a low positive correlation might provide in high-throughput physical screening (e.g., phage display) to avoid overstating its practical value for therapeutic guidance."
| Model | Spearman ρ | Recall@1% | Recall@5% | Recall@10% | Enrich@10% |
|---|---|---|---|---|---|
| ESM2-650M | -0.1301 | 0.047 | 0.065 | 0.171 | 1.7x |
| AntiBERTy | +0.0712 | 0.023 | 0.079 | 0.138 | 1.4x |
| Sapiens | +0.0101 | 0.023 | 0.070 | 0.121 | 1.2x |
| PRISM | +0.0853 | 0.023 | 0.070 | 0.114 | 1.1x |
| AbLang2 | -0.0190 | 0.000 | 0.033 | 0.110 | 1.1x |
| ESM2-35M | -0.2038 | 0.023 | 0.070 | 0.107 | 1.1x |
| Random | — | 0.010 | 0.050 | 0.100 | 1.0x |
On G6.31, PRISM is the only model with a positive Spearman ρ (+0.085). All baselines with negative ρ (ESM2-35M, ESM2-650M, AbLang2) yield enrichment driven by noise rather than by correct ranking direction. Maximum enrichment across all models is 1.7x.
| Model | Spearman ρ | Recall@1% | Recall@5% | Recall@10% | Enrich@10% |
|---|---|---|---|---|---|
| PRISM | +0.3933 | 0.035 | 0.170 | 0.272 | 2.7x |
| AbLang2 | -0.1433 | 0.002 | 0.025 | 0.082 | 0.8x |
| ESM2-650M | -0.4381 | 0.003 | 0.015 | 0.026 | 0.3x |
| ESM2-35M | -0.4411 | 0.002 | 0.010 | 0.022 | 0.2x |
| Sapiens | -0.3093 | 0.002 | 0.013 | 0.022 | 0.2x |
| AntiBERTy | -0.4084 | 0.002 | 0.006 | 0.017 | 0.2x |
| Random | — | 0.010 | 0.050 | 0.100 | 1.0x |
PRISM is the only model with a positive Spearman ρ (+0.393), achieving 2.7x enrichment at 10%. All baselines have negative ρ — selecting their highest-scoring variants produces anti-enrichment (0.2–0.3x, worse than random).
| Model | Spearman ρ | Recall@1% | Recall@5% | Recall@10% | Enrich@10% |
|---|---|---|---|---|---|
| AntiBERTy | +0.4150 | 0.025 | 0.188 | 0.324 | 3.2x |
| Sapiens | +0.3788 | 0.027 | 0.179 | 0.305 | 3.0x |
| ESM2-650M | +0.3622 | 0.027 | 0.175 | 0.304 | 3.0x |
| PRISM | +0.3656 | 0.036 | 0.175 | 0.301 | 3.0x |
| ESM2-35M | +0.3677 | 0.033 | 0.173 | 0.300 | 3.0x |
| AbLang2 | -0.1103 | 0.003 | 0.018 | 0.050 | 0.5x |
| Random | — | 0.010 | 0.050 | 0.100 | 1.0x |
All models except AbLang2 achieve positive ρ and ~3.0x enrichment at 10%. PRISM performs on par with ESM2 and antibody-specific baselines.
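For transparency, Recall@k% and Enrich@10% in these tables follow the usual screening definitions: rank variants by model score, keep the top k%, measure the fraction of the true top-k% binders recovered, and divide by the k% expected at random. A minimal sketch with synthetic scores and fitness values:

```python
import numpy as np

def recall_at_k(scores, fitness, k_frac: float) -> float:
    """Fraction of the true top-k% variants recovered in the model's top-k% by score."""
    n = len(scores)
    k = max(1, int(round(k_frac * n)))
    top_by_score = set(np.argsort(scores)[::-1][:k])
    top_by_fitness = set(np.argsort(fitness)[::-1][:k])
    return len(top_by_score & top_by_fitness) / k

rng = np.random.default_rng(0)
fitness = rng.normal(size=1000)
scores = 0.1 * fitness + rng.normal(size=1000)   # weakly informative model

r10 = recall_at_k(scores, fitness, 0.10)
print(f"Recall@10% = {r10:.3f}, Enrich@10% = {r10 / 0.10:.1f}x (random ~ 1.0x)")
```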
"While zero-shot affinity and reconstruction are persuasive, additional downstream design tasks (e.g., controlled multi-objective optimization or prospective validation) would better demonstrate practical utility."
Figure R15. We trained ridge regression models on mean-pooled AntiBERTy embeddings from paired heavy and light chains. For each antibody, raw embeddings were extracted for each chain, mean-pooled across residues, concatenated, standardized, and used as input to a ridge regressor with $\alpha = 1.0$. Models were evaluated with predefined 5-fold cross-validation on the DMS datasets and then retrained on all available data to obtain final reward models for scoring generated sequences.
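A sketch of this reward-model protocol, with placeholder arrays standing in for the concatenated mean-pooled AntiBERTy embeddings (embedding extraction itself is not shown, and the real evaluation uses predefined rather than random folds):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder features: concatenated mean-pooled heavy/light embeddings per antibody.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2 * 512))          # (n_antibodies, d_heavy + d_light)
y = rng.normal(size=500)                     # measured binding fitness (placeholder)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# 5-fold CV for evaluation, then refit on all data as the final reward model.
cv_r2 = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold R^2:", cv_r2.round(3))

reward_model = model.fit(X, y)
print("Score for a new variant:", reward_model.predict(X[:1])[0])
```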
Figure R16. PLL-guided antibody variant generation for G6.31 (anti-VEGF, PDB 2FJH) under a fixed decoding budget (100 variants, ≤3 mutations, seed=42). Top row: KDE distributions of four metrics, all oriented so that rightward = better. Binding metrics (left two): Rosetta interface ΔΔG and Ridge regression Δ binding affinity. Developability metrics (right two): Rosetta folding ΔΔG and CamSol Δ solubility. Win rate table: Fraction of generated variants that improve over wild-type for each metric. PRISM-GL dominates solubility (88%), PRISM-Full achieves the best binding–stability balance (52% binding, 32% stability), and IgLM leads structural binding (51%) but with poor ML binding (7%). Bottom rows: Pareto front analysis for all four pairwise binding × developability combinations. Each point is one generated variant; step lines trace the Pareto-optimal frontier per model. By switching PRISM's decoding head — GL (germline-biased, blue), NGL (mutation-biased, red), or Full (region-specific α-gated, green) — the Pareto front shifts systematically, demonstrating controllability absent in baselines. All models use identical PLL-guided sampling with Gumbel-Top-k position selection.
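For reference, Gumbel-Top-k position selection adds i.i.d. Gumbel noise to per-position selection logits and keeps the k largest indices, which samples k distinct positions in proportion to their softmax weights. A self-contained sketch (the position logits are placeholders; the surrounding PLL-guided generation loop is not reproduced):

```python
import torch

def gumbel_top_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample k distinct indices, equivalent to sampling without replacement from softmax(logits)."""
    gumbel = -torch.log(-torch.log(torch.rand(logits.shape)))
    return torch.topk(logits + gumbel, k).indices

torch.manual_seed(42)
position_logits = torch.randn(120)           # e.g. per-residue selection scores along a chain
mutate_at = gumbel_top_k(position_logits, k=3)
print("Positions selected for mutation:", sorted(mutate_at.tolist()))
```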
Figure R17. Same experiment as Figure R16 for Trastuzumab (anti-HER2, PDB 1N8Z). The win rate table reveals that ESM2 achieves the highest structural binding win rate (61%) and stability (37%), while PRISM-GL leads solubility (74%) and IgLM achieves near-universal solubility improvement (94%). PRISM-Full provides the most balanced profile across all four metrics (37% binding, 22% ML binding, 26% stability, 51% solubility). Notably, no single model dominates all axes, illustrating the multi-objective trade-off that PRISM's factorized GL/NGL/Full heads can navigate by design.
Figure R18. Same experiment as Figure R16 for CR9114 (anti-influenza HA, PDB 4FQI). The win rate table highlights a stark contrast: PRISM modes achieve substantial win rates across all metrics (PRISM-GL: 55% ML binding, 38% stability, 52% solubility; PRISM-Full: 49% ML binding, 50% solubility), whereas baselines collapse — AbLang2 and ESM2 show ≤2% stability win rate and AbLang2 only 1% solubility. This antibody-specific gap underscores the advantage of GL/NGL-factorized decoding for challenging targets. Note: PDB crystal structure differs from the DMS wild-type at several positions, leading to higher skip rates for some models; only mutations at matching positions are scored.
"The approach relies on accurate GL/NGL labeling and consistent germline templates; sensitivity to annotation errors or alternative germline callers is not fully explored."
Figure R19. Pseudo-perplexity (marginalized) stratified by chain, region, and GL/NGL status under clean, +Noise2, and +Noise4 supervision (OAS test set, 22,591 sequences). (Top) Mean PPL for overall, FR GL, and CDR3 NGL positions. PRISM achieves the lowest PPL across all categories. Even +Noise4 remains below the best baseline (dashed) at CDR3 NGL positions (Heavy: 9.84 vs. AbLang2 10.65; Light: 6.79 vs. 7.38), indicating that functionally relevant representations are preserved despite noisy supervision. (Bottom) Outlier statistics (95th/99th percentile PPL and fraction of tokens with PPL > 20) confirm that noise primarily affects tail behavior rather than typical predictions.
Figure R20. GL/NGL residue-level discrimination via linear probes trained on frozen embeddings (5,000 test sequences, ~1.15M residues). PRISM clean achieves F1 = 0.896 and PR-AUC = 0.980. Under +Noise2, PR-AUC remains 0.880 — substantially above the best baseline (AntiBERTy: 0.785), demonstrating that approximate origin supervision still produces embeddings with stronger GL/NGL signal than models without such supervision.
Figure R21. Zero-shot binding affinity prediction under label noise. (A) 3DMS benchmark (3 proteins, 106K variants): +Noise2 maintains positive Spearman ρ across all datasets, with Trastuzumab (ρ = 0.357) matching the clean model. The +Noise4 sign reversal on CR9114 (ρ = −0.362) confirms that GL/NGL supervision is genuinely driving the gains. (B) FLAb2 binding (41 proteins): +Noise2 retains positive mean correlation (ρ = 0.022 vs. clean 0.066).
Figure R22. Zero-shot developability prediction under label noise. (A) Ginkgo benchmark (6 properties): immunogenicity (ADA) is notably robust (ρ = 0.310 → 0.294 → 0.292), and +Noise2 stays near or above baselines (dashed) for most properties. (B) FLAb2 developability (5 properties, per-assay mean Spearman ρ): +Noise2 remains above the best baseline (Sapiens, dashed) for thermostability (0.385 vs. 0.317) and self-interaction (0.354 vs. 0.288). In summary, PRISM degrades gracefully in proportion to the noise level, retaining most of its advantage at realistic noise (~15%), while the sharp decline at ~30% confirms that GL/NGL supervision is a genuine, load-bearing component.
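As a generic illustration of the kind of supervision corruption involved (the precise construction of +Noise2 and +Noise4 is not reproduced here), flipping a fixed fraction of per-residue GL/NGL labels can be sketched as:

```python
import numpy as np

def corrupt_origin_labels(origins: np.ndarray, flip_rate: float, seed: int = 0) -> np.ndarray:
    """Flip a random fraction of binary GL(0)/NGL(1) labels to simulate annotation noise."""
    rng = np.random.default_rng(seed)
    noisy = origins.copy()
    flip = rng.random(len(origins)) < flip_rate
    noisy[flip] = 1 - noisy[flip]
    return noisy

origins = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])   # toy per-residue GL/NGL labels
print(corrupt_origin_labels(origins, flip_rate=0.15))
```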





















