A standardized framework to evaluate and quantify species bias in protein sequence models.
Many protein sequence models exhibit species bias – they systematically assign higher scores (e.g., likelihoods) to sequences from certain species (often mammals) and lower scores to others (often Archaea), even for the same protein. This can affect the interpretation of these scores across the tree of life, and the bias can be detrimental for some protein design applications.
ProteinBias uses an Elo rating system to quantify species bias in any protein scoring model:
- ✅ Easy benchmarking: Compare your model against established baselines with one command
- ✅ Quantitative bias measurement: Get precise Elo ratings showing which species your model favors
- ✅ Reproducible results: Validate your setup by reproducing published baseline results
Paper: This framework accompanies our paper, "Protein language models are biased by unequal sequence sampling across the tree of life".
ProteinBias measures bias by comparing how your model scores orthologous proteins (same protein function, different species):
- Load protein sequences from diverse species that share the same protein name in UniProt
- Score sequences using your model (or use our baseline models)
- Run Elo tournaments where species "compete" based on their protein sequence scores
- Analyze Elo rating results to see which species received high Elo ratings (i.e., systematically higher model scores)
- Get bias metrics to compare models on their Elo rating distributions
A larger spread in Elo ratings indicates more bias in the model. Note that bias in protein scoring models can sometimes be desirable, depending on the application (see the paper for discussion).
Technical details: Species start at Elo 1500, then "compete" in pairwise matchups based on model scores. Winners gain points, losers lose points using the standard Elo algorithm. Results are averaged across multiple random tournament orderings for stability.
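The rating procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the package's actual implementation: it uses the textbook Elo expected-score formula (scale factor 400), treats the species with the higher model score for a shared protein as the winner, and pairs every two species that share a protein. The function names and the toy scores are hypothetical.

```python
import random
from itertools import combinations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (A, B) ratings; outcome_a is 1.0 (win), 0.5 (draw), or 0.0 (loss)."""
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_tournament(scores, k=32.0, seed=0):
    """Run one randomly ordered tournament.

    scores maps protein name -> {species: model score} (assumed layout).
    """
    rng = random.Random(seed)
    matchups = [
        (protein, a, b)
        for protein, by_species in scores.items()
        for a, b in combinations(sorted(by_species), 2)
    ]
    rng.shuffle(matchups)
    # Every species starts at the neutral rating of 1500.
    ratings = {s: 1500.0 for by_species in scores.values() for s in by_species}
    for protein, a, b in matchups:
        score_a, score_b = scores[protein][a], scores[protein][b]
        outcome = 1.0 if score_a > score_b else 0.0 if score_a < score_b else 0.5
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome, k)
    return ratings


def mean_ratings(scores, num_replicates=10, k=32.0):
    """Average ratings over several random matchup orderings."""
    totals = {}
    for seed in range(num_replicates):
        for species, rating in run_tournament(scores, k, seed).items():
            totals[species] = totals.get(species, 0.0) + rating
    return {s: total / num_replicates for s, total in totals.items()}


# Toy data (hypothetical): species_a scores higher on both proteins.
toy_scores = {
    "protein_1": {"species_a": -1.0, "species_b": -2.0},
    "protein_2": {"species_a": -0.5, "species_b": -3.0},
}
toy_ratings = mean_ratings(toy_scores)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating is conserved, so ratings above 1500 for some species are always balanced by ratings below 1500 for others.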
# Clone the repository
git clone <repository-url>
cd proteinbias
# Install with uv (recommended)
uv sync
source .venv/bin/activate
uv pip install -e .[dev]
# Or install with pip
pip install -e .[dev]
# Test with existing baseline models (ProGen2, ESM2 variants)
proteinbias run-baseline --output baseline_results.csv --summary baseline_summary.csv
This ensures everything works by reproducing published results. Compare the outputs to data/core/baseline_results.csv and data/core/baseline_summary.csv.
Step 1: Create your model class
# my_model.py
from proteinbias import BaseModel

class MyModel(BaseModel):
    def __init__(self):
        super().__init__("MyModel")

    def score_sequence(self, sequence: str) -> float:
        # Your scoring logic here (higher = better).
        # your_scoring_function is a placeholder for your own code.
        return your_scoring_function(sequence)

Step 2: Run the benchmark
proteinbias run-benchmark \
--model-file my_model.py \
--model-class MyModel \
--output results.csv \
--summary summary.csv
The results.csv shows Elo ratings for each species and the standard error of the mean Elo rating across replicates:
| genus_species | Elo_Progen2_xlarge_ll_mean | Elo_Progen2_xlarge_ll_SE | Elo_MyModel_mean | Elo_MyModel_SE |
|---|---|---|---|---|
| Homo sapiens | 2405.3 | 12.5 | 1547.3 | 8.0 |
| Halobacterium salinarum | 1113.3 | 27.7 | 1492.1 | 6.5 |
Key insights:
- 1500 = neutral rating (no bias)
- Higher ratings = your model favors this species
- Large spread across species = more bias in your model
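As a rough illustration of reading these ratings programmatically, the sketch below ranks species for one model and reports each rating's offset from the neutral 1500. It assumes results.csv is comma-separated and that columns follow the `Elo_<model>_mean` pattern shown in the example table; `rank_species` is a hypothetical helper, and the two inline rows are copied from the table above.

```python
import csv
import io

# Two rows copied from the example table; in practice, open results.csv instead.
SAMPLE = """genus_species,Elo_Progen2_xlarge_ll_mean,Elo_Progen2_xlarge_ll_SE,Elo_MyModel_mean,Elo_MyModel_SE
Homo sapiens,2405.3,12.5,1547.3,8.0
Halobacterium salinarum,1113.3,27.7,1492.1,6.5
"""


def rank_species(handle, model_name):
    """Return (species, mean Elo) pairs sorted from most to least favored."""
    column = f"Elo_{model_name}_mean"  # assumed column naming
    rows = [(row["genus_species"], float(row[column])) for row in csv.DictReader(handle)]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)


ranked = rank_species(io.StringIO(SAMPLE), "MyModel")
spread = ranked[0][1] - ranked[-1][1]
for species, elo in ranked:
    print(f"{species}: {elo:.1f} ({elo - 1500:+.1f} vs. neutral)")
print(f"spread: {spread:.1f}")
```

To use a real results file, pass `open("results.csv", newline="")` instead of the in-memory sample.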
The summary.csv shows aggregated bias metrics to compare models at a high level:
| model | range | std_dev | iqr | eukaryota_mean | bacteria_mean | archaea_mean | mammalia_mean |
|---|---|---|---|---|---|---|---|
| Progen2_xlarge_ll | 2040.0 | 563.2 | 612.3 | 1562.9 | 1523.1 | 846.0 | 2148.2 |
| MyModel | 345.8 | 62.3 | 89.2 | 1504.2 | 1478.1 | 1423.5 | 1512.7 |
In this example, MyModel seems to show less bias than ProGen2 (smaller range and IQR, lower standard deviation). Note that lower bias is not always desirable; see the paper for discussion of interpreting bias.
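The spread metrics in the summary can be approximated from a list of per-species Elo ratings as in the sketch below. This is an assumption-laden illustration: the README does not state whether the package uses the sample or population standard deviation, or which quantile method defines the IQR, so the choices here (sample standard deviation, inclusive quartiles) may not match the published numbers exactly. `bias_summary` is a hypothetical name.

```python
import statistics


def bias_summary(ratings):
    """Compute range, standard deviation, and IQR of a list of Elo ratings."""
    values = sorted(ratings)
    # Quartile method is an assumption; other methods give slightly different IQRs.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "range": values[-1] - values[0],
        "std_dev": statistics.stdev(values),  # sample standard deviation (assumed)
        "iqr": q3 - q1,
    }


# Toy ratings (hypothetical), evenly spaced around the neutral 1500.
summary = bias_summary([1400.0, 1450.0, 1500.0, 1550.0, 1600.0])
print(summary)
```

All three metrics grow as ratings spread away from 1500, which is why they serve as model-level indicators of bias.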
Scores are automatically cached and reused, so if a model is slow at scoring protein sequences, you only need to run scoring once:
# First run: computes and caches scores, then runs Elo computation
proteinbias run-benchmark --model-file my_model.py --model-class MyModel
# Second run: reuses cached scores (near-instant), then runs Elo computation
proteinbias run-benchmark --model-file my_model.py --model-class MyModel
Alternative workflows:
You can also compute scores outside of this benchmark and add these scores to the cache:
# Add pre-computed scores from file
proteinbias add-scores my_scores.csv "MyModel"
# Run benchmark using only cached scores (multiple models)
proteinbias run-cached-benchmark "Model1,Model2,Model3" --output results.csv
# List cached models
proteinbias list-scores
# Export scores for a model
proteinbias export-scores MyModel my_model_scores.csv
Score file format:
| sequence_id | score |
|---|---|
| P12345 | 1.23 |
| Q67890 | 0.87 |
Note: The cache system uses sequence IDs (from the UniProt Entry column of the default dataset) for efficient lookup.
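If you compute scores in your own pipeline, a file in this shape can be produced with Python's standard csv module. The sketch below assumes the file is comma-separated with the header names shown in the table; `write_scores` is a hypothetical helper, and the IDs and scores are the placeholder values from the table.

```python
import csv
import io


def write_scores(scores, handle):
    """Write {sequence_id: score} pairs as a two-column CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["sequence_id", "score"])
    for sequence_id, score in scores.items():
        writer.writerow([sequence_id, score])


# In-memory demonstration; sequence IDs should be UniProt Entry identifiers.
buffer = io.StringIO()
write_scores({"P12345": 1.23, "Q67890": 0.87}, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk, open the target with `open("my_scores.csv", "w", newline="")` and pass that handle; the resulting file could then be given to `proteinbias add-scores`.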
If your model supports batch processing, override the score_sequences method:
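from typing import List

from proteinbias import BaseModel

class MyBatchModel(BaseModel):
    def score_sequences(self, sequences: List[str]) -> List[float]:
        # Implement efficient batch scoring; batch_score is a placeholder for your own code.
        return batch_score(sequences)

# Run with custom parameters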
proteinbias run-benchmark \
--model-file my_model.py \
--num-replicates 100 \
--k-factor 32 \
--n-jobs 8 \
--output results.csv \
--summary summary.csv
See examples/example_model.py for several example implementations:
- RandomModel: Simple random baseline
- SequenceLengthModel: Length-based scoring
- AminoAcidCompositionModel: Composition-based scoring
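To give a feel for what such simple scorers compute, here is a standalone sketch of a length-based and a composition-based scoring function. These are illustrative assumptions, not the code in examples/example_model.py: the function names and the choice of hydrophobic residues are hypothetical. Either function could serve as the body of a `score_sequence` method.

```python
# Residues counted as hydrophobic here; an illustrative choice, not a standard.
HYDROPHOBIC = frozenset("AVILMFWC")


def sequence_length_score(sequence: str) -> float:
    """Score a sequence by its length alone."""
    return float(len(sequence))


def hydrophobic_fraction(sequence: str) -> float:
    """Score a sequence by the fraction of residues in HYDROPHOBIC."""
    if not sequence:
        return 0.0
    hits = sum(residue in HYDROPHOBIC for residue in sequence.upper())
    return hits / len(sequence)


print(sequence_length_score("MKVLA"))
print(hydrophobic_fraction("AKAK"))
```

Trivial scorers like these are useful sanity checks: any species bias they show reflects sequence properties rather than anything learned from training data.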
The default dataset is the one used for analysis in our paper. It consists of two files:
- Protein Sequences: data/core/curated_protein_sequences.csv - sequences and metadata
- Baseline Scores: data/core/baseline_scores.csv - ProGen2 and ESM2 model scores for sequences in the above file
Additionally, data used for the "time to the last common ancestor" analysis in the paper can be found at data/core/selected_species_with_time_to_common_ancestor_matrix.csv.
Currently the benchmark only supports the default dataset, but contributions of new datasets would be welcome!
Share your model results with the research community!
Community models are organized in individual directories under data/community/. Each model gets its own subdirectory containing:
- model.py - Model implementation
- scores.csv - Pre-computed scores for the benchmark dataset
- summary.csv - Benchmark results and bias statistics
See data/community/README.md for detailed contribution guidelines.
# View all model results (baseline + community)
proteinbias list-results
Important: The list-results command only shows:
- Baseline models (from the default benchmark dataset)
- Community models (from data/community/ subdirectories)
It combines the summary.csv files for different models, and does not depend on the raw scores. To see your own model in list-results, you must add your benchmark summary to the community directory as described in the contribution guidelines above.
All Commands
Benchmarking:
proteinbias run-baseline [--output] [--summary] [--num-replicates] [--k-factor]
proteinbias run-benchmark [--model-file] [--model-class] [--output] [--summary]
proteinbias run-cached-benchmark <models> [--output] [--summary]
Score Management:
proteinbias add-scores <file> <model_name> [--overwrite]
proteinbias list-scores
proteinbias export-scores <model_name> <output_file>
proteinbias clear-cache [--model-name <name> | --all]
Model Results:
proteinbias list-results [--sort-by range|std|iqr|model_name|none]
- Fork the repository
- Create a feature branch
- Install dev requirements and pre-commit tool
uvx pre-commit install
- Add your improvements
- Submit a pull request
- Thank you!
MIT License - see LICENSE file for details.
If you use this benchmarking framework in your research, please cite:
@article{ding2024protein,
title={Protein language models are biased by unequal sequence sampling across the tree of life},
author={Ding, Frances and Steinhardt, Jacob},
journal={BioRxiv},
pages={2024--03},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}