ProteinBias: Benchmarking Species Bias in Protein Models

A standardized framework to evaluate and quantify species bias in protein sequence models.

What is Species Bias?

Many protein sequence models exhibit species bias – they systematically assign higher scores (e.g., likelihoods) to sequences from certain species (often mammals) and lower scores to others (often Archaea), even for the same protein. This can affect the interpretation of these scores across the tree of life, and the bias can be detrimental for some protein design applications.

What This Tool Does

ProteinBias uses an Elo rating system to quantify species bias in any protein scoring model:

✅ Easy benchmarking: Compare your model against established baselines with one command
✅ Quantitative bias measurement: Get precise Elo ratings showing which species your model favors
✅ Reproducible results: Validate your setup by reproducing published baseline results

Paper: This framework accompanies our paper, "Protein language models are biased by unequal sequence sampling across the tree of life".

How It Works

ProteinBias measures bias by comparing how your model scores orthologous proteins (same protein function, different species):

1. Load protein sequences from diverse species that share the same protein name in UniProt
2. Score sequences using your model (or use our baseline models)
3. Run Elo tournaments where species "compete" based on their protein sequence scores
4. Analyze Elo rating results to see which species received high Elo ratings (i.e., systematically higher model scores)
5. Get bias metrics to compare models on their Elo rating distributions

Large Elo rating spread = more bias in the model, with the caveat that bias in protein scoring models can sometimes be desirable, depending on the application (see paper for discussion).

Technical details: Species start at Elo 1500, then "compete" in pairwise matchups based on model scores. Winners gain points, losers lose points using the standard Elo algorithm. Results are averaged across multiple random tournament orderings for stability.
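The update rule described above can be sketched in a few lines of Python. This is a minimal illustration of the standard Elo algorithm, not the package's actual implementation: the function names, the matchup tuple format, the tie handling, and the default K-factor of 32 are all assumptions made for this example.

```python
import random
from statistics import mean


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A when playing B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(matchups, k_factor=32.0, start=1500.0, seed=0):
    """Play every matchup once, in a random order determined by `seed`.

    `matchups` is a list of (species_a, score_a, species_b, score_b) tuples
    (a hypothetical format); the species with the higher model score wins.
    """
    rng = random.Random(seed)
    order = list(matchups)
    rng.shuffle(order)
    ratings = {}
    for sp_a, score_a, sp_b, score_b in order:
        r_a = ratings.setdefault(sp_a, start)
        r_b = ratings.setdefault(sp_b, start)
        if score_a == score_b:
            outcome_a = 0.5  # assumption: equal scores count as a draw
        else:
            outcome_a = 1.0 if score_a > score_b else 0.0
        e_a = expected_score(r_a, r_b)
        # Winner gains exactly what the loser gives up.
        ratings[sp_a] = r_a + k_factor * (outcome_a - e_a)
        ratings[sp_b] = r_b - k_factor * (outcome_a - e_a)
    return ratings


def average_ratings(matchups, num_replicates=10, k_factor=32.0):
    """Average final ratings over several random tournament orderings."""
    runs = [run_tournament(matchups, k_factor, seed=s) for s in range(num_replicates)]
    return {sp: mean(run[sp] for run in runs) for sp in runs[0]}
```

Because each update moves the same number of points from loser to winner, the mean rating across species stays at the starting value of 1500, which is why the spread of ratings (rather than their level) carries the bias signal.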

Installation

# Clone the repository
git clone <repository-url>
cd proteinbias

# Install with uv (recommended)
uv sync
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or install with pip (quotes keep shells such as zsh from expanding the brackets)
pip install -e ".[dev]"

Quick Start

1. First, validate your setup

# Test with existing baseline models (ProGen2, ESM2 variants)
proteinbias run-baseline --output baseline_results.csv --summary baseline_summary.csv

This ensures everything works by reproducing published results. Compare the outputs to data/core/baseline_results.csv and data/core/baseline_summary.csv.

2. Benchmark your own model

Step 1: Create your model class

# my_model.py
from proteinbias import BaseModel

class MyModel(BaseModel):
    def __init__(self):
        super().__init__("MyModel")
    
    def score_sequence(self, sequence: str) -> float:
        # Your scoring logic here (higher = better)
        return your_scoring_function(sequence)
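In the class above, your_scoring_function is a placeholder for your own code. As a purely illustrative stand-in, the sketch below scores a sequence by its fraction of hydrophobic residues. The residue set and the choice of score are assumptions made for this example, not something the benchmark prescribes; a real model would typically return a likelihood or similar quantity.

```python
# Toy residue set chosen for illustration only (an assumption).
HYDROPHOBIC = frozenset("AVILMFW")


def your_scoring_function(sequence: str) -> float:
    """Return the fraction of residues that are in HYDROPHOBIC (higher = better)."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(res in HYDROPHOBIC for res in seq) / len(seq)
```

Any function with this shape (one string in, one float out, higher meaning "better") can be dropped into score_sequence.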

Step 2: Run the benchmark

proteinbias run-benchmark \
    --model-file my_model.py \
    --model-class MyModel \
    --output results.csv \
    --summary summary.csv

3. Interpret your results

The results.csv shows Elo ratings for each species and the standard error of the mean Elo rating across replicates:

genus_species	Elo_Progen2_xlarge_ll_mean	Elo_Progen2_xlarge_ll_SE	Elo_MyModel_mean	Elo_MyModel_SE
Homo sapiens	2405.3	12.5	1547.3	8.0
Halobacterium salinarum	1113.3	27.7	1492.1	6.5

Key insights:

1500 = starting (neutral) rating; a species that stays near 1500 is neither favored nor disfavored
Higher ratings = your model favors this species
Large spread across species = more bias in your model

The summary.csv shows aggregated bias metrics to compare models at a high level:

model	range	std_dev	iqr	eukaryota_mean	bacteria_mean	archaea_mean	mammalia_mean
Progen2_xlarge_ll	2040.0	563.2	612.3	1562.9	1523.1	846.0	2148.2
MyModel	345.8	62.3	89.2	1504.2	1478.1	1423.5	1512.7

In this example, MyModel shows less bias than ProGen2 on every spread metric (smaller range, standard deviation, and IQR). Note that lower bias is not always desirable; see the paper for a discussion of how to interpret bias.
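To make the three spread metrics concrete, here is a minimal sketch of how range, standard deviation, and IQR could be computed from a list of per-species mean Elo ratings. The function name is hypothetical, and the exact conventions (population rather than sample standard deviation, inclusive quartiles) are assumptions; the package may compute them differently.

```python
from statistics import pstdev, quantiles


def bias_summary(elo_ratings):
    """Summarize the spread of per-species Elo ratings (needs at least 2 values)."""
    values = sorted(elo_ratings)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "range": values[-1] - values[0],
        "std_dev": pstdev(values),  # population standard deviation (assumed)
        "iqr": q3 - q1,
    }
```

Range is sensitive to a single outlier species, while IQR reflects the spread of the middle half of species, so it helps to compare models on more than one of these numbers.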

Advanced Usage

Score Caching

Scores are automatically cached and reused, so if a model is slow at scoring protein sequences, you only need to run scoring once:

# First run: computes and caches scores, then runs Elo computation
proteinbias run-benchmark --model-file my_model.py --model-class MyModel

# Second run: reuses cached scores (near-instant), then runs Elo computation
proteinbias run-benchmark --model-file my_model.py --model-class MyModel

Alternative workflows:

You can also compute scores outside of this benchmark and add these scores to the cache:

# Add pre-computed scores from file
proteinbias add-scores my_scores.csv "MyModel"

# Run benchmark using only cached scores (multiple models)
proteinbias run-cached-benchmark "Model1,Model2,Model3" --output results.csv

# List cached models
proteinbias list-scores

# Export scores for a model
proteinbias export-scores MyModel my_model_scores.csv

Score file format:

sequence_id	score
P12345	1.23
Q67890	0.87

Note: The cache system uses sequence IDs (from the UniProt Entry column of the default dataset) for efficient lookup.
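If you compute scores in your own pipeline, a small helper like the one below can write them in the two-column layout shown above. This is a sketch under assumptions: the helper name is hypothetical, and the comma delimiter is inferred from the .csv extension rather than confirmed by the documentation, so pass a different delimiter if add-scores expects one.

```python
import csv
import io


def write_scores(rows, handle, delimiter=","):
    """Write (sequence_id, score) pairs with a sequence_id/score header."""
    writer = csv.writer(handle, delimiter=delimiter)
    writer.writerow(["sequence_id", "score"])
    for seq_id, score in rows:
        writer.writerow([seq_id, float(score)])


# Example: write to an in-memory buffer; use open("my_scores.csv", "w", newline="")
# to write a real file instead.
buffer = io.StringIO()
write_scores([("P12345", 1.23), ("Q67890", 0.87)], buffer)
```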

Batch Processing

If your model supports batch processing, override the score_sequences method:

from typing import List

from proteinbias import BaseModel

class MyBatchModel(BaseModel):
    def score_sequences(self, sequences: List[str]) -> List[float]:
        # Implement efficient batch scoring: return one score per input sequence
        return batch_score(sequences)
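In the class above, batch_score is a placeholder. One common way to fill it in is to split the input into fixed-size chunks and score each chunk in turn, as in the sketch below. The helper names, the default batch size of 32, and the score_batch callable are assumptions for illustration and are not part of the proteinbias API.

```python
from typing import Callable, Iterator, List


def chunked(sequences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


def score_in_batches(
    sequences: List[str],
    score_batch: Callable[[List[str]], List[float]],
    batch_size: int = 32,
) -> List[float]:
    """Score all sequences batch by batch, preserving input order."""
    scores: List[float] = []
    for batch in chunked(sequences, batch_size):
        batch_scores = score_batch(batch)
        if len(batch_scores) != len(batch):
            raise ValueError("score_batch must return one score per sequence")
        scores.extend(batch_scores)
    return scores
```

Keeping the output in input order matters because scores are matched back to sequences by position.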

Configuration Options

# Run with custom parameters
proteinbias run-benchmark \
    --model-file my_model.py \
    --model-class MyModel \
    --num-replicates 100 \
    --k-factor 32 \
    --n-jobs 8 \
    --output results.csv \
    --summary summary.csv

Model Implementation Examples

See examples/example_model.py for several example implementations:

RandomModel: Simple random baseline
SequenceLengthModel: Length-based scoring
AminoAcidCompositionModel: Composition-based scoring

Default Dataset

The default dataset is the one used for the analysis in our paper. It consists of two files:

Protein Sequences: data/core/curated_protein_sequences.csv - sequences and metadata
Baseline Scores: data/core/baseline_scores.csv - ProGen2 and ESM2 model scores for sequences in the above file

Additionally, data used for the "time to the last common ancestor" analysis in the paper can be found at data/core/selected_species_with_time_to_common_ancestor_matrix.csv.

Currently the benchmark only supports the default dataset, but contributions of new datasets would be welcome!

Community Contributions

Share your model results with the research community!

Community models are organized in individual directories under data/community/. Each model gets its own subdirectory containing:

model.py - Model implementation
scores.csv - Pre-computed scores for the benchmark dataset
summary.csv - Benchmark results and bias statistics

See data/community/README.md for detailed contribution guidelines.

# View all model results (baseline + community)
proteinbias list-results

Important: The list-results command only shows:

Baseline models (from the default benchmark dataset)
Community models (from data/community/ subdirectories)

It combines the summary.csv files for different models, and does not depend on the raw scores. To see your own model in list-results, you must add your benchmark summary to the community directory as described in the contribution guidelines above.

CLI Reference

All Commands

Benchmarking:

proteinbias run-baseline [--output] [--summary] [--num-replicates] [--k-factor]
proteinbias run-benchmark [--model-file] [--model-class] [--output] [--summary]
proteinbias run-cached-benchmark <models> [--output] [--summary]

Score Management:

proteinbias add-scores <file> <model_name> [--overwrite]
proteinbias list-scores
proteinbias export-scores <model_name> <output_file>
proteinbias clear-cache [--model-name <name> | --all]

Model Results:

proteinbias list-results [--sort-by range|std|iqr|model_name|none]

Contributing

1. Fork the repository
2. Create a feature branch
3. Install dev requirements and the pre-commit tool:

uvx pre-commit install

4. Add your improvements
5. Submit a pull request

Thank you!

License

MIT License - see LICENSE file for details.

Citation

If you use this benchmarking framework in your research, please cite:

@article{ding2024protein,
  title={Protein language models are biased by unequal sequence sampling across the tree of life},
  author={Ding, Frances and Steinhardt, Jacob},
  journal={BioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
