S2AFF - Semantic Scholar Affiliations Linker

This repository contains code that links raw affiliation strings to ROR ids.

This is a 3-stage model:

  1. Raw affiliation strings are run through a NER transformer that parses them into main institute, child institute, and address components.
  2. A retrieval index fetches the top 100 candidates. This component has an MRR of about 0.90 on our gold data, with a recall@100 of 0.984.
  3. A pairwise, feature-based LightGBM model reranks the top 100 candidates. This component has a precision@1 of about 0.97 and an MRR of 0.98.

The model has two heuristic rules for predicting "no ROR found": (1) the score of the top candidate is below 0.3, and (2) the difference between the top score and the second-highest score is less than 0.2. These thresholds were found by empirical exploration. See scripts/analyze_thresholds.ipynb for more details.
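As a minimal sketch of the two threshold rules above: the function below is hypothetical (it is not taken from the S2AFF source) and reports each rule separately, because this README does not spell out how the two rules are combined. The parameter names are illustrative; only the values 0.3 and 0.2 come from the text.

```python
from typing import Sequence, Tuple


def no_ror_rule_flags(
    scores: Sequence[float],
    min_top_score: float = 0.3,  # threshold from rule (1)
    min_score_gap: float = 0.2,  # threshold from rule (2)
) -> Tuple[bool, bool]:
    """Return (rule_1_fires, rule_2_fires) for a list of reranker scores."""
    ranked = sorted(scores, reverse=True)
    if not ranked:
        # Assumption: with no candidates at all, treat both rules as firing.
        return True, True
    top = ranked[0]
    # With a single candidate there is no runner-up, so the gap rule cannot fire.
    gap = top - ranked[1] if len(ranked) > 1 else float("inf")
    return top < min_top_score, gap < min_score_gap


print(no_ror_rule_flags([0.92, 0.15, 0.04]))  # confident, well-separated top candidate
print(no_ror_rule_flags([0.25, 0.21]))        # low top score and a small gap
```

How the two flags turn into a final "no ROR found" decision is left to the caller here; check the notebook referenced above for the behavior the authors settled on.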

Installation

Quickstart (Python-only)

git clone https://github.com/allenai/S2AFF.git
cd S2AFF
uv python install 3.11.13
uv venv --python 3.11.13

uv pip install -e .

No activation required; commands below use uv run --python .venv.

Download the models (expected size ~3.4 GiB):

AWS_REGION=us-west-2 aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2aff-release data/

Optional: Rust acceleration (faster)

Rust is optional. If the Rust extension is installed, it becomes the default pipeline; S2AFF falls back to Python when Rust is unavailable.

Build requirements

  • Rust toolchain (rustup)
  • maturin (handled automatically by uv pip install -e s2aff_rust)

macOS/Linux

# install rust (if needed)
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env

# from repo root
uv pip install -e s2aff_rust

Windows (PowerShell)

# install rust (if needed) via rustup-init.exe or winget
# https://rustup.rs

$env:PATH="$env:USERPROFILE\.cargo\bin;$env:PATH"
uv pip install -e s2aff_rust

Notes & troubleshooting

  • PyTorch installs via the CUDA wheel index in pyproject.toml, so uv pip install -e . will pull GPU wheels when available and fall back to CPU-only wheels otherwise.
  • If you need a specific CUDA build, install torch explicitly before running the editable install.
  • If you see rustc is not installed or not in PATH, ensure <wherever>\.cargo\bin is on PATH (or open a new shell after installing rustup).
  • First Rust run builds a cache under ~/.s2aff/indices/rust/ror_index_<hash>.bin.
  • Use S2AFF_RUST_LOG=1 for verbose Rust timing logs.
  • Runtime toggles (set the value to either python or rust):
    • macOS/Linux: export S2AFF_PIPELINE=rust
    • Windows (PowerShell): $env:S2AFF_PIPELINE="rust"
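The same toggles can also be set from inside a Python process. The helper below is a hypothetical convenience, not part of S2AFF; it only writes the two environment variables named above. It assumes the variables are read when s2aff is imported or first used, so it should run before the import.

```python
import os

VALID_PIPELINES = {"python", "rust"}  # the two values listed in this README


def select_pipeline(name: str, verbose_rust_logs: bool = False) -> None:
    """Hypothetical helper: set the S2AFF toggles for the current process."""
    if name not in VALID_PIPELINES:
        raise ValueError(
            f"S2AFF_PIPELINE must be one of {sorted(VALID_PIPELINES)}, got {name!r}"
        )
    os.environ["S2AFF_PIPELINE"] = name
    if verbose_rust_logs:
        os.environ["S2AFF_RUST_LOG"] = "1"


select_pipeline("python")
# Import s2aff only after this point (assumption: toggles are read at import or first use).
```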

Updating the ROR Database

S2AFF can read ROR data from local files or from the public S3 release bucket. The workflow depends on whether you want a local-only update or to publish a new default version.

What S2AFF uses by default

  • On import, s2aff/consts.py tries to discover the latest *-ror-data.json in the public S3 bucket.
  • If S3 discovery fails, it falls back to DEFAULT_ROR_VERSION.
  • You can override this with S2AFF_ROR_VERSION (recommended for local testing).
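The resolution order described above can be sketched as follows. This is an illustration under stated assumptions, not the code in s2aff/consts.py: the function name, the `discover_latest` callback, and the placeholder default are all hypothetical, and it assumes the environment override takes precedence over S3 discovery.

```python
from typing import Callable, Mapping, Optional


def resolve_ror_version(
    env: Mapping[str, str],
    discover_latest: Callable[[], Optional[str]],
    default_version: str,
) -> str:
    """Hypothetical resolution order: env override, then S3 discovery, then default."""
    override = env.get("S2AFF_ROR_VERSION")
    if override:
        return override
    try:
        latest = discover_latest()
    except Exception:
        latest = None  # any discovery failure falls through to the default
    return latest or default_version


def failing_discovery() -> Optional[str]:
    # Stand-in for an S3 listing that fails, e.g. when offline.
    raise ConnectionError("offline")


print(resolve_ror_version({"S2AFF_ROR_VERSION": "v2.1-2026-01-15"}, failing_discovery, "placeholder-default"))
print(resolve_ror_version({}, failing_discovery, "placeholder-default"))
```

Passing the environment and the discovery step in as arguments keeps the sketch testable without network access.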

Local-only update (no S3 publish)

  1. Download the latest ROR file:
    • uv run --python .venv python data/download_latest_ror.py --out-dir data
  2. Point S2AFF at the new file:
    • S2AFF_ROR_VERSION=<new-version> (e.g., v2.1-2026-01-15)
    • or update DEFAULT_ROR_VERSION in s2aff/consts.py
  3. (Optional) Refresh OpenAlex works counts:
    • uv run --python .venv python scripts/update_openalex_works_counts.py
    • You do not need to do this for every ROR update.

Publish a new default version to S3 (AI2 workflow)

There are two options; use one of them:

  1. CI job (preferred at AI2):
  2. Manual upload:
    • uv run --python .venv python update_ror.py
    • This pulls the latest ROR ZIP from Zenodo, extracts the JSON, and uploads to S3.

Note: new S2AFF deployments pick up the S3 default automatically, but the Timo s2aff endpoint does not. To update Timo, publish a new S2AFF release on PyPI that points to the new ROR version, then update the Timo config at https://github.com/allenai/timo/blob/c31be89c6da008bb81588ac2c31c28fe842e09b1/timo_services/configs/s2aff_v2.py#L13 and redeploy. Timo is effectively in maintenance mode; consider a standalone API if you need a long-lived endpoint.

OpenAlex works counts (optional)

If you update ROR and want fresh works_count values:

  1. uv run --python .venv python scripts/update_openalex_works_counts.py
  2. Upload openalex_works_counts.csv to s3://ai2-s2-research-public/s2aff-release/ (AI2 only)

[If you work at AI2] If you publish a new ROR version to S3, consider updating OpenAlex counts and uploading the refreshed openalex_works_counts.csv to s3://ai2-s2-research-public/s2aff-release/.

Modifying the ROR Database to Reduce Systematic Errors

Sometimes the ROR database has issues that we'd like to patch. For example, the ROR entry https://ror.org/02b34td92 has the alias "ai", which is very common and ends up being a source of many errors. An entry in data/ror_edits.jsonl patches this. To add more patches, add a new line to this file in the following format:

{"ror_id": "https://ror.org/02b34td92", "action": "remove", "key": "aliases", "value": "ai"}

The action can be remove or add. The key is usually aliases or acronyms. Commit your changes so that the main repo is updated.
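To make the patch format concrete, here is a minimal sketch of how one such line could be applied to an in-memory record. The `apply_ror_edit` helper and the toy record layout (a dict of list-valued fields keyed by ROR id) are assumptions for illustration; the actual S2AFF loader may apply edits differently.

```python
import json
from typing import Dict, List

RorRecords = Dict[str, Dict[str, List[str]]]


def apply_ror_edit(ror_records: RorRecords, edit_line: str) -> None:
    """Apply one ror_edits.jsonl line in place (hypothetical helper)."""
    edit = json.loads(edit_line)
    values = ror_records[edit["ror_id"]].setdefault(edit["key"], [])
    if edit["action"] == "remove":
        if edit["value"] in values:
            values.remove(edit["value"])
    elif edit["action"] == "add":
        if edit["value"] not in values:
            values.append(edit["value"])
    else:
        raise ValueError(f"unknown action: {edit['action']!r}")


# Toy record: only the "ai" alias comes from this README; the other value is a placeholder.
records: RorRecords = {"https://ror.org/02b34td92": {"aliases": ["ai", "placeholder alias"]}}
line = '{"ror_id": "https://ror.org/02b34td92", "action": "remove", "key": "aliases", "value": "ai"}'
apply_ror_edit(records, line)
print(records["https://ror.org/02b34td92"]["aliases"])
```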

Performance on Gold Data

The data directory of this repository includes a dataset of challenging raw affiliation strings. Each string has been manually assigned one or more ROR ids, except for some that have no correct ROR at all. Please see data/gold_affiliations.csv. The training split of this data has been used to tweak the first-stage ROR indexer as well as to train the second-stage LightGBM pairwise model.

Data sizes:

split   n samples
train   1132
test    644
val     588

Latency (GPU, 100 affiliations)

Measured on this machine using:

$env:S2AFF_USE_CUDA="1"; $env:S2AFF_PIPELINE="python"; uv run --python .venv python scripts/approximate_runtime_and_memory_usage.py
$env:S2AFF_USE_CUDA="1"; $env:S2AFF_PIPELINE="rust"; uv run --python .venv python scripts/approximate_runtime_and_memory_usage.py

Python pipeline (Measured Jan 24, 2026):

  • NER: 2.061s
  • Stage 1 candidates: 48.745s
  • LightGBM rerank (top 100): 25.492s
  • Total timed stages: ~76.30s

Rust pipeline (Measured Jan 26, 2026; Rust cache hit):

  • NER: 1.981s
  • Stage 1 candidates: 2.760s
  • LightGBM rerank (top 100): 7.361s
  • Total timed stages: ~12.10s
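For a quick comparison, the per-stage speedups implied by the two sets of timings above can be computed directly. The numbers are copied from the lists above; keep in mind that the two runs were measured on different days, so the ratios are indicative only.

```python
# Timings in seconds, copied from the two lists above.
python_s = {"ner": 2.061, "stage1": 48.745, "rerank": 25.492}
rust_s = {"ner": 1.981, "stage1": 2.760, "rerank": 7.361}

# Ratio of Python time to Rust time for each stage and for the summed stages.
speedup = {stage: python_s[stage] / rust_s[stage] for stage in python_s}
total_speedup = sum(python_s.values()) / sum(rust_s.values())

for stage, ratio in speedup.items():
    print(f"{stage}: {ratio:.1f}x")
print(f"total: {total_speedup:.1f}x")
```

This works out to roughly 17.7x for stage 1 candidate retrieval, 3.5x for reranking, and 6.3x over the summed stages, while NER time is essentially unchanged.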

First-Stage Model Performance

The following values are obtained when combining training, validation and test sets:

  • MRR: 0.897
  • Recall@25: 0.971
  • Recall@100: 0.984 <- reranking at this point
  • Recall@250: 0.986
  • Recall@500: 0.988
  • Recall@1000: 0.990
  • Recall@10000: 0.992
  • Recall@100000: 0.993

See scripts/evaluate_first_stage_ranker.py for methodology.
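For readers unfamiliar with the metrics, the sketch below shows one common way to compute MRR and Recall@k. It is not taken from scripts/evaluate_first_stage_ranker.py; in particular, it assumes that a query counts as a hit at k when any of its gold ids appears in the top k, and the ids in the toy data are placeholders.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold id in the ranking, or 0.0 if none is present."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold id appears in the top k candidates, else 0.0 (assumed definition)."""
    return float(any(candidate in gold for candidate in ranked[:k]))


# Toy data: (ranked candidate ids, gold ids) per query.
queries: List[Tuple[List[str], Set[str]]] = [
    (["a", "b", "c"], {"a"}),
    (["b", "a", "c"], {"a"}),
    (["c", "b", "x"], {"a"}),
]
mrr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
recall_at_2 = sum(hit_at_k(r, g, 2) for r, g in queries) / len(queries)
print(f"MRR={mrr:.3f} Recall@2={recall_at_2:.3f}")
```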

Second-Stage Model Performance

When reranking the top 100 candidates from the first stage model, the second stage model achieves the following:

  • Train - MAP: 0.946, MRR: 0.958, Prec@1: 0.931
  • Validation - MAP: 0.973, MRR: 0.981, Prec@1: 0.966
  • Test - MAP: 0.978, MRR: 0.981, Prec@1: 0.965

See scripts/train_lightgbm_ranker.py for details. The above values are computed in lines 217 to 219.
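As a reference for the metric names, here is a minimal sketch of MAP and Prec@1 using their standard textbook definitions. It is an assumption that the training script uses the same definitions; the function names and the toy ids are illustrative only.

```python
from typing import List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Sum of precision at each rank where a gold id appears, divided by len(gold)."""
    hits, precision_sum = 0, 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold) if gold else 0.0


def precision_at_1(ranked: Sequence[str], gold: Set[str]) -> float:
    """1.0 if the top-ranked candidate is a gold id, else 0.0."""
    return float(bool(ranked) and ranked[0] in gold)


# Toy data: (reranked candidate ids, gold ids) per query.
queries: List[Tuple[List[str], Set[str]]] = [
    (["a", "x", "b"], {"a", "b"}),
    (["x", "a"], {"a"}),
]
map_score = sum(average_precision(r, g) for r, g in queries) / len(queries)
prec_at_1 = sum(precision_at_1(r, g) for r, g in queries) / len(queries)
print(f"MAP={map_score:.3f} Prec@1={prec_at_1:.3f}")
```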

Getting Started

Once you've downloaded all of the model files, you can get up and running as follows:

from s2aff import S2AFF
from s2aff.ror import RORIndex
from s2aff.model import NERPredictor, PairwiseRORLightGBMReranker

# an example affiliation
raw_affiliation = "Department of Horse Racing, University of Washington, Seattle, WA 98115 USA"

# instantiate the ner model and the ror index. loading first time will take ~10-30s
ner_predictor = NERPredictor(use_cuda=False)  # defaults to models in the data directory
ror_index = RORIndex()  # ditto
pairwise_model = PairwiseRORLightGBMReranker(ror_index)

# the easiest way to use this package is with the `S2AFF` class.
s2aff = S2AFF(ner_predictor, ror_index, pairwise_model)
predictions = s2aff.predict([raw_affiliation])  # takes lists of strings
print(predictions)

# alternatively, you can also do it more manually
# first we get some fast-ish first-stage candidates using the NERPredictor and RORIndex
candidates, scores = ror_index.get_candidates_from_raw_affiliation(raw_affiliation, ner_predictor)

# print out the top 5 candidates and scores from the first stage
# University of Washington is already first
for i, j in zip(candidates[:5], scores[:5]):
    print(ror_index.ror_dict[i]["name"], j)

# NOTE: ner_predictor should be batched if possible, as that will massively increase retrieval speed
# this is taken care of for you in the S2AFF class

# next, we rerank the top 100 candidates using the pairwise model
reranked_candidates, reranked_scores = pairwise_model.predict(raw_affiliation, candidates[:100], scores[:100])

# print out the top 5 candidates and scores from the second stage
# University of Washington is first with a high certainty
# but so is Seattle University
for i, j in zip(reranked_candidates[:5], reranked_scores[:5]):
    print(ror_index.ror_dict[i]["name"], j)

Testing

Quick Start

To run the fast unit test suite without pulling large artifacts:

uv run --python .venv python -m pytest -m "not slow and not requires_models"

This runs the fast unit tests that test core functionality without requiring model downloads.

Full Test Suite

Run the full suite including integration tests (requires downloaded models):

uv run --python .venv python -m pytest

When the model files are present under data/, tests marked requires_models run automatically. To force-run even if models are not detected, use --run-requires-models or set S2AFF_RUN_REQUIRES_MODELS=1. To skip them even when models are present, set S2AFF_SKIP_REQUIRES_MODELS=1.
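The run/skip behavior described above can be summarized as a small decision function. This is a hypothetical sketch, not the project's conftest.py: the function name is invented, and the precedence between the skip and force settings is an assumption (here an explicit skip wins), since this README does not state it.

```python
from typing import Mapping


def should_run_requires_models(
    models_present: bool,
    env: Mapping[str, str],
    cli_force: bool = False,  # stands in for the --run-requires-models flag
) -> bool:
    """Hypothetical decision logic for tests marked requires_models."""
    # Assumption: an explicit skip takes precedence over everything else.
    if env.get("S2AFF_SKIP_REQUIRES_MODELS") == "1":
        return False
    # Forcing via the flag or the environment runs the tests even without models.
    if cli_force or env.get("S2AFF_RUN_REQUIRES_MODELS") == "1":
        return True
    # Default: run only when the model files were detected under data/.
    return models_present


print(should_run_requires_models(models_present=True, env={}))
print(should_run_requires_models(models_present=False, env={"S2AFF_RUN_REQUIRES_MODELS": "1"}))
```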

Parity Harness (Requires Models)

Run exact old-vs-new parity checks (stage1, stage2, end2end):

uv run --python .venv python scripts/run_parity.py --mode all

To enable CUDA for NER in parity checks:

uv run --python .venv python scripts/run_parity.py --mode all --use-cuda

Coverage

Generate a coverage report:

uv run --python .venv python -m pytest --cov=s2aff --cov-report=term-missing

Continuous Integration

The project uses GitHub Actions for CI with two test jobs:

  • Fast Tests: Run on every push/PR without model downloads (~20 seconds)
  • Full Tests: Run weekly or on-demand with models (~5 minutes, downloads ~3.4GB once)

Why Not Rerank with BERT?

Because it's worse! We couldn't make roberta-large any better than what you see below (validation scores).

[Image: roberta-large reranker validation scores]

TODO

S2AFF has not been trained to handle non-English affiliations. It may work for some Latin-script languages, but there is no guarantee. If this model is deployed, a language detector should first be applied to the affiliation strings. An analysis will then be needed to determine which languages are too error-prone.
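As a very rough first filter, one could flag strings that contain letters outside the Latin script before sending them to S2AFF. Note that this is a script check based on Unicode character names, not a language detector: Latin-script languages such as German or French pass through unflagged, so a real language detector would still be needed for the analysis described above. The function name is hypothetical.

```python
import unicodedata


def has_non_latin_letters(text: str) -> bool:
    """True if any alphabetic character is not a Latin letter (by Unicode name)."""
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


for affiliation in [
    "Department of Horse Racing, University of Washington, Seattle, WA 98115 USA",
    "Universität Zürich, Switzerland",
    "北京大学",
]:
    print(has_non_latin_letters(affiliation), affiliation)
```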
