CLI • Python API • Citation
Attention: This is currently under review.
The lsr-benchmark aims to support holistic evaluations of the learned sparse retrieval paradigm to contrast efficiency and effectiveness across diverse retrieval scenarios.
The learned sparse retrieval paradigm conducts retrieval in three steps:
- Documents are segmented into passages so that the passages can be processed by pre-trained transformers.
- Documents and queries are embedded into sparse learned embeddings.
- Retrieval systems create an index of the document embeddings to return a ranking for each embedded query.
You can submit solutions to step 2 (i.e., models that embed documents and queries into sparse embeddings) and/or solutions to step 3 (i.e., retrieval systems). The idea is then to validate all combinations of embeddings with all retrieval systems to identify which solutions work well for which use case, taking different notions of efficiency/effectiveness trade-offs into consideration. The passage segmentation for step 1 is open source (i.e., created via lsr-benchmark segment-corpus <IR-DATASETS-ID>) but fixed for this task.
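To make the paradigm concrete, the following minimal sketch (with made-up toy weights, not the output of any particular model) scores a sparse query embedding against a sparse document embedding via a dot product over their shared terms:
# sparse learned embeddings map a small number of terms to learned weights
query_embedding = {"term-1": 0.8, "term-4": 0.3}
doc_embedding = {"term-1": 0.123, "term-2": 0.912, "term-4": 0.043}

# the relevance score is the dot product over the terms shared by query and document
score = sum(weight * doc_embedding.get(term, 0.0) for term, weight in query_embedding.items())
print(score)  # ≈ 0.1113 = 0.8 * 0.123 + 0.3 * 0.043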
You can install the lsr_benchmark via:
pip3 install lsr_benchmark
If you want the latest features, you can install from the main branch:
pip3 install git+https://github.com/reneuir/lsr-benchmark.git
Please run lsr-benchmark overview for an up-to-date overview of all datasets and all embeddings. Alternatively, the online overview in TIRA provides the same information.
We have a suite of unit tests that you can run via:
# first install the local version of the lsr-benchmark
pip3 install -e .[dev,test]
# then run the unit tests
pytest .
We have a set of tutorials available.
The lsr-benchmark --help command serves as the entry point to the documentation.
Instructions to add new datasets are available in the data directory.
- ToDo: Write how to add new datasets, embeddings, retrieval, evaluation
- short video
The data input and output formats aim to support slicing and dicing diverse query and document distributions while enabling caching, in support of GreenIR research.
You can slice and dice the document texts and document embeddings via the API. The document texts for private corpora are only available within the TIRA sandbox, whereas the document embeddings are publicly available for all corpora (as one cannot reconstruct the original documents from sparse embeddings).
import ir_datasets
import lsr_benchmark

dataset = lsr_benchmark.load('<IR-DATASETS-ID>')

# process the document embeddings:
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>', passage_aggregation="first-passage"):
    doc  # namedtuple<doc_id, embedding>

# process the document embeddings for all segments:
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>'):
    doc  # namedtuple<doc_id, segments.embedding>

# process the document texts:
for doc in dataset.docs_iter(embedding=None):
    doc  # namedtuple<doc_id, segments.text>

# process the document texts via segmented versions in ir_datasets:
lsr_benchmark.register_to_ir_datasets()
for segmented_doc in ir_datasets.load("lsr-benchmark/<IR-DATASETS-ID>/segmented").docs_iter():
    segmented_doc  # namedtuple<doc_id, segment>
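Building on this API, the following hedged sketch shows how a retrieval system (step 3) could index the aggregated document embeddings in memory and rank documents for a query. It assumes that each document's embedding field is a term-to-weight mapping (as in the embedding format shown below) and uses a hand-crafted query embedding as a stand-in for a real query encoder:
from collections import defaultdict

import lsr_benchmark

dataset = lsr_benchmark.load('<IR-DATASETS-ID>')

# toy in-memory inverted index: term -> list of (doc_id, weight) postings
index = defaultdict(list)
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>', passage_aggregation="first-passage"):
    for term, weight in doc.embedding.items():
        index[term].append((doc.doc_id, weight))

# hand-crafted query embedding as a stand-in for the output of a query encoder
query_embedding = {"term-1": 0.8, "term-4": 0.3}

# accumulate dot-product scores term-at-a-time and rank the documents
scores = defaultdict(float)
for term, query_weight in query_embedding.items():
    for doc_id, doc_weight in index.get(term, []):
        scores[doc_id] += query_weight * doc_weight

ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)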
Inspired by the processing of MS MARCO v2.1, each document consists of a doc_id and a list of text segments that are short enough to be processed by pre-trained transformers. For instance, a document that consists of 4 passages (e.g., "text-of-passage-1 text-of-passage-2 text-of-passage-3 text-of-passage-4") would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "text": "text-of-passage-1 text-of-passage-2"}
- {"start": 2, "end": 3, "text": "text-of-passage-2 text-of-passage-3"}
- {"start": 3, "end": 4, "text": "text-of-passage-3 text-of-passage-4"}
Analogously, each embedded document consists of a doc_id and a list of segment embeddings, where each embedding maps terms to learned weights. For instance, the embeddings for the document above would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "embedding": {"term-1": 0.123, "term-2": 0.912}}
- {"start": 2, "end": 3, "embedding": {"term-1": 0.421, "term-3": 0.743}}
- {"start": 3, "end": 4, "embedding": {"term-2": 0.108, "term-4": 0.043}}
The online overview in TIRA shows aggregated evaluations. Alternatively, all data and further custom evaluations are available in the step-04-evaluation directory of this repository.
Our evaluation methodology encourages the development of diverse and novel measures for LSR models that take efficiency and effectiveness into consideration. We assume that a suitable interpretation of efficiency highly depends on the target application and its context. Therefore, we aim to measure as many efficiency-oriented aspects as possible in a standardized way with the tirex-tracker, so that different efficiency/effectiveness interpretations can be evaluated post-hoc. This methodology and related aspects were developed as part of the ReNeuIR workshop series held at SIGIR 2022, 2023, 2024, and 2025.
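As an example of such a post-hoc evaluation, effectiveness scores can be joined with the efficiency measurements recorded by the tirex-tracker into custom trade-off measures. The record layout and field names in the following sketch are hypothetical and only illustrate the idea:
# hypothetical per-run records; actual field names in lsr-benchmark / tirex-tracker may differ
runs = [
    {"run": "model-a+system-x", "ndcg@10": 0.71, "latency_ms": 45.0, "energy_joules": 3.2},
    {"run": "model-b+system-y", "ndcg@10": 0.68, "latency_ms": 12.0, "energy_joules": 0.9},
]

# one possible post-hoc trade-off: effectiveness per unit of spent energy
for run in runs:
    print(run["run"], run["ndcg@10"] / run["energy_joules"])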
