CLI • Python API • Citation
Attention: This is currently under review.
The lsr-benchmark aims to support holistic evaluations of the learned sparse retrieval paradigm to contrast efficiency and effectiveness across diverse retrieval scenarios.
The learned sparse retrieval paradigm conducts retrieval in three steps:
- Documents are segmented into passages so that the passages can be processed by pre-trained transformers.
- Documents and queries are embedded into sparse learned embeddings.
- Retrieval systems create an index of the document embeddings to return a ranking for each embedded query.
You can submit solutions to step 2 (i.e., models that embed documents and queries into sparse embeddings) and/or solutions to step 3 (i.e., retrieval systems). The idea is then to validate all combinations of embeddings with all retrieval systems to identify which solutions work well for which use case, taking different notions of efficiency/effectiveness trade-offs into consideration. The passage segmentation for step 1 is open source (i.e., created via lsr-benchmark segment-corpus <IR-DATASETS-ID>) but fixed for this task.
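To make the paradigm concrete, the following minimal sketch (with made-up toy weights, not the output of any particular model) scores a sparse query embedding against a sparse document embedding via a dot product over their shared terms:
# sparse learned embeddings map a small number of terms to learned weights
query_embedding = {"term-1": 0.8, "term-4": 0.3}
doc_embedding = {"term-1": 0.123, "term-2": 0.912, "term-4": 0.043}

# the relevance score is the dot product over the terms shared by query and document
score = sum(weight * doc_embedding.get(term, 0.0) for term, weight in query_embedding.items())
print(score)  # ≈ 0.1113 = 0.8 * 0.123 + 0.3 * 0.043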
You can install the lsr_benchmark via:
pip3 install lsr_benchmark
If you want the latest features, you can install from the main branch:
pip3 install git+https://github.com/reneuir/lsr-benchmark.git
Please run lsr-benchmark overview for an up-to-date overview of all datasets and all embeddings. Alternatively, the online overview in TIRA provides the same information.
We have a suite of unit tests that you can run via:
# first install the local version of the lsr-benchmark
pip3 install -e .[dev,test]
# then run the unit tests
pytest .
We have a set of tutorials available.
The lsr-benchmark --help command serves as the entry point to the documentation.
Instructions to add new datasets are available in the data directory.
- ToDo: Write how to add new datasets, embeddings, retrieval, evaluation
- short video
The data input and output formats aim to support slicing and dicing diverse query and document distributions while enabling caching, in support of GreenIR research.
You can slice and dice the document texts and document embeddings via the API. The document texts for private corpora are only available within the TIRA sandbox, whereas the document embeddings are publicly available for all corpora (as one cannot reconstruct the original documents from sparse embeddings).
import ir_datasets
import lsr_benchmark

dataset = lsr_benchmark.load('<IR-DATASETS-ID>')

# process the document embeddings:
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>', passage_aggregation="first-passage"):
    doc  # namedtuple<doc_id, embedding>

# process the document embeddings for all segments:
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>'):
    doc  # namedtuple<doc_id, segments.embedding>

# process the document texts:
for doc in dataset.docs_iter(embedding=None):
    doc  # namedtuple<doc_id, segments.text>

# process the document texts via segmented versions in ir_datasets:
lsr_benchmark.register_to_ir_datasets()
for segmented_doc in ir_datasets.load("lsr-benchmark/<IR-DATASETS-ID>/segmented").docs_iter():
    segmented_doc  # namedtuple<doc_id, segment>
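Building on this API, the following hedged sketch shows how a retrieval system (step 3) could index the aggregated document embeddings in memory and rank documents for a query. It assumes that each document's embedding field is a term-to-weight mapping (as in the embedding format shown below) and uses a hand-crafted query embedding as a stand-in for a real query encoder:
from collections import defaultdict

import lsr_benchmark

dataset = lsr_benchmark.load('<IR-DATASETS-ID>')

# toy in-memory inverted index: term -> list of (doc_id, weight) postings
index = defaultdict(list)
for doc in dataset.docs_iter(embedding='<EMBEDDING-MODEL>', passage_aggregation="first-passage"):
    for term, weight in doc.embedding.items():
        index[term].append((doc.doc_id, weight))

# hand-crafted query embedding as a stand-in for the output of a query encoder
query_embedding = {"term-1": 0.8, "term-4": 0.3}

# accumulate dot-product scores term-at-a-time and rank the documents
scores = defaultdict(float)
for term, query_weight in query_embedding.items():
    for doc_id, doc_weight in index.get(term, []):
        scores[doc_id] += query_weight * doc_weight

ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)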
Inspired by the processing of MS MARCO v2.1, each document consists of a doc_id and a list of text segments that are short enough to be processed by pre-trained transformers. For instance, a document that consists of 4 passages (e.g., "text-of-passage-1 text-of-passage-2 text-of-passage-3 text-of-passage-4") would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "text": "text-of-passage-1 text-of-passage-2"}
- {"start": 2, "end": 3, "text": "text-of-passage-2 text-of-passage-3"}
- {"start": 3, "end": 4, "text": "text-of-passage-3 text-of-passage-4"}
Analogously, each embedded document consists of a doc_id and a list of segment embeddings, where each embedding maps terms to learned weights. For instance, the embeddings for the document above would be represented as:
- doc_id: 12fd3396-e4d7-4c0f-b468-5a82402b5336
- segments:
- {"start": 1, "end": 2, "embedding": {"term-1": 0.123, "term-2": 0.912}}
- {"start": 2, "end": 3, "embedding": {"term-1": 0.421, "term-3": 0.743}}
- {"start": 3, "end": 4, "embedding": {"term-2": 0.108, "term-4": 0.043}}
The online overview in TIRA shows aggregated evaluations. Alternatively, all data and further custom evaluations are available in the step-04-evaluation directory of this repository.
Our evaluation methodology encourages the development of diverse and novel measures for LSR models that take efficiency and effectiveness into consideration. We assume that a suitable interpretation of efficiency highly depends on the target application and its context. Therefore, we aim to measure as many efficiency-oriented aspects as possible in a standardized way with the tirex-tracker, so that different efficiency/effectiveness interpretations can be evaluated post-hoc. This methodology and related aspects were developed as part of the ReNeuIR workshop series held at SIGIR 2022, 2023, 2024, and 2025.
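As an example of such a post-hoc evaluation, effectiveness scores can be joined with the efficiency measurements recorded by the tirex-tracker into custom trade-off measures. The record layout and field names in the following sketch are hypothetical and only illustrate the idea:
# hypothetical per-run records; actual field names in lsr-benchmark / tirex-tracker may differ
runs = [
    {"run": "model-a+system-x", "ndcg@10": 0.71, "latency_ms": 45.0, "energy_joules": 3.2},
    {"run": "model-b+system-y", "ndcg@10": 0.68, "latency_ms": 12.0, "energy_joules": 0.9},
]

# one possible post-hoc trade-off: effectiveness per unit of spent energy
for run in runs:
    print(run["run"], run["ndcg@10"] / run["energy_joules"])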
