
# Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval (ACL 2024)

## Abstract

Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query. Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning. However, the full power of pre-training for generative retrieval remains underexploited due to its reliance on pre-defined static document identifiers, which may not align with evolving model parameters. In this work, we introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the continuing memorization of the corpus. BootRet involves three key training phases: (i) initial identifier generation, (ii) pre-training via corpus indexing and relevance prediction tasks, and (iii) bootstrapping for identifier updates. To facilitate the pre-training phase, we further introduce noisy documents and pseudo-queries, generated by large language models, to resemble semantic connections in both indexing and retrieval tasks. Experimental results demonstrate that BootRet significantly outperforms existing pre-training generative retrieval baselines and performs well even in zero-shot settings.
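
The three phases above form an outer loop in which identifiers and model parameters are refined in alternation. The sketch below is a minimal illustration of that loop, not code from this repository; `generate_docids` and `pretrain_one_iteration` are hypothetical callables standing in for phases (i)/(iii) and (ii), respectively.

```python
def bootret_pretrain(model, corpus, generate_docids, pretrain_one_iteration, num_iters=3):
    """Illustrative bootstrapped pre-training loop (a sketch, not the paper's code).

    generate_docids(model, corpus) -> docids                  # phases (i) and (iii)
    pretrain_one_iteration(model, corpus, docids) -> model    # phase (ii)
    """
    # (i) initial identifier generation with the initial parameters theta^0
    docids = generate_docids(model, corpus)
    for t in range(1, num_iters + 1):
        # (ii) pre-train on corpus indexing (document -> docid) and
        # relevance prediction (pseudo-query -> docid) with the current docids
        model = pretrain_one_iteration(model, corpus, docids)
        # (iii) bootstrap: regenerate docids with the updated parameters theta^t
        docids = generate_docids(model, corpus)
    return model, docids
```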

## Method overview

The bootstrapped pre-training pipeline of BootRet. (1) The initial docids $I_D^0$ are obtained with the initial model parameters $\theta^0$. (2) To perform the $t$-th iteration, we design the corpus indexing task and the relevance prediction task for pre-training. We construct noisy documents and pseudo-queries with an LLM, and design contrastive losses (the yellow and orange rectangles) and a semantic consistency loss (the green rectangle) to learn the corpus and relevance information discriminatively. After pre-training, the model parameters are updated from $\theta^{t-1}$ to $\theta^t$. (3) The bootstrapped $\theta^t$ is used to dynamically update the docids $I_D^{t-1}$ to $I_D^t$, i.e., the bootstrapped docids, which are then used in the next iteration. (Best viewed in color.)
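
The contrastive and semantic consistency losses above are only named in this README, not defined. The PyTorch sketch below shows one plausible shape for them, assuming an InfoNCE-style contrastive term and an MSE consistency term; the function names, the temperature, and the MSE form are illustrative assumptions, not the paper's formulations.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive term (assumed form): pull each anchor
    toward its positive and away from K negatives."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1) / temperature                # (B,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # (B, K)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)                           # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)         # positive at index 0
    return F.cross_entropy(logits, labels)

def semantic_consistency_loss(doc_repr, query_repr):
    """Assumed MSE form: keep a document and its pseudo-query close
    in representation space."""
    return F.mse_loss(doc_repr, query_repr)

# Toy usage: documents vs. LLM-generated pseudo-queries, with in-batch negatives.
B, K, D = 4, 8, 16
doc = torch.randn(B, D)            # document representations
pseudo_query = torch.randn(B, D)   # pseudo-query representations (positives)
negatives = torch.randn(B, K, D)   # in-batch negatives
loss = contrastive_loss(doc, pseudo_query, negatives) \
       + semantic_consistency_loss(doc, pseudo_query)
```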

## Resources

- Paper
- Poster
- Slides
- Video
