Hi,
Thank you for your great work and for open-sourcing the code!
In issue #12 you mentioned:
"In our paper and code we used BEIR dataset. We used test split for NQ and HotpotQA, and train split for MS-MARCO."
I'm still a bit confused: Table 14 in the paper lists 6,980 queries for MS MARCO, but the official MS MARCO train split contains far more (roughly 500k+ unique queries with relevance labels). Incidentally, 6,980 is exactly the size of the standard MS MARCO passage dev (small) set, so I wonder whether the dev split was actually used there?
Could you please share the exact 6,980-query subset used in the paper, or point to the file/link/processing step that selects these queries?
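For context, this is roughly how I'm counting queries per split on my side: a minimal sketch that counts unique query ids in a BEIR-style qrels TSV (header row `query-id  corpus-id  score`). The TSV layout is my assumption based on the BEIR download format; please correct me if your pipeline differs.

```python
import csv
import io

def count_unique_queries(qrels_tsv: str) -> int:
    """Count unique query ids in a BEIR-style qrels TSV string
    (assumed header row: query-id, corpus-id, score)."""
    reader = csv.reader(io.StringIO(qrels_tsv), delimiter="\t")
    next(reader)  # skip the header row
    return len({row[0] for row in reader if row})

# Tiny synthetic example (not real MS MARCO data):
sample = (
    "query-id\tcorpus-id\tscore\n"
    "q1\td1\t1\n"
    "q1\td2\t1\n"  # same query, second judged passage
    "q2\td3\t1\n"
)
print(count_unique_queries(sample))  # prints 2
```

With the real qrels files, this count is what I'm comparing against the 6,980 figure from Table 14.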
Thank you so much for your time and for any clarification!