Question about MS MARCO dataset query count in Table 14 #28

@xiaohu677

Hi,

Thank you for your great work and for open-sourcing the code!

In issue #12 you mentioned:
"In our paper and code we used BEIR dataset. We used test split for NQ and HotpotQA, and train split for MS-MARCO."

I'm still a bit confused: Table 14 in the paper lists 6,980 queries for the MS MARCO training set, but the official MS MARCO train.tsv contains far more (roughly 500k unique queries with relevance labels).

Could you please share the exact 6,980-query subset used in the paper, or point to the file, link, or processing step that loads or selects these queries?
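
To make the mismatch easy to reproduce, here is a minimal sketch of how the number of unique queries in a qrels file could be counted. It assumes the BEIR-style TSV layout with a `query-id`, `corpus-id`, `score` header row. The function name `count_unique_queries` and the file path in the comment are illustrative, not taken from the paper's code.

```python
import csv
from typing import Iterable


def count_unique_queries(qrels_lines: Iterable[str]) -> int:
    """Count distinct query IDs in a BEIR-style qrels TSV.

    Assumes a header row with the columns: query-id, corpus-id, score.
    """
    reader = csv.DictReader(qrels_lines, delimiter="\t")
    return len({row["query-id"] for row in reader})


# Tiny in-memory example. For a real file (path is an assumption):
#   with open("msmarco/qrels/train.tsv", newline="") as f:
#       print(count_unique_queries(f))
sample = [
    "query-id\tcorpus-id\tscore\n",
    "q1\td10\t1\n",
    "q1\td11\t1\n",
    "q2\td20\t1\n",
]
print(count_unique_queries(sample))  # prints 2
```

Running this over each split's qrels file should show which split, if any, contains exactly 6,980 queries.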

Thank you so much for your time and for any clarification!
