
How to build an index for a BEIR dataset? #7

@BluesPizza

Description

I downloaded scifact and tried to build a FAISS index over the corpus.json in the zip file with Contriever (from Hugging Face), but no matter which index type I choose, retrieval performance is very poor. In fact, I'm not even sure which parts of corpus.json should go into "contents", so I only used "abstract" as "contents" and "doc_id" as "id".
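
For reference, here is roughly the conversion script I used to turn the corpus into Pyserini's {"id", "contents"} JSONL format (a sketch; the paths are placeholders, and the doc_id/abstract field names are just what I found in my copy of the corpus, where "abstract" is sometimes a list of sentences):

```python
import json

# Placeholder paths -- adjust to wherever the downloaded corpus and the
# desired Pyserini encode input actually live.
SRC = "/home/scifact/corpus.json"   # one JSON object per line in my copy
DST = "/home/scifact.jsonl"         # input for pyserini.encode

with open(SRC) as fin, open(DST, "w") as fout:
    for line in fin:
        doc = json.loads(line)
        abstract = doc["abstract"]
        # In my copy "abstract" is a list of sentences, so join it into one string.
        if isinstance(abstract, list):
            abstract = " ".join(abstract)
        # I only used the abstract as "contents" -- should the title be prepended?
        record = {"id": str(doc["doc_id"]), "contents": abstract}
        fout.write(json.dumps(record) + "\n")
```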
Below are the commands I ran, following the Pyserini guidelines:

  1. python -m pyserini.encode input --corpus /home/scifact.jsonl --fields text --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings /home/encoding --to-faiss encoder --encoder /home/facebook/contriever --fields text --batch 32 --fp16
  2. python -m pyserini.index.faiss --input /home/encoding --output /home/index --hnsw (I tried every index type Pyserini supports)

Could you tell me how to preprocess the various BEIR datasets and use Pyserini to build indexes for them? (It would be great if there were step-by-step instructions or sample preprocessed .jsonl files.)
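
For completeness, this is roughly how I query the resulting index to check retrieval quality (a sketch; I'm assuming FaissSearcher and AutoQueryEncoder from pyserini.search.faiss, with mean pooling since Contriever is a mean-pooled encoder):

```python
from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder

# The query encoder should match the document encoder used at indexing time;
# Contriever uses mean pooling over token embeddings and no L2 normalization.
encoder = AutoQueryEncoder("/home/facebook/contriever", pooling="mean", l2_norm=False)
searcher = FaissSearcher("/home/index", encoder)

hits = searcher.search("example scifact query text", k=10)
for hit in hits:
    print(hit.docid, hit.score)
```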
