I downloaded SciFact and used the corpus.jsonl in the zip file to build a Faiss index over it with Contriever (from Hugging Face), but no matter which index type I choose, retrieval performance is very poor. In fact, I am not even sure which part of corpus.jsonl should be used as "contents", so I only used "abstract" as the "contents" and "doc_id" as the "id".
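For concreteness, this is roughly the conversion I did to produce the JSONL that pyserini.encode reads (a minimal sketch; the field names, especially whether the document body should be called "text" or "contents", are my guess and may well be the problem):

```python
import json

# Minimal sketch of the conversion: one JSON object per line with an "id"
# and a "text" field, since my pyserini.encode command below uses
# "--fields text". I was not sure whether the body should go under "text"
# or "contents", or whether the title should be included as well.
with open('corpus.jsonl') as fin, open('/home/scifact.jsonl', 'w') as fout:
    for line in fin:
        doc = json.loads(line)
        out = {
            'id': str(doc['doc_id']),
            # "abstract" in the SciFact corpus is a list of sentences.
            'text': ' '.join(doc['abstract']),
        }
        fout.write(json.dumps(out) + '\n')
```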
Below are the commands I ran, following the Pyserini guidelines:
- `python -m pyserini.encode input --corpus /home/scifact.jsonl --fields text --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings /home/encoding --to-faiss encoder --encoder /home/facebook/contriever --fields text --batch 32 --fp16`
- `python -m pyserini.index.faiss --input /home/encoding --output /home/index --hnsw` (I tried every index type Pyserini supports; see the sanity-check sketch below)
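For reference, here is how I sanity-check the resulting index with a single free-form query (a minimal sketch; the pooling/normalization settings I pass to AutoQueryEncoder for Contriever are my guess, and the query string is arbitrary):

```python
from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder

# Load the query encoder from the same Contriever checkpoint used for encoding.
# As far as I know, Contriever uses mean pooling and no L2 normalization.
encoder = AutoQueryEncoder('/home/facebook/contriever', pooling='mean', l2_norm=False)
searcher = FaissSearcher('/home/index', encoder)

# Arbitrary query just to eyeball whether the top hits look reasonable.
hits = searcher.search('What is the role of mitochondria in apoptosis?', k=10)
for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:10} {hit.score:.4f}')
```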
Could you tell me how to process several of the BEIR datasets and use Pyserini to build indexes for them? (It would be ideal if there were instructions or sample processed .jsonl files.)