Fair and square model selection  #10

@forrestbao

Description

In the code below, we used two models of quite different capacities: for bert-score and bertscore-sentence-MNLI, we used RoBERTa-large, which is about 1.6 GB (the default for bert-score as implemented in HF's evaluate library). But for bertscore-sentence, which is built on top of Sentence-BERT, we used all-MiniLM-L6-v2, which is only 80 MB. This puts our bertscore-sentence approach at a huge disadvantage. Of course, we picked the small model to be fast in pilot studies.

https://github.com/SigmaWe/DocAsRef_0/blob/de4de4b4275e661621bebf3b2f92d8676e2f81c2/dar_env.py#L8-L11

I think that if we use a larger-capacity model for bertscore-sentence, we can further boost our sentence-based pairwise approach.

There are two directions we can try:

  1. A quick one: just use a larger model trained by the Sentence-BERT project. Let's try two: all-mpnet-base-v2 and all-roberta-large-v1. The former is still much smaller than RoBERTa-large but has higher scores on the Sentence-BERT leaderboard, while the latter is just RoBERTa-large trained with Sentence-BERT's dot-product loss. So let's test both of these two versions below:

       sent_embedder = sentence_transformers.SentenceTransformer("all-mpnet-base-v2") 
       sent_embedder = sentence_transformers.SentenceTransformer("all-roberta-large-v1") 

     BTW, we can use HF's transformers library for Sentence-BERT as well. That way, we don't have to import both transformers and sentence_transformers, and we can consolidate all the code under one framework (see the first sketch after this list).

  2. A slower but completely fair approach: we also use RoBERTa-large (generically pretrained, not fine-tuned on MNLI) to embed each sentence and extract the embedding corresponding to the [CLS] token. For how to do it, see here (a second sketch is below).
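
Regarding the BTW in direction 1, here is a rough, untested sketch of getting Sentence-BERT-style embeddings from transformers alone: mean pooling over token embeddings with the attention mask, then L2 normalization, which is how the all-* model cards describe reproducing sentence_transformers output. The embed helper name is mine, just for illustration:

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "sentence-transformers/all-mpnet-base-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    def embed(sentences):
        inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, dim)
        # Mean-pool over non-padding tokens, mirroring sentence_transformers' pooling layer
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # The all-* models are trained to produce normalized embeddings
        return torch.nn.functional.normalize(pooled, p=2, dim=1)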
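
For direction 2, a minimal sketch of what I have in mind, assuming we take the first-token embedding (<s>, RoBERTa's equivalent of [CLS]) from a generically pretrained roberta-large; cls_embed is a hypothetical helper name:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = AutoModel.from_pretrained("roberta-large")

    def cls_embed(sentences):
        inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            last_hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 1024)
        # RoBERTa prepends <s> (its [CLS] equivalent) to every sequence, at position 0
        return last_hidden[:, 0, :]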
