Question about the results in AlpacaEval2.0 #86
Description
Thanks for your excellent work. I ran into some issues when reproducing the results with AlpacaEval 2.0.
I installed alpaca_eval==0.6.2 and downloaded the model from princeton-nlp/Mistral-7B-Base-SFT-SimPO.
Then I ran the command:

```shell
alpaca_eval evaluate_from_model --model_configs '/apdcephfs/share_303747097/visityang/DPO_test_open/alpaca_eval/config/Mistral-7B-Base-SFT-SimPO.yaml' --annotators_config weighted_alpaca_eval_gpt4_turbo
```
with the following config:
```yaml
Mistral-7B-Base-SFT-SimPO:
  prompt_template: /templates/mistral_base.txt
  fn_completions: vllm_local_completions
  completions_kwargs:
    model_name: /princeton-nlp/Mistral-7B-Base-SFT-SimPO
    model_kwargs:
      torch_dtype: 'bfloat16'
    max_new_tokens: 2048
    temperature: 0.7
    top_p: 1.0
    do_sample: True
    batch_size: 900
  pretty_name: Mistral-7B-Base-SFT-SimPO
```
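To double-check that the prompt template was being applied the way I expected, I used a small sketch like the one below. Note the template string here is a placeholder I made up for illustration; it is not the actual contents of `mistral_base.txt`, and `build_prompt` is a hypothetical helper, not part of alpaca_eval:

```python
# Hypothetical sketch of how an alpaca_eval-style prompt template gets filled in.
# The template string is an assumption, NOT the real mistral_base.txt contents.
TEMPLATE = "<s>[INST] {instruction} [/INST]"

def build_prompt(instruction: str, template: str = TEMPLATE) -> str:
    """Substitute a single instruction into the prompt template."""
    return template.format(instruction=instruction)

print(build_prompt("What is 2 + 2?"))
# → <s>[INST] What is 2 + 2? [/INST]
```

Printing a few formatted prompts this way makes it easy to spot a mismatched chat format before spending annotation budget on a full run.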
The leaderboard results are as follows, but they differ significantly from the results reported in the paper.
| model | length_controlled_winrate | win_rate | standard_error | n_total | avg_length |
|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09 | 55.02 | 46.12 | 1.47 | 805 | 1802 |
| gpt4_1106_preview | 50.00 | 50.00 | 0.00 | 805 | 2049 |
| claude-3-opus-20240229 | 40.51 | 29.11 | 1.39 | 805 | 1388 |
| claude-3-sonnet-20240229 | 34.87 | 25.56 | 1.34 | 805 | 1420 |
| Meta-Llama-3-70B-Instruct | 34.42 | 33.18 | 1.39 | 805 | 1919 |
| gemini-pro | 24.38 | 18.18 | 1.16 | 805 | 1456 |
| Mixtral-8x7B-Instruct-v0.1 | 23.69 | 18.26 | 1.19 | 805 | 1465 |
| Meta-Llama-3-8B-Instruct | 22.92 | 22.57 | 1.26 | 805 | 1899 |
| Mistral-7B-Instruct-v0.2 | 17.11 | 14.72 | 1.08 | 805 | 1676 |
| alpaca-7b | 5.88 | 2.59 | 0.49 | 805 | 396 |
| Mistral-7B-Base-SFT-SimPO | 5.48 | 6.48 | 0.85 | 805 | 2022 |
Looking forward to your reply, or if anyone else can help me with this problem!
Thanks very much.