Question about the results in AlpacaEval2.0 #86
Description
Thanks for your excellent work. I ran into some issues when reproducing the results with AlpacaEval 2.0.
I installed alpaca_eval==0.6.2 and downloaded the model from princeton-nlp/Mistral-7B-Base-SFT-SimPO.
Then I ran the command:

```shell
alpaca_eval evaluate_from_model --model_configs '/apdcephfs/share_303747097/visityang/DPO_test_open/alpaca_eval/config/Mistral-7B-Base-SFT-SimPO.yaml' --annotators_config weighted_alpaca_eval_gpt4_turbo
```
with the following config:
```yaml
Mistral-7B-Base-SFT-SimPO:
  prompt_template: /templates/mistral_base.txt
  fn_completions: vllm_local_completions
  completions_kwargs:
    model_name: /princeton-nlp/Mistral-7B-Base-SFT-SimPO
    model_kwargs:
      torch_dtype: 'bfloat16'
    max_new_tokens: 2048
    temperature: 0.7
    top_p: 1.0
    do_sample: True
    batch_size: 900
  pretty_name: Mistral-7B-Base-SFT-SimPO
```
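To double-check that the prompt template was being applied the way I expected, I used a small sketch like the one below. Note the template string here is a placeholder I made up for illustration; it is not the actual contents of `mistral_base.txt`, and `build_prompt` is a hypothetical helper, not part of alpaca_eval:

```python
# Hypothetical sketch of how an alpaca_eval-style prompt template gets filled in.
# The template string is an assumption, NOT the real mistral_base.txt contents.
TEMPLATE = "<s>[INST] {instruction} [/INST]"

def build_prompt(instruction: str, template: str = TEMPLATE) -> str:
    """Substitute a single instruction into the prompt template."""
    return template.format(instruction=instruction)

print(build_prompt("What is 2 + 2?"))
# → <s>[INST] What is 2 + 2? [/INST]
```

Printing a few formatted prompts this way makes it easy to spot a mismatched chat format before spending annotation budget on a full run.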
The leaderboard results are as follows, but they differ significantly from the results reported in the paper.
| model | length_controlled_winrate | win_rate | standard_error | n_total | avg_length |
|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09 | 55.02 | 46.12 | 1.47 | 805 | 1802 |
| gpt4_1106_preview | 50.00 | 50.00 | 0.00 | 805 | 2049 |
| claude-3-opus-20240229 | 40.51 | 29.11 | 1.39 | 805 | 1388 |
| claude-3-sonnet-20240229 | 34.87 | 25.56 | 1.34 | 805 | 1420 |
| Meta-Llama-3-70B-Instruct | 34.42 | 33.18 | 1.39 | 805 | 1919 |
| gemini-pro | 24.38 | 18.18 | 1.16 | 805 | 1456 |
| Mixtral-8x7B-Instruct-v0.1 | 23.69 | 18.26 | 1.19 | 805 | 1465 |
| Meta-Llama-3-8B-Instruct | 22.92 | 22.57 | 1.26 | 805 | 1899 |
| Mistral-7B-Instruct-v0.2 | 17.11 | 14.72 | 1.08 | 805 | 1676 |
| alpaca-7b | 5.88 | 2.59 | 0.49 | 805 | 396 |
| Mistral-7B-Base-SFT-SimPO | 5.48 | 6.48 | 0.85 | 805 | 2022 |
Looking forward to your reply, or if anyone else can help me with this problem!
Thanks very much.