
Question about the results in AlpacaEval2.0 #86

@visitworld123

Description


Thanks for your excellent work. I ran into some issues when reproducing the results with AlpacaEval 2.0.
I installed alpaca_eval==0.6.2 and downloaded the model from princeton-nlp/Mistral-7B-Base-SFT-SimPO.
Then I ran the command
alpaca_eval evaluate_from_model --model_configs '/apdcephfs/share_303747097/visityang/DPO_test_open/alpaca_eval/config/Mistral-7B-Base-SFT-SimPO.yaml' --annotators_config weighted_alpaca_eval_gpt4_turbo
with the following config:

Mistral-7B-Base-SFT-SimPO:
  prompt_template: /templates/mistral_base.txt
  fn_completions: vllm_local_completions
  completions_kwargs:
    model_name: /princeton-nlp/Mistral-7B-Base-SFT-SimPO
    model_kwargs:
      torch_dtype: 'bfloat16'
    max_new_tokens: 2048
    temperature: 0.7
    top_p: 1.0
    do_sample: True
    batch_size: 900
  pretty_name: Mistral-7B-Base-SFT-SimPO

The leaderboard results are as follows, but they differ significantly from the results reported in the paper.

| model | length_controlled_winrate | win_rate | standard_error | n_total | avg_length |
|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09 | 55.02 | 46.12 | 1.47 | 805 | 1802 |
| gpt4_1106_preview | 50.00 | 50.00 | 0.00 | 805 | 2049 |
| claude-3-opus-20240229 | 40.51 | 29.11 | 1.39 | 805 | 1388 |
| claude-3-sonnet-20240229 | 34.87 | 25.56 | 1.34 | 805 | 1420 |
| Meta-Llama-3-70B-Instruct | 34.42 | 33.18 | 1.39 | 805 | 1919 |
| gemini-pro | 24.38 | 18.18 | 1.16 | 805 | 1456 |
| Mixtral-8x7B-Instruct-v0.1 | 23.69 | 18.26 | 1.19 | 805 | 1465 |
| Meta-Llama-3-8B-Instruct | 22.92 | 22.57 | 1.26 | 805 | 1899 |
| Mistral-7B-Instruct-v0.2 | 17.11 | 14.72 | 1.08 | 805 | 1676 |
| alpaca-7b | 5.88 | 2.59 | 0.49 | 805 | 396 |
| Mistral-7B-Base-SFT-SimPO | 5.48 | 6.48 | 0.85 | 805 | 2022 |
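In case it helps with debugging: my understanding is that win_rate is the mean per-example preference (1 = the evaluated model's output preferred over the baseline, 0 = baseline preferred; the weighted annotator can give fractional values) and standard_error is the standard error of that mean. A minimal sketch of that convention (my reading of the columns, not the library's actual code):

```python
import math

def summarize(preferences):
    """Summarize per-example preference scores.

    preferences: one score per instruction, where 1.0 means the
    evaluated model's output was preferred over the baseline and
    0.0 means the baseline was preferred (fractional values are
    possible with the weighted annotator).

    Returns (win_rate, standard_error), both in percent, mirroring
    the leaderboard columns above.
    """
    n = len(preferences)
    mean = sum(preferences) / n
    # Sample variance, then standard error of the mean.
    var = sum((p - mean) ** 2 for p in preferences) / (n - 1)
    se = math.sqrt(var / n)
    return 100 * mean, 100 * se

# Example: 3 wins out of 4 -> win_rate 75.0, standard_error 25.0
win_rate, standard_error = summarize([1.0, 0.0, 1.0, 1.0])
```

(The length_controlled_winrate column additionally adjusts for output length via a regression, which this sketch does not attempt to reproduce.)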

Looking forward to your reply, or to anyone else who can help with this problem.

Thanks very much!
