
Discrepancy between the metrics calculated in the scripts and the results from the paper! #28

@Arseny5

Hello! Thank you so much for your work.

I tried to calculate metrics with the script DiffuLLaMA/evaluation/eval-diffugpt.py for the model you uploaded to HF: https://huggingface.co/diffusionfamily/diffugpt-s
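
For reference, this is roughly how I load the checkpoint before running the evaluation. This is a minimal sketch under my own assumptions (not confirmed by the repo): that the weights load as a plain causal LM via transformers and that DiffuGPT reuses the GPT-2 tokenizer, since it is adapted from GPT-2.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions (mine): the checkpoint loads as a standard causal LM and the
# GPT-2 tokenizer applies. Diffusion sampling itself is done by the repo's
# own generation code, not model.generate().
model_id = "diffusionfamily/diffugpt-s"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

print(sum(p.numel() for p in model.parameters()), "parameters")
```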

The results did NOT match those reported in the paper, especially for the generative metrics:

| Metric | Paper (DiffuGPT-S, 1M steps) | Official model (HuggingFace) |
|---|---|---|
| LAMBADA | 21.6% | 8.19% |
| HellaSwag | 33.4% | 25.23% |
| WinoGrande | 50.8% | 51.07% |
| Social IQA | 37.0% | ? (timeout) |
| PIQA | 57.7% | ? (timeout) |
| ROCStories | 13.7 / 1.4 / 12.6 | 0.69 / 0.00 / 0.69 ❌❌❌ |
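
For the ROCStories row I am reading the three numbers as ROUGE-1 / ROUGE-2 / ROUGE-L F1 (my assumption from the table's formatting; please correct me if you score differently). This is the minimal rouge_score computation I used as a reference point, with hypothetical example strings:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Hypothetical example pair; in my run, the hypothesis comes from the
# model's ROCStories completion and the reference from the dataset.
reference = "She passed the exam and celebrated with her friends."
hypothesis = "She celebrated with her friends after passing the exam."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)
for name, s in scores.items():
    print(f"{name}: F1={s.fmeasure:.4f}")
```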

Even the official model does NOT reproduce the paper's results when evaluated with your script!

Did you perhaps use a DIFFERENT evaluation methodology for the generative metrics (LAMBADA, ROCStories)? Could it be that the published model or evaluation scripts are not the same as the ones used in your experiments?
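
To make the question concrete, here is the last-word accuracy protocol I assumed for LAMBADA. It is a hypothetical sketch: `predict_last_word` is a stand-in for whatever decoding your script actually uses (for DiffuGPT, presumably the diffusion sampler).

```python
def lambada_accuracy(examples, predict_last_word):
    """Last-word accuracy for LAMBADA.

    examples: list of (context, target_word) pairs from the dataset.
    predict_last_word: callable mapping a context string to the model's
    predicted final word (hypothetical stand-in for the eval script's decoder).
    """
    correct = sum(
        predict_last_word(context).strip() == target.strip()
        for context, target in examples
    )
    return correct / len(examples)
```

If you instead score LAMBADA by exact match over whole-token continuations, or strip punctuation differently, that alone could explain part of the gap, so knowing the exact protocol would help.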

Could you please share working scripts for evaluating your HF model (DiffuGPT-S) that reproduce the results from the paper?
Thank you so much!
