Hello! Thank you so much for your work.
I tried to compute the metrics with the script DiffuLLaMA/evaluation/eval-diffugpt.py for the model you uploaded to HF: https://huggingface.co/diffusionfamily/diffugpt-s
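For reference, this is roughly how I load the released checkpoint before running the script. This is only a sketch: the choice of the GPT-2 tokenizer and of `trust_remote_code=True` is my assumption, and eval-diffugpt.py itself may load the model differently.

```python
# Minimal sketch of how I load the HF checkpoint (assumptions noted below).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "diffusionfamily/diffugpt-s"

# Assumption: DiffuGPT-S uses the standard GPT-2 BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Assumption: the repo ships custom modeling code, hence trust_remote_code=True;
# the official eval-diffugpt.py may construct the model in a different way.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
model.eval()
```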
The results did NOT match the results reported in the paper, especially for the generative metrics:
| Metric | Paper (DiffuGPT-S, 1M steps) | Official model (HuggingFace) |
| --- | --- | --- |
| LAMBADA | 21.6% | 8.19% ❌ |
| HellaSwag | 33.4% | 25.23% ❌ |
| WinoGrande | 50.8% | 51.07% ✅ |
| Social IQA | 37.0% | ? (timeout) |
| PIQA | 57.7% | ? (timeout) |
| ROCStories | 13.7/1.4/12.6 | 0.69/0.00/0.69 ❌❌❌ |
Even the official model does NOT reproduce the paper's results when evaluated with your script!
Maybe you used a DIFFERENT evaluation methodology for the generative metrics (LAMBADA, ROCStories)? Or could it be that the published model or evaluation scripts are not the same ones used in your experiments?
Could you please share working evaluation scripts for your HF model (DiffuGPT-S) that reproduce the results from the paper?
Thank you so much!