Hello! Thank you so much for your work.
I tried to compute the metrics with the script DiffuLLaMA/evaluation/eval-diffugpt.py for the model you uploaded to HF: https://huggingface.co/diffusionfamily/diffugpt-s
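For reference, this is roughly how I load the released checkpoint before running the script. This is only a sketch: the choice of the GPT-2 tokenizer and of `trust_remote_code=True` is my assumption, and eval-diffugpt.py itself may load the model differently.

```python
# Minimal sketch of how I load the HF checkpoint (assumptions noted below).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "diffusionfamily/diffugpt-s"

# Assumption: DiffuGPT-S uses the standard GPT-2 BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Assumption: the repo ships custom modeling code, hence trust_remote_code=True;
# the official eval-diffugpt.py may construct the model in a different way.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
model.eval()
```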
The results did NOT match the results reported in the paper, especially for the generative metrics:
| Metric | Paper (DiffuGPT-S, 1M steps) | Official model (HuggingFace) |
| --- | --- | --- |
| LAMBADA | 21.6% | 8.19% ❌ |
| HellaSwag | 33.4% | 25.23% ❌ |
| WinoGrande | 50.8% | 51.07% ✅ |
| Social IQA | 37.0% | ? (timeout) |
| PIQA | 57.7% | ? (timeout) |
| ROCStories | 13.7/1.4/12.6 | 0.69/0.00/0.69 ❌❌❌ |
Even the official model does NOT reproduce the paper's results when evaluated with your script!
Maybe you used a DIFFERENT evaluation methodology for the generative metrics (LAMBADA, ROCStories)? Or could it be that the published model or evaluation scripts are not the same ones used in your experiments?
Could you please share working evaluation scripts for your HF model (DiffuGPT-S) that reproduce the results from the paper?
Thank you so much!