I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's LM Evaluation Harness. These are my results:
| Model | SciQ | PIQA |
|---|---|---|
| forge-bio | 0.788 | |
| forge-che | 0.821 | |
| forge-eng | 0.793 | |
| forge-mat | 0.777 | |
| forge-phy | 0.761 | |
| forge-soc | 0.82 | |
| forge-s1 | 0.787 | |
| forge-s2 | 0.783 | |
| forge-s3 | 0.805 | |
| forge-s4 | 0.86 | |
| forge-m1 | 0.82 | |
| forge-m2 | **0.574** | **0.5577** |
| forge-l | **0.242** | |
The highlighted scores are much lower than the others, and much lower than what is expected from Table 8 of the paper. A quick check of the evaluation logs (data/eval/forge-m2) suggests that these are roughly the scores of the m2 checkpoint at iteration 1000, and probably of some very early checkpoint of forge-l.
I downloaded the checkpoints from the links in the README.md. I suspect that the Dropbox versions were somehow mixed up.
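One quick way to confirm a mix-up (the filenames below are hypothetical; adjust to whatever the Dropbox archives actually contain) would be to compare checksums of the downloaded weight files, since two genuinely different checkpoints should never hash to the same value:

```bash
# Hash the weight shards of the suspect checkpoints; identical digests across
# supposedly different models (or across re-downloads) would indicate a mix-up.
sha256sum forge-m2/pytorch_model*.bin forge-l/pytorch_model*.bin
```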
Command line:

```
lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
```
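For completeness, a minimal sketch of how I ran the full sweep, assuming each checkpoint has been downloaded into a local directory named after the model (the `--output_path` layout mirrors the data/eval/ logs mentioned above):

```bash
#!/usr/bin/env bash
# Run the SciQ task for every forge checkpoint listed in the table above.
# Assumes each checkpoint is an HF-format directory named after the model.
set -euo pipefail

models=(forge-bio forge-che forge-eng forge-mat forge-phy forge-soc
        forge-s1 forge-s2 forge-s3 forge-s4 forge-m1 forge-m2 forge-l)

for m in "${models[@]}"; do
  lm_eval --model hf \
          --model_args "pretrained=${m},parallelize=True" \
          --tasks sciq \
          --device cuda \
          --output_path "data/eval/${m}"
done
```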