Description
According to the reported results, the performance gap between the 4B and 14B models is relatively small on both benchmarks (GSM8K: 82.4 vs. 83.7; HumanEval+: 75.0 vs. 76.8), and the 8B model actually scores below the 4B on both. This seems inconsistent with the expected scaling benefits of larger models. Could the authors elaborate on why the 14B model does not demonstrate a more substantial improvement over the 4B model? In particular, is this behavior attributable to limitations in the training data, the optimization or scaling strategy, the inference settings (e.g., decoding or prompting; see the sketch after the numbers below), or the evaluation protocol itself?
| Benchmark   | 4B   | 8B   | 14B  |
|-------------|------|------|------|
| GSM8K       | 82.4 | 81.1 | 83.7 |
| HumanEval+  | 75.0 | 74.4 | 76.8 |
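To illustrate why decoding settings alone could account for differences of this size, here is a minimal sketch of greedy vs. sampled decoding on a GSM8K-style prompt. This is not the authors' evaluation harness; the model name, prompt, and generation parameters are all placeholder assumptions, and only standard `transformers` APIs are used.

```python
# Hypothetical sketch: greedy vs. sampled decoding on a GSM8K-style prompt.
# Model name and generation settings are illustrative, not the configuration
# behind the reported numbers.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/your-4b-model"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?\nA: Let's think step by step."
)

def generate(do_sample: bool, temperature: float = 1.0) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=do_sample,
        # Only pass a non-default temperature when sampling, to avoid warnings.
        temperature=temperature if do_sample else 1.0,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

def extract_answer(text: str) -> str | None:
    # Crude GSM8K-style answer extraction: take the last number in the output.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

# Greedy decoding is deterministic; sampling at temperature > 0 can change the
# final answer from run to run, so two evaluations of the same checkpoint can
# report different scores if their decoding settings differ.
print("greedy answer:", extract_answer(generate(do_sample=False)))
print("sampled answer:", extract_answer(generate(do_sample=True, temperature=0.7)))
```

On benchmarks of this size, a 1-2 point gap is within the variance one can introduce by toggling settings like these, which is why knowing the exact decoding and prompting configuration matters for interpreting the 4B/8B/14B comparison.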