Description
According to the reported results, the performance gap between the 4B and 14B models is relatively small on both benchmarks (GSM8K: 82.4 vs. 83.7; HumanEval+: 75.0 vs. 76.8), and the 8B model actually scores below the 4B on both. This seems inconsistent with the expected scaling benefits of larger models. Could the authors elaborate on why the 14B model does not demonstrate a more substantial improvement over the 4B model? In particular, is this behavior attributable to limitations in the training data, the optimization or scaling strategy, the inference settings (e.g., decoding or prompting; see the sketch after the numbers below), or the evaluation protocol itself?
| Benchmark   | 4B   | 8B   | 14B  |
|-------------|------|------|------|
| GSM8K       | 82.4 | 81.1 | 83.7 |
| HumanEval+  | 75.0 | 74.4 | 76.8 |
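To illustrate why decoding settings alone could account for differences of this size, here is a minimal sketch of greedy vs. sampled decoding on a GSM8K-style prompt. This is not the authors' evaluation harness; the model name, prompt, and generation parameters are all placeholder assumptions, and only standard `transformers` APIs are used.

```python
# Hypothetical sketch: greedy vs. sampled decoding on a GSM8K-style prompt.
# Model name and generation settings are illustrative, not the configuration
# behind the reported numbers.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/your-4b-model"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?\nA: Let's think step by step."
)

def generate(do_sample: bool, temperature: float = 1.0) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=do_sample,
        # Only pass a non-default temperature when sampling, to avoid warnings.
        temperature=temperature if do_sample else 1.0,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

def extract_answer(text: str) -> str | None:
    # Crude GSM8K-style answer extraction: take the last number in the output.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

# Greedy decoding is deterministic; sampling at temperature > 0 can change the
# final answer from run to run, so two evaluations of the same checkpoint can
# report different scores if their decoding settings differ.
print("greedy answer:", extract_answer(generate(do_sample=False)))
print("sampled answer:", extract_answer(generate(do_sample=True, temperature=0.7)))
```

On benchmarks of this size, a 1-2 point gap is within the variance one can introduce by toggling settings like these, which is why knowing the exact decoding and prompting configuration matters for interpreting the 4B/8B/14B comparison.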