
Some questions about results #21

@chenzehao82

Description

According to the reported results, the performance gap between the 4B and 14B models appears to be relatively small on both benchmarks (GSM8K: 82.4 vs. 83.7; HumanEval+: 75.0 vs. 76.8), and the 8B model actually scores slightly below the 4B on both. This seems somewhat inconsistent with the scaling benefits one would expect from larger models. Could the authors elaborate on why the 14B model does not show a more substantial improvement over the 4B model? In particular, is this behavior attributable to limitations in the training data, the optimization or scaling strategy, the inference settings (e.g., decoding or prompting), or the evaluation protocol itself?
GSM8K: 4B: 82.4, 8B: 81.1, 14B: 83.7
HumanEval+: 4B: 75.0, 8B: 74.4, 14B: 76.8
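
For concreteness, here is a minimal sketch (plain Python, using only the numbers quoted above) of the gaps these scores imply: roughly +1.3 and +1.8 points from 4B to 14B, with a slight regression at 8B on both benchmarks.

```python
# Quick check of the gaps implied by the scores quoted in this issue.
# (Numbers are copied from the comment above; nothing here comes from the repo.)
scores = {
    "GSM8K":      {"4B": 82.4, "8B": 81.1, "14B": 83.7},
    "HumanEval+": {"4B": 75.0, "8B": 74.4, "14B": 76.8},
}

for bench, by_size in scores.items():
    gap_14b = by_size["14B"] - by_size["4B"]  # 4B -> 14B improvement
    gap_8b = by_size["8B"] - by_size["4B"]    # 4B -> 8B change (negative here)
    print(f"{bench}: 4B->14B {gap_14b:+.1f} pts, 4B->8B {gap_8b:+.1f} pts")
```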
