About the reproduction issue. #12

@lqcStar

Hello, using your code, the model I trained scores well below the metrics reported in the paper. GPT-2 reaches only 37% accuracy on GSM8K when trained on GSM8K-AUG alone and then evaluated. Are there any training tricks involved, or is it necessary to train on all datasets before evaluating on GSM8K to reach the reported 43.7% accuracy?
Additionally, since the open-source GSM8K-AUG-NL release provides only a training set, how should I perform evaluation?
