Hello, when training with your code, my model achieves much worse metrics than those reported in the paper: GPT-2 reaches only 37% accuracy on GSM8K (trained solely on GSM8K-AUG and then evaluated on GSM8K). Could you clarify whether any training tricks are involved, or whether it is necessary to train on all datasets and then evaluate on GSM8K to reach the reported 43.7% accuracy?
Additionally, since the open-source GSM8K-AUG-NL provides only a training set, how should evaluation be performed on it?