Hello, when training with your code, my model achieves much worse metrics than those reported in the paper: GPT-2 reaches only 37% accuracy on GSM8K (trained solely on GSM8K-AUG and then evaluated on GSM8K). Could you clarify whether any training tricks are involved, or whether it is necessary to train on all datasets and then evaluate on GSM8K to reach the reported 43.7% accuracy?
Additionally, since the open-source GSM8K-AUG-NL provides only a training set, how should evaluation be performed on it?