Add extracting answers using LLMs #12
Open
I think the evaluation of Qwen2.5-1.5B before RL in the original code is unreasonable. The issue is that many answers cannot be extracted by rule-based methods because the model does not output in the required format. This makes the pre-RL scores too low to reflect the model's true math ability. DeepSeek-R1 uses GRPO to strengthen the model's chain-of-thought skills, not its ability to follow formatting rules, so with your original evaluation method a significant part of the improvement shown in the logs comes from Qwen learning to output in the specified format, which has nothing to do with the original intent of DeepSeek's RL method.
For example, in the solution to the problem <Ben has 8 apples more than Phillip does. Tom has three eighths as many apples at Ben has. If Phillip has 40 apples, how many apples does Tom have?>, shown as the 2nd question in your <GRPO_From_Scratch_Multi_GPU_DataParallel_Qwen_2_5_1_5B_Instruct.ipynb>, the answer is clearly correct but is marked as wrong because the required answer tag is missing.
To fix this, I opened this PR, which adds a model-based answer extraction method to better assess the correctness of answers. In my experiments, the accuracy of Qwen before RL on GSM8K reaches about 70%, not the roughly 20% you measured. This suggests the effect of GRPO on the model's reasoning ability may be overestimated under your experimental conditions. Some of my logs are attached below.
log.log
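
To illustrate the idea (this is a minimal sketch, not the exact code in this PR; the extractor model name, prompt wording, and helper names are placeholders), an LLM can be asked to pull the final numeric answer out of a free-form response, which is then compared against the GSM8K gold answer instead of relying on a rigid tag format:

```python
# Sketch of LLM-based answer extraction for GSM8K-style responses.
# Assumes an instruct model loaded via transformers; details are illustrative.
import re
from transformers import pipeline

extractor = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small instruct model can serve as the extractor
)

def extract_answer_with_llm(response: str) -> str | None:
    """Ask the extractor model to state only the final numeric answer in `response`."""
    prompt = (
        "Read the following solution and reply with only the final numeric answer, "
        "with no words or units.\n\nSolution:\n" + response + "\n\nFinal answer:"
    )
    out = extractor(prompt, max_new_tokens=32, return_full_text=False)[0]["generated_text"]
    match = re.search(r"-?\d+(?:\.\d+)?", out.replace(",", ""))
    return match.group(0) if match else None

def is_correct(response: str, gold: str) -> bool:
    """Mark the response correct if the extracted number matches the gold answer."""
    pred = extract_answer_with_llm(response)
    return pred is not None and float(pred) == float(gold)
```

With this kind of extraction, a response that states the right number but omits the answer tag (like the 2nd question above) is still counted as correct, so the pre-RL baseline reflects math ability rather than format compliance.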