Add extracting answers using LLMs #12
Open
I think the evaluation of Qwen2.5-1.5B before RL in the original code is unreasonable. The issue is that many answers cannot be extracted by rule-based methods because the model does not output in the required format. This makes the pre-RL scores too low to reflect the model's true math ability. DeepSeek-R1 uses GRPO to strengthen the model's chain-of-thought skills, not its ability to follow formatting rules, so with your original evaluation method a significant part of the improvement shown in the logs comes from Qwen learning to output in the specified format, which has nothing to do with the original intent of DeepSeek's RL method.
For example, in the solution to the problem <Ben has 8 apples more than Phillip does. Tom has three eighths as many apples at Ben has. If Phillip has 40 apples, how many apples does Tom have?>, shown as the 2nd question in your <GRPO_From_Scratch_Multi_GPU_DataParallel_Qwen_2_5_1_5B_Instruct.ipynb>, the answer is clearly correct but is marked as wrong because the required answer tag is missing.
To fix this, I opened this PR, which adds a model-based answer extraction method to better assess the correctness of answers. In my experiments, the accuracy of Qwen before RL on GSM8K reaches about 70%, not the roughly 20% you measured. This suggests the effect of GRPO on the model's reasoning ability may be overestimated under your experimental conditions. Some of my logs are attached below.
log.log
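
To illustrate the idea (this is a minimal sketch, not the exact code in this PR; the extractor model name, prompt wording, and helper names are placeholders), an LLM can be asked to pull the final numeric answer out of a free-form response, which is then compared against the GSM8K gold answer instead of relying on a rigid tag format:

```python
# Sketch of LLM-based answer extraction for GSM8K-style responses.
# Assumes an instruct model loaded via transformers; details are illustrative.
import re
from transformers import pipeline

extractor = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small instruct model can serve as the extractor
)

def extract_answer_with_llm(response: str) -> str | None:
    """Ask the extractor model to state only the final numeric answer in `response`."""
    prompt = (
        "Read the following solution and reply with only the final numeric answer, "
        "with no words or units.\n\nSolution:\n" + response + "\n\nFinal answer:"
    )
    out = extractor(prompt, max_new_tokens=32, return_full_text=False)[0]["generated_text"]
    match = re.search(r"-?\d+(?:\.\d+)?", out.replace(",", ""))
    return match.group(0) if match else None

def is_correct(response: str, gold: str) -> bool:
    """Mark the response correct if the extracted number matches the gold answer."""
    pred = extract_answer_with_llm(response)
    return pred is not None and float(pred) == float(gold)
```

With this kind of extraction, a response that states the right number but omits the answer tag (like the 2nd question above) is still counted as correct, so the pre-RL baseline reflects math ability rather than format compliance.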