Thanks for your great work! From the code, it appears that during training the answer is extracted directly with the regex r"<answer>(.*?)</answer>" and the reward is computed from accuracy alone; a format reward (e.g., one enforcing the <think></think><answer></answer> structure) does not seem to be incorporated into the reward function.
If this is the case, would applying RL directly to Qwen/Qwen2.5-7B without an explicit format reward reduce training efficiency or stability? (The kind of format-reward term I have in mind is sketched below.) Thanks!
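For concreteness, here is a minimal sketch of the two reward terms as I understand them, assuming simple 0/1 rewards; the function names and signatures are my own illustration, not taken from this repo:

```python
import re

# Hypothetical pattern for the full <think>...</think><answer>...</answer> layout.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Illustrative format reward: 1.0 if the completion follows the
    expected <think>...</think><answer>...</answer> structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Accuracy reward as I understand the current code: extract the
    answer with the regex and compare it against the ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```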