Question regarding reward calculation

Thanks for your great work! From the code, it appears that during training the answer is extracted directly with the regex `r"<answer>(.*?)</answer>"` and the reward is computed from accuracy alone. The format reward (e.g., enforcing the `<think></think><answer></answer>` structure) does not seem to be incorporated into the reward function.

If this is the case, would applying RL directly to `Qwen/Qwen2.5-7B` without an explicit format reward lead to lower training efficiency or stability? Thanks!
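To make the question concrete, here is a minimal sketch of the two reward terms I have in mind. It is not taken from the repository: the function names, the exact-match comparison, the `re.DOTALL` flag, and the `format_weight` value are all my own assumptions for illustration.

```python
import re

# Regex quoted in the question; re.DOTALL is an assumption so that
# multi-line answers are still captured.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Hypothetical format check: the whole response must be a <think> block
# followed by an <answer> block, with only whitespace around them.
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer equals the ground truth, else 0.0.

    Exact string match is an assumption; the real scorer may normalize
    or compare answers differently.
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the think/answer structure."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0


def combined_reward(
    response: str, ground_truth: str, format_weight: float = 0.1
) -> float:
    """Accuracy reward plus a small, hypothetical format bonus."""
    return accuracy_reward(response, ground_truth) + format_weight * format_reward(
        response
    )
```

With accuracy alone, a response such as `<answer>4</answer>` with no `<think>` block still receives the full reward, which is the behavior I am asking about.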


Question regarding reward calculation #151
