Effectiveness validation on other base models #47

Impressive and thorough work! Thank you especially for sharing the code and datasets.

However, have you considered running ablations on base models such as Llama 3 and Qwen2.5/Qwen3 (instead of the distilled versions)?

You may also have seen a blog post arguing that many recent RL papers overstate their effectiveness because of characteristics specific to the Qwen models (https://safe-lip-9a8.notion.site/Incorrect-Baseline-Evaluations-Call-into-Question-Recent-LLM-RL-Claims-2012f1fbf0ee8094ab8ded1953c15a37).
The distilled versions may complicate matters further, as they were trained on a private dataset from DeepSeek.
