Update evals for 5 envs #229
base: main
Conversation
dmahan93 left a comment
next time please break this into individual PRs; fix the `'eval'` removal and it's good to merge
```python
async def rollout_and_score_eval(self, question: str, answer: str) -> dict:
    """Rollout and score evaluation with detailed sample data collection."""
    eval_temperature = 0.0
```
this seems a bit weird; want to just make `eval_temperature` a base config param?
I do it in mine (though mine are not usually 0.0). I think this is good, though, since with reasoning models you want a different eval temp than with non-reasoners, or with the weird 3.0+ temp RL'ed models, etc.
yeah, I don't disagree, but if we're changing stuff here we may as well make it settable by the user
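A minimal sketch of what this suggestion might look like, assuming the environments share a dataclass-style base config; the class name, existing field, and defaults here are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass

# Hypothetical base config; only eval_temperature is the proposed addition.
@dataclass
class BaseEnvConfig:
    max_generation_tokens: int = 2048  # assumed existing field
    eval_temperature: float = 0.0      # user-settable eval temperature,
                                       # e.g. higher for reasoning models
```

With a field like this, each environment's eval path can read `self.config.eval_temperature` instead of hardcoding `0.0`, which is what the diff below does.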
```python
n=1,
max_tokens=self.config.max_generation_tokens,
temperature=self.config.eval_temperature,
split="eval",
```
don't do this
PR Type
📝 General Information
Description
Add file-based eval logging to five environments: answer format, instruction following, tool calls, reasoning gym, and letter counting.
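As a rough illustration of what file-based eval logging can look like, here is a hedged sketch assuming each eval rollout produces a dict of sample data. The helper name and file layout are assumptions; only the idea of a save directory (the `--env.data_dir_to_save_evals` option in the testing command below) comes from this PR.

```python
import json
import os

def save_eval_samples(samples: list[dict], data_dir: str) -> None:
    """Append one JSON object per eval sample to a JSONL file under data_dir.

    Hypothetical helper: the actual environments may name and structure
    this differently.
    """
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, "eval_samples.jsonl")
    with open(path, "a") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
```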
Testing
```bash
python tool_calling_server.py evaluate \
  --env.max_eval_samples $MAX_SAMPLES \
  --env.data_dir_to_save_evals tc_test \
  --openai.base_url "$BASE_URL" \
  --openai.model_name "$MODEL_NAME" \
  --openai.api_key "$OPENAI_API_KEY"
```

(replacing `tool_calling_server.py` with any of the other four servers as well)