Conversation

@hjc-puro (Contributor) commented Aug 6, 2025

PR Type

  • [x] RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • [ ] Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

Adds file-based eval logging to five environments: answer format, instruction following, tool calling, reasoning gym, and letter counting.

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update
  • [ ] Code refactor (no functional changes)
  • [ ] Build/CI/CD related changes
  • [ ] Other (please describe):

Testing

```shell
python tool_calling_server.py evaluate \
    --env.max_eval_samples "$MAX_SAMPLES" \
    --env.data_dir_to_save_evals tc_test \
    --openai.base_url "$BASE_URL" \
    --openai.model_name "$MODEL_NAME" \
    --openai.api_key "$OPENAI_API_KEY"
```

(Replace tool_calling_server.py with any of the other four server scripts to run the same evaluation against those environments.)
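The same invocation can be looped over all five servers. This is a sketch only: the PR text names just tool_calling_server.py, so the other four filenames below are guesses from the environment names and must be replaced with the repo's actual script names. The leading `echo` makes it a dry run that prints each command; drop it to actually execute.

```shell
# Hypothetical filenames for the other four environment servers (assumption,
# not from the PR); each run writes its evals to its own directory.
for server in tool_calling_server.py answer_format_server.py \
              instruction_following_server.py reasoning_gym_server.py \
              letter_counting_server.py; do
    echo python "$server" evaluate \
        --env.max_eval_samples "$MAX_SAMPLES" \
        --env.data_dir_to_save_evals "${server%_server.py}_test" \
        --openai.base_url "$BASE_URL" \
        --openai.model_name "$MODEL_NAME" \
        --openai.api_key "$OPENAI_API_KEY"
done
```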

@hjc-puro hjc-puro changed the title Add evals to 5 envs [draft] Add evals to 5 envs Aug 6, 2025
@hjc-puro hjc-puro changed the title [draft] Add evals to 5 envs Add evals to 5 envs Aug 6, 2025
@hjc-puro hjc-puro changed the title Add evals to 5 envs Update evals for 5 envs Aug 6, 2025
@hjc-puro hjc-puro requested a review from teknium1 August 6, 2025 22:07
@dmahan93 (Collaborator) left a comment

Next time, please break this into individual PRs. Fix the 'eval' removal and it's good to merge.


```python
async def rollout_and_score_eval(self, question: str, answer: str) -> dict:
    """Rollout and score evaluation with detailed sample data collection."""
    eval_temperature = 0.0
```
Collaborator commented:

this seems a bit weird; want to just make eval_temperature a base config param?

Collaborator commented:

I do it in mine too (though mine are usually not 0.0). I think this is good, since with reasoning models you want a different eval temperature than with non-reasoners, or with a model RL'ed at weird 3.0+ temperatures, etc.

Collaborator commented:

Yeah, I don't disagree, but if we're changing stuff here we may as well make it settable by the user.
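The suggestion above (promote the hard-coded `eval_temperature = 0.0` to a user-settable config field) could look like this minimal sketch. The `BaseEnvConfig` class and its other fields are assumptions for illustration, not the repo's actual config API; only `eval_temperature` and `max_generation_tokens` come from the diff.

```python
from dataclasses import dataclass


@dataclass
class BaseEnvConfig:
    # Hypothetical base config class; the real project's config is richer.
    max_generation_tokens: int = 1024
    # The previously hard-coded 0.0 becomes a field the user can override,
    # e.g. a higher temperature for reasoning models.
    eval_temperature: float = 0.0


# Default keeps the old behavior; users can opt into another value.
cfg = BaseEnvConfig(eval_temperature=0.6)
```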

```python
n=1,
max_tokens=self.config.max_generation_tokens,
temperature=self.config.eval_temperature,
split="eval",
```
Collaborator commented:

don't do this

4 participants