Conversation

@hjc-puro (Contributor) commented Aug 6, 2025

PR Type

  • [x] RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • [ ] Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

Adds file-based eval logging to five environments: answer format, instruction following, tool calling, reasoning gym, and letter counting.

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update
  • [ ] Code refactor (no functional changes)
  • [ ] Build/CI/CD related changes
  • [ ] Other (please describe):

Testing

```shell
python tool_calling_server.py evaluate \
    --env.max_eval_samples "$MAX_SAMPLES" \
    --env.data_dir_to_save_evals tc_test \
    --openai.base_url "$BASE_URL" \
    --openai.model_name "$MODEL_NAME" \
    --openai.api_key "$OPENAI_API_KEY"
```

(Replace tool_calling_server.py with any of the other four server scripts to run the same evaluation against those environments.)
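The same invocation can be looped over all five servers. This is a sketch only: the PR text names just tool_calling_server.py, so the other four filenames below are guesses from the environment names and must be replaced with the repo's actual script names. The leading `echo` makes it a dry run that prints each command; drop it to actually execute.

```shell
# Hypothetical filenames for the other four environment servers (assumption,
# not from the PR); each run writes its evals to its own directory.
for server in tool_calling_server.py answer_format_server.py \
              instruction_following_server.py reasoning_gym_server.py \
              letter_counting_server.py; do
    echo python "$server" evaluate \
        --env.max_eval_samples "$MAX_SAMPLES" \
        --env.data_dir_to_save_evals "${server%_server.py}_test" \
        --openai.base_url "$BASE_URL" \
        --openai.model_name "$MODEL_NAME" \
        --openai.api_key "$OPENAI_API_KEY"
done
```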

@hjc-puro hjc-puro changed the title Add evals to 5 envs [draft] Add evals to 5 envs Aug 6, 2025
@hjc-puro hjc-puro changed the title [draft] Add evals to 5 envs Add evals to 5 envs Aug 6, 2025
@hjc-puro hjc-puro changed the title Add evals to 5 envs Update evals for 5 envs Aug 6, 2025
@hjc-puro hjc-puro requested a review from teknium1 August 6, 2025 22:07
@dmahan93 (Collaborator) left a comment

Next time, please break this into individual PRs. Fix the 'eval' removal and it's good to merge.


```python
async def rollout_and_score_eval(self, question: str, answer: str) -> dict:
    """Rollout and score evaluation with detailed sample data collection."""
    eval_temperature = 0.0
```
Collaborator commented:

this seems a bit weird; want to just make eval_temperature a base config param?

Collaborator commented:

I do it in mine too (though mine are usually not 0.0). I think this is good, since with reasoning models you want a different eval temperature than with non-reasoners, or with a model RL'ed at weird 3.0+ temperatures, etc.

Collaborator commented:

Yeah, I don't disagree, but if we're changing stuff here we may as well make it settable by the user.
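The suggestion above (promote the hard-coded `eval_temperature = 0.0` to a user-settable config field) could look like this minimal sketch. The `BaseEnvConfig` class and its other fields are assumptions for illustration, not the repo's actual config API; only `eval_temperature` and `max_generation_tokens` come from the diff.

```python
from dataclasses import dataclass


@dataclass
class BaseEnvConfig:
    # Hypothetical base config class; the real project's config is richer.
    max_generation_tokens: int = 1024
    # The previously hard-coded 0.0 becomes a field the user can override,
    # e.g. a higher temperature for reasoning models.
    eval_temperature: float = 0.0


# Default keeps the old behavior; users can opt into another value.
cfg = BaseEnvConfig(eval_temperature=0.6)
```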

```python
n=1,
max_tokens=self.config.max_generation_tokens,
temperature=self.config.eval_temperature,
split="eval",
```
Collaborator commented:

don't do this

4 participants