@snimu (Contributor) commented Jan 14, 2026

Description

Show metrics during vf-eval rollout. On a narrow screen:

image

On a wide screen (before the first results come in):

image

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Introduces a richer progress display for vf-eval with multi-environment awareness and live metric tracking.

  • Adds verifiers/utils/progress_utils.py with MultiEnvProgress and RolloutProgress to render per-env progress and rolling averages of reward and rubric metrics; includes TTY detection via can_render_rich_progress() and falls back to tqdm.
  • Updates Environment.generate to precompute per-task totals, discover metric names via env_map or fallback, initialize rich progress (multi-env or single), call update(states) as tasks complete, and cleanly start/stop; removes manual reward averaging/postfix logic.
  • Minor typing/cleanup adjustments in data_utils.py and message_utils.py (casts, removed type ignores).
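As a rough illustration of the TTY-detection fallback described above, the sketch below picks a progress backend based on whether the output stream is an interactive terminal. The helper names `stream_is_interactive` and `choose_progress_backend` are hypothetical and are not taken from this PR; the actual `can_render_rich_progress()` may check additional conditions.

```python
import io
import sys
from typing import Optional, TextIO


def stream_is_interactive(stream: Optional[TextIO] = None) -> bool:
    """Hypothetical TTY check; not the PR's can_render_rich_progress()."""
    stream = sys.stdout if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    if not callable(isatty):
        return False
    try:
        return bool(isatty())
    except ValueError:
        # isatty() raises ValueError when the stream is already closed.
        return False


def choose_progress_backend(stream: Optional[TextIO] = None) -> str:
    """Return "rich" for interactive terminals, otherwise "tqdm"."""
    return "rich" if stream_is_interactive(stream) else "tqdm"


if __name__ == "__main__":
    # A StringIO buffer is never a TTY, so this selects the tqdm fallback.
    print(choose_progress_backend(io.StringIO()))
```

Keeping the decision in one small function makes the non-TTY path (CI logs, piped output) easy to exercise in tests without a real terminal.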

Written by Cursor Bugbot for commit 6d3a02a. This will update automatically on new commits.

- if pbar is not None:
+ if rich_progress is not None:
+     rich_progress.update(states)
+ elif pbar is not None:

Tqdm fallback no longer updates rolling average reward

Medium Severity

When the rich progress display cannot render (e.g., non-TTY environments like CI/CD), the code falls back to tqdm. The tqdm progress bar is initialized with postfix=dict(reward="?"), but the loop only calls pbar.update(1) without ever calling pbar.set_postfix() to update the reward display. The old code tracked reward_sum and reward_count and updated the postfix with the rolling average reward. Now users in non-TTY environments see "reward=?" throughout the entire evaluation, which is misleading and a regression from the previous behavior.
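One possible way to restore the reward display in the fallback path is sketched below. It assumes the bar object exposes `update(n)` and `set_postfix(**kwargs)`, as tqdm bars do; `RunningMean` and `update_fallback_bar` are hypothetical names for illustration and are not the project's code.

```python
from typing import Any, Optional


class RunningMean:
    """Tracks a cumulative mean from a running sum and count."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def add(self, value: Optional[float]) -> None:
        # Skip missing rewards so they do not skew the average.
        if value is None:
            return
        self.total += float(value)
        self.count += 1

    @property
    def mean(self) -> Optional[float]:
        return self.total / self.count if self.count else None


def update_fallback_bar(pbar: Any, tracker: RunningMean, reward: Optional[float]) -> None:
    """Advance the bar by one task and refresh the reward postfix.

    `pbar` is assumed to be tqdm-like: it needs update() and set_postfix().
    """
    tracker.add(reward)
    pbar.update(1)
    mean = tracker.mean
    # Keep the "?" placeholder until at least one reward has been seen.
    pbar.set_postfix(reward="?" if mean is None else f"{mean:.3f}")
```

With tqdm installed, this would be called once per completed task in the `elif pbar is not None:` branch, so the postfix moves off `reward=?` as soon as the first reward arrives.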


@snimu closed this Jan 15, 2026