Improve vf-eval progress display for multi-env metrics #731

snimu · 2026-01-14T21:05:41Z

Description

Show metrics during vf-eval rollout. On a narrow screen:

On a wide screen (before the first results come in):

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Note

Introduces a richer progress display for vf-eval with multi-environment awareness and live metric tracking.

- Adds verifiers/utils/progress_utils.py with MultiEnvProgress and RolloutProgress to render per-env progress and rolling averages of reward and rubric metrics; includes TTY detection via can_render_rich_progress() and falls back to tqdm.
- Updates Environment.generate to precompute per-task totals, discover metric names via env_map or fallback, initialize rich progress (multi-env or single), call update(states) as tasks complete, and cleanly start/stop; removes manual reward averaging/postfix logic.
- Minor typing/cleanup adjustments in data_utils.py and message_utils.py (casts, removed type ignores).
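
To make the mechanics above concrete, here is a minimal sketch of the two ideas involved: checking whether the output stream is a TTY, and accumulating per-metric averages from completed states. It is not the code in progress_utils.py. The names stdout_is_tty and RunningMeans are hypothetical, the state layout (a dict with a "reward" key and a "metrics" dict) is an assumption, and the average shown is a simple running mean rather than whatever windowing the PR actually uses.

```python
import sys
from collections import defaultdict


def stdout_is_tty(stream=None) -> bool:
    """Return True if the stream looks like an interactive terminal.

    Hypothetical stand-in for the PR's can_render_rich_progress();
    the real check may consider more than isatty().
    """
    stream = sys.stdout if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


class RunningMeans:
    """Accumulate a running mean for reward and each named metric.

    Assumes each state is a dict shaped like
    {"reward": float, "metrics": {name: float}} (an assumption, not
    the library's documented state type).
    """

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, states):
        for state in states:
            values = {"reward": state.get("reward")}
            values.update(state.get("metrics") or {})
            for name, value in values.items():
                if value is None:
                    continue  # skip metrics that were not computed
                self._sums[name] += float(value)
                self._counts[name] += 1

    def means(self):
        return {name: self._sums[name] / self._counts[name] for name in self._sums}
```

A caller would pick the rich display when stdout_is_tty() is true and the tqdm fallback otherwise, feeding each batch of completed states to update() in either case.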

Written by Cursor Bugbot for commit 6d3a02a. This will update automatically on new commits.

cursor · 2026-01-15T13:30:53Z

verifiers/envs/environment.py

-                if pbar is not None:
+                if rich_progress is not None:
+                    rich_progress.update(states)
+                elif pbar is not None:


Tqdm fallback no longer updates rolling average reward

Medium Severity

When the rich progress display cannot render (e.g., non-TTY environments like CI/CD), the code falls back to tqdm. The tqdm progress bar is initialized with postfix=dict(reward="?"), but the loop only calls pbar.update(1) without ever calling pbar.set_postfix() to update the reward display. The old code tracked reward_sum and reward_count and updated the postfix with the rolling average reward. Now users in non-TTY environments see "reward=?" throughout the entire evaluation, which is misleading and a regression from the previous behavior.

Additional Locations (1)

verifiers/envs/environment.py#L961-L962
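
One way the fallback path could keep the reward postfix current is sketched below. This is an illustration of the issue described in the review comment, not the fix adopted in the PR. RewardPostfix and demo are hypothetical names; reward_sum and reward_count mirror the variables the comment attributes to the old code. The helper only relies on tqdm's update() and set_postfix() methods.

```python
class RewardPostfix:
    """Track a running mean reward and push it to a tqdm-style bar.

    Hypothetical helper: works with any object exposing update(n)
    and set_postfix(**kwargs), which tqdm progress bars provide.
    """

    def __init__(self):
        self.reward_sum = 0.0
        self.reward_count = 0

    def advance(self, pbar, rewards):
        """Record the rewards of one completed task and tick the bar once."""
        for reward in rewards:
            if reward is not None:
                self.reward_sum += float(reward)
                self.reward_count += 1
        pbar.update(1)
        if self.reward_count > 0:
            # Replace the initial "?" placeholder with the current mean.
            mean = self.reward_sum / self.reward_count
            pbar.set_postfix(reward=f"{mean:.3f}")


def demo():
    """Usage with a real tqdm bar (requires tqdm to be installed)."""
    from tqdm import tqdm

    tracker = RewardPostfix()
    with tqdm(total=3, postfix=dict(reward="?")) as pbar:
        for rewards in ([1.0], [0.0, 1.0], [0.5]):
            tracker.advance(pbar, rewards)
```

With this shape, the elif pbar is not None branch would call tracker.advance(pbar, ...) instead of a bare pbar.update(1), so non-TTY users see the mean reward rather than "reward=?" for the whole run.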

snimu and others added 4 commits January 14, 2026 22:02

Improve vf-eval progress display for multi-env metrics

86d3402

ruff / ty

ee60766

ty fix

b318eab

Remove reward from progress bar

6d3a02a

cursor bot reviewed Jan 15, 2026

snimu closed this Jan 15, 2026


3 participants
