
Conversation

@erikqu erikqu commented Jan 9, 2026

PR Type

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

📝 General Information

Description

This PR completes the verifiers environment implementation started in #258 by adding the missing training methods and fixing issues identified in review.

Related Issues

Supersedes/completes #258

Changes

environments/verifiers_server.py

  • Added collect_trajectories method for the training loop - generates multiple completions per question
  • Added score method for reward assignment using the Verifiers rubric system
  • Added wandb_log override for tracking train/percent_correct metric
  • Added name class attribute for proper CLI integration
  • Added percent_correct_buffer for training metrics tracking
  • Fixed imports - added Item, tokenize_for_trainer, random; removed unused imports
  • Replaced debug main() function with proper VerifiersEnv.cli() entry point
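
The `percent_correct_buffer` / `wandb_log` pattern listed above can be sketched roughly as follows. This is a hypothetical illustration of the tracking logic, not the PR's exact code; the class, method, and threshold names here are assumptions.

```python
# Sketch of the percent_correct tracking pattern described above.
# Names (record_score, wandb_metrics, threshold) are illustrative assumptions.
class MetricsTracker:
    def __init__(self) -> None:
        self.percent_correct_buffer: list = []

    def record_score(self, reward: float, threshold: float = 0.5) -> None:
        # Count a rollout as "correct" when its reward clears the threshold
        self.percent_correct_buffer.append(1.0 if reward >= threshold else 0.0)

    def wandb_metrics(self) -> dict:
        # Average the buffer into train/percent_correct, then reset it
        if not self.percent_correct_buffer:
            return {}
        metrics = {
            "train/percent_correct": sum(self.percent_correct_buffer)
            / len(self.percent_correct_buffer)
        }
        self.percent_correct_buffer.clear()
        return metrics
```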

pyproject.toml

  • Moved verifiers>=0.1.5.post0 from core dependencies to optional [verifiers] group (install with pip install atroposlib[verifiers])
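
The optional-dependency move would look roughly like this in `pyproject.toml` (a sketch of the standard PEP 621 shape, not necessarily the file's exact contents):

```toml
[project.optional-dependencies]
verifiers = ["verifiers>=0.1.5.post0"]
```

After this, `pip install atroposlib[verifiers]` pulls in the extra, while a plain `pip install atroposlib` does not.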

🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | verifiers |
| Short Description | Integration with the Verifiers/Prime framework for structured reward evaluation |
| Category | RL Environment |
| Dataset Needed? | Via Verifiers (e.g., `prime env install will/wordle`) |
| External Deps | `verifiers>=0.1.5.post0` (optional) |
| Environment Variables | `OPENAI_API_KEY` |
| Compute Footprint Estimate | Depends on the underlying Verifiers environment |

✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • Docstrings added for all new public classes / functions

🤖 Generated with Claude Code

PR Link: #304

cdreetz and others added 9 commits October 9, 2025 23:48
- Add collect_trajectories method for training loop
- Add score method for reward assignment using Verifiers rubric
- Add wandb_log override for tracking percent_correct
- Move verifiers from core to optional dependencies
- Add name class attribute for CLI integration
- Replace debug main() with proper CLI entry point
- Fix unused imports and add required imports (Item, tokenize_for_trainer)
- Add percent_correct_buffer for training metrics

Co-Authored-By: Claude <noreply@anthropic.com>
@erikqu erikqu marked this pull request as draft January 9, 2026 08:18

erikqu commented Jan 9, 2026

Tests cover:
- VfEnvConfig configuration and inheritance
- VerifiersEnv initialization and config_init
- setup() method dataset loading
- get_next_item() iteration and wrapping
- score() method with rubric integration
- collect_trajectories() API call generation
- wandb_log() percent_correct calculation
- evaluate() test set evaluation
- Full training loop integration test

All 15 tests pass with mocked verifiers library.

Co-Authored-By: Claude <noreply@anthropic.com>
@erikqu erikqu force-pushed the verifiers-complete branch from 5bc19c5 to c00f126 Compare January 9, 2026 08:21
@erikqu erikqu changed the title Verifiers complete Integrate Prime Intellect Env Hub Jan 9, 2026
erikqu and others added 4 commits January 9, 2026 00:40
- Update to use _get_reward_funcs() and _get_reward_weights() (private API)
- Add _call_reward_func helper to call reward functions with correct signatures
- Relax math-verify constraint from ==0.7.0 to >=0.7.0 for compatibility
- Update tests to mock reward functions correctly
- Tested with PrimeIntellect wordle environment
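
The `_call_reward_func` helper mentioned above can be sketched with `inspect.signature`: reward functions in Verifiers rubrics declare differing parameters, so the caller forwards only the keyword arguments each function actually accepts. This is an illustrative sketch of the technique, not the PR's exact helper.

```python
import inspect

# Illustrative _call_reward_func-style helper: inspect the reward function's
# signature and pass only the kwargs it declares.
def call_reward_func(func, **available):
    params = inspect.signature(func).parameters
    # If the function takes **kwargs, forward everything
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(**available)
    kwargs = {name: value for name, value in available.items() if name in params}
    return func(**kwargs)
```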

The verifiers library >= 0.1.9 changed their public API:
- get_reward_funcs() -> _get_reward_funcs()
- get_reward_weights() -> _get_reward_weights()
- call_reward_func() removed, reward functions called directly

Co-Authored-By: Claude <noreply@anthropic.com>
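
The compatibility layer for the rename above could be sketched as a `getattr` fallback. The attribute names come from the commit message; the helper functions themselves are illustrative, not the PR's exact code.

```python
# Hypothetical compatibility shim across the verifiers 0.1.9 API rename:
# prefer the old public name, fall back to the new private one.
def get_reward_funcs(rubric):
    getter = getattr(rubric, "get_reward_funcs", None)  # verifiers < 0.1.9
    if getter is None:
        getter = getattr(rubric, "_get_reward_funcs")   # verifiers >= 0.1.9
    return getter()

def get_reward_weights(rubric):
    getter = getattr(rubric, "get_reward_weights", None)
    if getter is None:
        getter = getattr(rubric, "_get_reward_weights")
    return getter()
```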
- Add import guard for verifiers optional dependency
- Add compatibility layer for public/private API (verifiers >= 0.1.9)
- Fix response_messages to use list instead of tuple
- Add logging for exception handlers in reward functions
- Make reward_threshold configurable in VfEnvConfig
- Normalize messages to list before tokenize_for_trainer
- Add defensive checks for completion response (empty choices)
- Update tests for new reward_threshold config

Co-Authored-By: Claude <noreply@anthropic.com>
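
The defensive check for empty completion choices mentioned above amounts to roughly this (an illustrative sketch, not the PR's exact code):

```python
from types import SimpleNamespace

# Defensive accessor: an API completion may come back with no choices,
# in which case the caller should skip or retry rather than index into it.
def first_choice_content(completion):
    choices = getattr(completion, "choices", None) or []
    if not choices:
        return None
    return choices[0].message.content
```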
When a verifiers environment is not found, automatically attempt to
install it using the prime CLI. This makes it easier to use environments
without manual installation steps.

Co-Authored-By: Claude <noreply@anthropic.com>
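
The auto-install fallback described above can be sketched as: try to load, and on failure shell out to the prime CLI and retry. The loader/installer are injected here so the sketch stays testable; the real code would call `vf.load_environment`, and the `prime env install` invocation shape is taken from this PR's discussion.

```python
import subprocess

# Illustrative auto-install fallback: if the environment isn't found,
# install it via the prime CLI and try loading again.
def load_with_auto_install(env_name, loader, installer=None):
    try:
        return loader(env_name)
    except Exception:
        if installer is None:
            subprocess.run(["prime", "env", "install", env_name], check=True)
        else:
            installer(env_name)
        return loader(env_name)
```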

erikqu commented Jan 9, 2026

  • assumes you did prime login
  • Prime Intellect hub envs are referenced as owner/env
  • e.g.

```
uv run environments/verifiers_server.py serve \
  --env.vf_env_name will/wordle \
  --env.use_wandb False \
  --env.wandb_name wordle \
  --openai.api_key $OPENAI_API_KEY \
  --openai.model_name gpt-4.1-mini
```

then

Screen.Recording.2026-01-09.at.1.14.26.AM.mov

- Parse env_name to extract module name for loading (e.g., "will/wordle" -> "wordle")
- Use full name for prime CLI install, module name for vf.load_environment
- Add helpful error message when short format is used without install

Co-Authored-By: Claude <noreply@anthropic.com>
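
The name parsing in that commit can be sketched as follows: the full `owner/env` form is kept for `prime env install`, while only the module name goes to `vf.load_environment`. The helper name is an illustrative assumption.

```python
# Illustrative parse of a hub env name: "will/wordle" -> ("will", "wordle"),
# bare "wordle" -> (None, "wordle").
def split_env_name(env_name: str):
    if "/" in env_name:
        owner, module = env_name.split("/", 1)
    else:
        owner, module = None, env_name
    return owner, module
```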
@erikqu erikqu marked this pull request as ready for review January 9, 2026 09:10

teknium1 commented Jan 9, 2026

pinging @cdreetz for visibility as well


teknium1 commented Jan 9, 2026

@erikqu can you look into sft-datagen to see if you can gen data with 4.1 that gets verified by the env you're using from the Prime hub in the meantime

erikqu and others added 2 commits January 9, 2026 09:47
- Detect multi-turn envs via hasattr(vf_env, "env_response")
- Add _collect_multi_turn_trajectories using vf_env.rollout()
- Add _rollout_and_score_eval_multi_turn for evaluation
- Use rubric.score_rollout(state) for proper multi-turn scoring
- Add _get_vf_client() for AsyncOpenAI client creation
- Use pre-computed reward from state for multi-turn in score()
- Add timeout handling (120s) for multi-turn rollouts
- Improve logging for debugging API issues

Multi-turn environments like wordle now use the verifiers rollout
mechanism which properly handles the interaction loop and state-aware
reward functions.

Co-Authored-By: Claude <noreply@anthropic.com>
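
The multi-turn handling above can be sketched with two small helpers; the `hasattr` check and the 120 s timeout come from the commit message, while the helper names are illustrative.

```python
import asyncio

# Multi-turn detection heuristic from the commit message: multi-turn
# Verifiers environments expose an env_response attribute.
def is_multi_turn(vf_env) -> bool:
    return hasattr(vf_env, "env_response")

# Timeout wrapper for a multi-turn rollout coroutine (default 120 s).
async def rollout_with_timeout(rollout_coro, timeout_s: float = 120.0):
    return await asyncio.wait_for(rollout_coro, timeout=timeout_s)
```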
The verifiers rollout() expects a RolloutInput TypedDict with:
- prompt (messages list)
- example_id (int)
- task (str)
- answer (str)
- info (dict)

Was incorrectly passing 'question' field instead of the correct structure.
Also added WARNING level logs for better debugging visibility.

Co-Authored-By: Claude <noreply@anthropic.com>
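
The field list above can be written as a TypedDict sketch of what `rollout()` expects. The fields are taken from the commit message; this is not copied from the verifiers source.

```python
from typing import TypedDict

# Sketch of the RolloutInput structure described in the commit message.
class RolloutInput(TypedDict):
    prompt: list      # messages list
    example_id: int
    task: str
    answer: str
    info: dict
```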
@erikqu erikqu marked this pull request as draft January 9, 2026 18:37

erikqu commented Jan 9, 2026

Put this back into a draft; bugs with SFT.

@erikqu erikqu marked this pull request as ready for review January 9, 2026 19:38

erikqu commented Jan 9, 2026

erikqu and others added 2 commits January 9, 2026 11:42
- Use real functions instead of MagicMock for reward functions to ensure
  proper signature inspection
- Add vf_env attribute with correct class name for single-turn detection
- Set ensure_scores_are_not_same config in tests

Co-Authored-By: Claude <noreply@anthropic.com>
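
Why real functions instead of `MagicMock` here: `inspect.signature` needs a genuine signature to decide which kwargs a reward function accepts, and a bare mock does not reliably provide one. A small factory like this (an illustrative sketch, not the PR's test code) keeps the stubs inspectable:

```python
import inspect

# Real (non-mock) stub reward function so signature inspection works in tests.
def make_stub_reward(value: float = 1.0):
    def reward(completion, answer):
        return value
    return reward
```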
@teknium1

Wait I meant doing an environment from Verifiers (not atropos), when you run sft-datagen set a wandb run name and login to wandb, and then share that run please


erikqu commented Jan 10, 2026

> Wait I meant doing an environment from Verifiers (not atropos), when you run sft-datagen set a wandb run name and login to wandb, and then share that run please

Oh sorry, I didn't specify that it was with the Prime Intellect gsm8k env; didn't think to since you guys obviously already have it. Just a test run, e.g. (ignore the wandb name):

Screenshot 2026-01-09 at 5 53 34 PM


erikqu commented Jan 10, 2026

> Wait I meant doing an environment from Verifiers (not atropos), when you run sft-datagen set a wandb run name and login to wandb, and then share that run please

but sure I can share one later!


erikqu commented Jan 10, 2026

@teknium1 can't seem to share the wandb project, maybe dm me your email or something
