A bridge to convert Inspect AI tasks into Verifiers environments for RL training with prime-rl.
Inspect AI is a framework for evaluating LLMs with a rich ecosystem of evaluation tasks. This bridge allows you to:
- Import existing Inspect tasks and train on them with prime-rl
- Preserve Inspect scoring semantics as Verifiers reward functions
- Support sandbox-based scoring (Docker, local) for code execution tasks
- Run multi-turn agentic environments with bash and submit tools
- Convert Inspect datasets to HuggingFace datasets
uv add inspect-verifiers-bridge

Or, for development:
git clone <repo>
cd inspect-verifiers-bridge
uv sync

from inspect_evals.humaneval import humaneval
from inspect_verifiers_bridge import load_environment
# Load an Inspect task as a Verifiers environment
env = load_environment(
    humaneval,
    env_type="single_turn",  # or "multi_turn" for agentic tasks
    scoring_mode="live",     # Use Inspect's native scorers
    sandbox_type="local",    # Use local sandbox for code execution
    max_samples=100,         # Limit dataset size
)
# The environment is ready for training
print(f"Environment: {type(env).__name__}")
print(f"Dataset size: {len(env.dataset)}")The examples/ directory contains vf-eval compatible scripts for each environment type:
# Single turn, no sandbox (GSM8K math reasoning)
vf-eval gsm8k_example -p examples/ -m gpt-4o-mini -n 10
# Single turn with sandbox (HumanEval code generation)
vf-eval humaneval_example -p examples/ -m gpt-4o-mini -n 10
# Multi-turn with tools (HumanEval agentic)
vf-eval humaneval_multiturn_example -p examples/ -m gpt-4o-mini -n 5

Each example exports a load_environment() function that vf-eval can call:
| Example | Environment | Tools | Use Case |
|---|---|---|---|
| `gsm8k_example.py` | `SingleTurnEnv` | None | Math reasoning |
| `humaneval_example.py` | `InspectSandboxEnv` | `_bash` | Code generation |
| `humaneval_multiturn_example.py` | `InspectSandboxEnv` | `_bash`, `_submit` | Agentic coding |
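For instance, an example module might look roughly like the sketch below. This is illustrative only, not the actual file contents; the real examples may configure additional options.

```python
# examples/gsm8k_example.py - illustrative sketch of the structure vf-eval expects
from inspect_evals.gsm8k import gsm8k

import inspect_verifiers_bridge as bridge


def load_environment(**kwargs):
    """Entry point that vf-eval discovers and calls."""
    # Single-turn math reasoning, no sandbox needed
    return bridge.load_environment(gsm8k, env_type="single_turn", **kwargs)
```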
Main function to convert an Inspect task to a Verifiers environment.
def load_environment(
    task: Callable[..., Task],
    *,
    scoring_mode: Literal["live", "custom"] = "live",
    custom_reward_fn: Callable[..., float] | None = None,
    env_type: Literal["single_turn", "multi_turn"] = "single_turn",
    max_samples: int | None = None,
    max_turns: int = 10,
    sandbox_type: str | None = None,
    sandbox_config: str | None = None,
    include_bash: bool = True,
    include_submit: bool | None = None,
    **task_kwargs,
) -> vf.Environment:

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `task` | `Callable[..., Task]` | required | Inspect task function (e.g., `humaneval` from `inspect_evals`) |
| `scoring_mode` | `"live" \| "custom"` | `"live"` | Use Inspect scorers directly or provide custom reward |
| `custom_reward_fn` | `Callable` | `None` | Custom reward function (required if `scoring_mode="custom"`) |
| `env_type` | `"single_turn" \| "multi_turn"` | `"single_turn"` | Environment type (multi_turn requires sandbox) |
| `max_samples` | `int` | `None` | Limit number of samples from dataset |
| `max_turns` | `int` | `10` | Max turns for multi-turn environments |
| `sandbox_type` | `str` | `None` | Override sandbox type (`"docker"`, `"local"`) |
| `sandbox_config` | `str` | `None` | Path to sandbox config file |
| `include_bash` | `bool` | `True` | Include bash tool in sandbox environments |
| `include_submit` | `bool` | `None` | Include submit tool (auto: `True` for multi_turn) |
| `**task_kwargs` | `Any` | - | Arguments passed to the Inspect task function |
Environment Selection:

| env_type | sandbox | Result |
|---|---|---|
| `single_turn` | No | `SingleTurnEnv` |
| `single_turn` | Yes | `InspectSandboxEnv(max_turns=1)` |
| `multi_turn` | No | `NotImplementedError` |
| `multi_turn` | Yes | `InspectSandboxEnv(max_turns=N)` with submit tool |
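The selection rules can be exercised directly; the calls below are a hedged illustration (the task choices are examples, not requirements, and assume `gsm8k` declares no sandbox while `humaneval` does):

```python
from inspect_evals.gsm8k import gsm8k
from inspect_evals.humaneval import humaneval
from inspect_verifiers_bridge import load_environment

# No sandbox, single turn -> plain SingleTurnEnv
env = load_environment(gsm8k, env_type="single_turn")

# Sandbox + single turn -> InspectSandboxEnv capped at one turn
env = load_environment(humaneval, env_type="single_turn", sandbox_type="local")

# Multi-turn without any sandbox is rejected
try:
    load_environment(gsm8k, env_type="multi_turn")
except NotImplementedError:
    pass  # multi_turn requires a sandbox

# Sandbox + multi-turn -> InspectSandboxEnv with bash and submit tools
env = load_environment(humaneval, env_type="multi_turn", sandbox_type="docker", max_turns=10)
```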
Load and introspect an Inspect task without converting it.
from inspect_verifiers_bridge.tasks import load_inspect_task
task_info = load_inspect_task(humaneval)
print(f"Task: {task_info.name}")
print(f"Sandbox: {task_info.sandbox_type}")
print(f"Scorers: {len(task_info.scorers)}")Uses Inspect's native scorers directly. Supports all built-in scorers (exact, includes, match, model_graded_fact, etc.) and custom scorers.
env = load_environment(
    my_task,
    scoring_mode="live",
    sandbox_type="local",  # or "docker" for isolated execution
)

Provide your own reward function:
def my_reward(prompt, completion, answer, state, **kwargs):
    # prompt: list of message dicts
    # completion: list of message dicts (model response)
    # answer: expected answer string
    # state: dict containing 'info' with Inspect metadata
    return 1.0 if answer in str(completion) else 0.0

env = load_environment(
    my_task,
    scoring_mode="custom",
    custom_reward_fn=my_reward,
)

For tasks that require code execution (like HumanEval, APPS), the bridge supports:
- Docker sandbox: Full isolation, recommended for untrusted code
- Local sandbox: Faster, runs code directly on host
# Docker sandbox (default for tasks that specify sandbox="docker")
env = load_environment(humaneval, sandbox_type="docker")
# Local sandbox (faster, less isolated)
env = load_environment(humaneval, sandbox_type="local")When using InspectSandboxEnv, each rollout gets a fresh sandbox:
- setup_state(): Creates sandbox for this rollout
- Rollout loop: Model interacts with bash/submit tools
- Scoring: Scorer runs with sandbox context
- cleanup(): Sandbox destroyed after scoring completes
This ensures no state contamination between rollouts and supports concurrent execution via asyncio.gather().
For agentic tasks, use env_type="multi_turn":
env = load_environment(
    humaneval,
    env_type="multi_turn",
    max_turns=10,
    sandbox_type="local",
)

Available tools:
| Tool | Description |
|---|---|
| `_bash` | Execute bash commands in the sandbox |
| `_submit` | Submit final answer and end the rollout |
The model can use these tools iteratively until it calls _submit or reaches max_turns.
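As an illustration of that loop, a completed rollout's message history might look something like the sketch below. The contents are invented and the tool-call fields are abbreviated; it only shows the bash-then-submit pattern.

```python
# Invented transcript showing the _bash -> _submit loop (fields abbreviated)
rollout = [
    {"role": "user", "content": "Implement add(a, b) and check your work."},
    {"role": "assistant", "tool_calls": [  # the model probes the sandbox with _bash
        {"id": "call_1", "type": "function",
         "function": {"name": "_bash",
                      "arguments": '{"command": "python -c \'print(1 + 2)\'"}'}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "3"},
    {"role": "assistant", "tool_calls": [  # then it ends the rollout with _submit
        {"id": "call_2", "type": "function",
         "function": {"name": "_submit",
                      "arguments": '{"answer": "def add(a, b): return a + b"}'}}]},
    {"role": "tool", "tool_call_id": "call_2",
     "content": "Answer submitted: def add(a, b): return a + b"},
]
```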
The bridge converts Inspect Sample objects to HuggingFace dataset rows:
| Field | Type | Description |
|---|---|---|
| `prompt` | `list[dict]` | List of messages (always includes system prompt) |
| `answer` | `str \| None` | Target answer (converted to string) |
| `id` | `str \| int` | Sample identifier (auto-generated if not set) |
| `info` | `dict` | All Inspect metadata preserved |
The prompt field is always a list of message dicts with role and content keys. For chat inputs with tool calls, it also preserves tool_calls and tool_call_id.
The info dict contains:
- `inspect_sample_id`: Original sample ID
- `inspect_input_raw`: Original input (pre-solver)
- `inspect_target_raw`: Original target (may be list, dict, etc.)
- `inspect_choices`: Multiple choice options
- `inspect_metadata`: Sample metadata
- `inspect_sandbox`: Per-sample sandbox config
- `inspect_files`: Files to copy into sandbox
- `inspect_setup`: Setup script
- `inspect_task_name`: Task name
| Feature | Status | Notes |
|---|---|---|
| String input/output | ✅ | Full support |
| Chat message input | ✅ | Converted to message dicts |
| Multiple choice | ✅ | Choices preserved in info |
| Exact/includes/match scorers | ✅ | Full support |
| Model-graded scorers | ✅ | Requires API access |
| Sandbox scoring | ✅ | Docker and local |
| Custom scorers | ✅ | Full support |
| Single-turn environments | ✅ | SingleTurnEnv or InspectSandboxEnv |
| Multi-turn with tools | ✅ | InspectSandboxEnv with bash/submit |
| Per-rollout sandbox lifecycle | ✅ | Fresh sandbox per rollout |
| Multi-agent | ❌ | Out of scope |
Run the test suite:
uv run pytest tests/ -v

Tests cover (an illustrative sketch of one such test appears after this list):
- Dataset conversion (preserving all fields)
- Scoring comparison (bridge vs native Inspect)
- Environment creation
- Sandbox scoring (local and Docker)
- Concurrent sandbox execution
- Edge cases
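A dataset-conversion check, for example, might be sketched like this. It is illustrative only; the task choice, file name, and assertions are assumptions rather than the actual suite.

```python
# tests/test_dataset_conversion_sketch.py - not the real test file
from inspect_evals.gsm8k import gsm8k

from inspect_verifiers_bridge import load_environment


def test_converted_rows_keep_bridge_fields():
    env = load_environment(gsm8k, max_samples=3)
    assert len(env.dataset) == 3

    row = env.dataset[0]
    # prompt is always a list of role/content message dicts
    assert isinstance(row["prompt"], list)
    assert {"role", "content"} <= set(row["prompt"][-1].keys())
    # answer, id, and the preserved Inspect metadata are present
    assert isinstance(row["answer"], str)
    assert "id" in row
    assert "inspect_task_name" in row["info"]
```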
inspect_verifiers_bridge/
├── __init__.py       # Public API (load_environment)
├── loader.py         # Main loader and environment selection
├── environment.py    # InspectSandboxEnv with per-rollout lifecycle
├── tasks.py          # Task introspection utilities
├── dataset.py        # Sample → HuggingFace dataset conversion
├── ground_truth.py   # Solver execution for prompt construction
├── scoring.py        # Inspect scorer → Verifiers rubric bridge
└── sandbox.py        # Sandbox creation and context management

examples/
├── gsm8k_example.py                  # Single turn, no sandbox
├── humaneval_example.py              # Single turn with sandbox
└── humaneval_multiturn_example.py    # Multi-turn with tools
This section provides a detailed walkthrough of what happens when you call load_environment().
load_environment(task_fn)
        │
        ▼
1. TASK INTROSPECTION
   load_inspect_task(task_fn) → InspectTaskInfo
   Extracts: scorers, sandbox_type, solver_has_tools
        │
        ▼
2. DATASET CONVERSION
   inspect_dataset_to_hf(task, task_name) → HuggingFace Dataset
   Runs solver pipeline (without model) to get ground truth prompts
        │
        ▼
3. RUBRIC CREATION
   build_rubric_from_scorers(scorers) → Verifiers Rubric
   Wraps Inspect scorers in reward functions with sandbox context
        │
        ▼
4. ENVIRONMENT CREATION
   Based on env_type and sandbox:
   - SingleTurnEnv (no sandbox, single turn)
   - InspectSandboxEnv (sandbox, single or multi-turn)
Entry Point: loader.py
task_info = tasks.load_inspect_task(task, **task_kwargs)

What happens in load_inspect_task():
# tasks.py - Invoke the task function to get a Task object
task = task_fn(**task_kwargs)

# Extract sandbox type
sandbox_type = None
if task.sandbox is not None:
    if isinstance(task.sandbox, str):
        sandbox_type = task.sandbox  # e.g., "docker"
    elif hasattr(task.sandbox, "type"):
        sandbox_type = task.sandbox.type  # SandboxSpec object

# Normalize scorers to a list
scorers: list[Scorer] = []
if task.scorer is not None:
    if isinstance(task.scorer, list):
        scorers = task.scorer
    else:
        scorers = [task.scorer]

Returns: InspectTaskInfo dataclass with:
- `task`: The Inspect Task object
- `name`: Task name (e.g., "humaneval")
- `dataset`: Inspect Dataset
- `scorers`: List of scorer functions
- `sandbox_type`: `"docker" | "local" | None`
- `solver_has_tools`: `bool`
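The dataclass is roughly the following shape; this is a sketch whose field names come from the list above, with import paths that are assumptions rather than the bridge's actual definition in tasks.py:

```python
from dataclasses import dataclass

from inspect_ai import Task
from inspect_ai.dataset import Dataset
from inspect_ai.scorer import Scorer


@dataclass
class InspectTaskInfo:
    task: Task                  # the instantiated Inspect Task object
    name: str                   # e.g., "humaneval"
    dataset: Dataset            # the task's Inspect dataset
    scorers: list[Scorer]       # normalized list of scorer functions
    sandbox_type: str | None    # "docker", "local", or None
    solver_has_tools: bool      # whether the solver chain registers tools
```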
Entry Point: loader.py
hf_dataset = ds.inspect_dataset_to_hf(
    task_info.task,
    task_name=task_info.name,
    max_samples=max_samples,
)

What happens in inspect_dataset_to_hf():
For each sample, the solver pipeline is executed (without calling the model) to produce the ground truth prompt:
- Ground truth execution: Runs all prompt-engineering solvers (`system_message`, `prompt_template`, `chain_of_thought`, etc.), stopping before `generate()`
- Message extraction: Extracts the resulting `state.messages` list
- Serialization: Converts `ChatMessage` objects to dicts
Auto-ID generation: Samples without IDs (like GSM8K) automatically get index-based IDs.
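Conceptually, the ground-truth pass resembles the sketch below. This is not the bridge's actual ground_truth.py: the helper name and the no-op generate stand-in are assumptions that merely show how solvers can run without a model call.

```python
from inspect_ai.solver import Solver, TaskState


async def run_prompt_solvers(solvers: list[Solver], state: TaskState) -> TaskState:
    """Hypothetical helper: build the ground-truth prompt without calling a model."""

    async def no_generate(state: TaskState, **kwargs) -> TaskState:
        # Stand-in for generate(): returns the state untouched, so no model call happens
        return state

    for solve in solvers:
        state = await solve(state, no_generate)
    return state
```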
Output row structure:
{
    "prompt": [  # Always a list of messages
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a function to add two numbers..."},
    ],
    "answer": "def add(a, b): return a + b",  # String target
    "info": {
        "inspect_sample_id": "0",
        "inspect_input_raw": "Write a function...",  # Pre-solver input
        "inspect_target_raw": "def add...",
        "inspect_metadata": {},
        "inspect_sandbox": None,
        "inspect_files": {},
        "inspect_setup": None,
        "inspect_task_name": "humaneval",
    },
    "id": "0",
}

Entry Point: loader.py
if scoring_mode == "live":
    rubric = scoring.build_rubric_from_scorers(task_info.scorers)
elif scoring_mode == "custom":
    rubric = vf.Rubric(funcs=[custom_reward_fn])

Building reward functions from scorers:
# scoring.py
reward_funcs = []
for i, scorer in enumerate(scorers):
    # Wrap scorer in a partial function
    func = partial(reward_from_inspect_scorer, scorer=scorer)

    # Extract unique name from __qualname__
    # e.g., "verify.<locals>.score" → "verify"
    scorer_name = _get_scorer_name(scorer)

    # Add index for uniqueness (prevents metric overwriting)
    func.__name__ = f"inspect_{scorer_name}_{i}"
    reward_funcs.append(func)

return vf.Rubric(funcs=reward_funcs)

Reward function flow (during training):
reward_from_inspect_scorer(prompt, completion, answer, state)
    │
    ├── Build TaskState from Verifiers state
    │   - Uses state["info"]["inspect_input_raw"] for TaskState.input
    │   - Converts messages to Inspect ChatMessage objects
    │
    ├── Get sandbox context from state["_sandbox_envs"] (if present)
    │
    ├── Call Inspect scorer within sandbox context
    │
    └── Convert Score to float (0.0-1.0)
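In code, that flow is approximately the following. This is a simplified sketch: build_task_state is a hypothetical helper, the import path for sandbox_context is assumed, and the real scoring.py handles more score shapes.

```python
from inspect_ai.scorer import CORRECT, Target

from inspect_verifiers_bridge.sandbox import sandbox_context  # assumed import path


async def reward_from_inspect_scorer(prompt, completion, answer, state, *, scorer, **kwargs) -> float:
    # 1. Rebuild an Inspect TaskState from the Verifiers rollout state
    task_state = build_task_state(prompt, completion, state)  # hypothetical helper

    # 2./3. Run the scorer, inside the sandbox context when one exists
    sandboxes = state.get("_sandbox_envs")
    if sandboxes:
        async with sandbox_context(sandboxes):
            score = await scorer(task_state, Target(answer))
    else:
        score = await scorer(task_state, Target(answer))

    # 4. Map the Inspect Score onto a 0.0-1.0 reward
    if isinstance(score.value, (int, float)) and not isinstance(score.value, bool):
        return float(score.value)
    return 1.0 if score.value in (CORRECT, True) else 0.0
```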
Critical: ContextVar Setup for Concurrent Rollouts
When scoring with sandboxes, the sandbox_context() sets all three ContextVars that Inspect expects:
# sandbox.py (simplified)
from contextlib import asynccontextmanager

@asynccontextmanager
async def sandbox_context(sandboxes):
    # Sets all three ContextVars that Inspect expects:
    sandbox_environments_context_var.set(sandboxes)
    sandbox_default_context_var.set(default_name)
    sandbox_with_environments_context_var.set({})
    # Now sandbox() calls inside the scorer will work
    yield sandboxes

Why this matters: Verifiers runs multiple rollouts concurrently via asyncio.gather(). Each coroutine has its own ContextVar context. Without setting all three ContextVars per-coroutine, only the first rollout succeeds.
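A minimal sketch of the pattern (the helper and attribute names are invented): each coroutine enters sandbox_context() itself, so every rollout gathered by asyncio.gather() gets its own ContextVar values.

```python
import asyncio

from inspect_verifiers_bridge.sandbox import sandbox_context  # assumed import path


async def score_rollout(rollout, scorer):
    # Entering the context inside the coroutine scopes the ContextVars to this
    # rollout; per the note above, only the first rollout would succeed otherwise.
    async with sandbox_context(rollout.sandboxes):
        return await scorer(rollout.task_state, rollout.target)


async def score_all(rollouts, scorer):
    return await asyncio.gather(*(score_rollout(r, scorer) for r in rollouts))
```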
Entry Point: loader.py
if env_type == "single_turn":
    if effective_sandbox_type:
        return InspectSandboxEnv(
            dataset=hf_dataset,
            rubric=rubric,
            sandbox_config=SandboxConfig(...),
            task_name=task_info.name,
            max_turns=1,
            include_bash=include_bash,
            include_submit=False,
        )
    return vf.SingleTurnEnv(dataset=hf_dataset, rubric=rubric)

elif env_type == "multi_turn":
    if not effective_sandbox_type:
        raise NotImplementedError("Multi-turn requires sandbox")
    return InspectSandboxEnv(
        dataset=hf_dataset,
        rubric=rubric,
        sandbox_config=SandboxConfig(...),
        task_name=task_info.name,
        max_turns=max_turns,
        include_bash=include_bash,
        include_submit=True,  # Always for multi-turn
    )

InspectSandboxEnv extends vf.StatefulToolEnv and manages the per-rollout sandbox lifecycle:
Rollout Lifecycle

1. setup_state(state)
   ├── create_sandbox_for_sample() → SandboxInstance
   └── state["_sandbox_envs"] = instance.environments

2. Rollout loop (until @vf.stop triggers)
   ├── Model generates response
   ├── If tool_calls in response:
   │   └── env_response() → calls _bash/_submit via update_tool_args()
   └── Check stop conditions:
       ├── max_turns_reached
       └── answer_submitted (if _submit was called)

3. Scoring
   └── Scorer accesses sandbox via state["_sandbox_envs"]

4. @vf.cleanup: destroy_sandbox(state)
   └── cleanup_sandbox(instance) → removes container/files
Tools implementation:
class InspectSandboxEnv(vf.StatefulToolEnv):
    async def _bash(self, command: str, state) -> str:
        """Execute bash command in sandbox."""
        sandbox = next(iter(state["_sandbox_envs"].values()))
        result = await sandbox.exec(cmd=["bash", "-c", command])
        return result.stdout or "(no output)"

    async def _submit(self, answer: str, state) -> str:
        """Submit answer and trigger rollout end."""
        state["_submitted_answer"] = answer
        return f"Answer submitted: {answer}"

    @vf.stop(priority=10)
    async def answer_submitted(self, state) -> bool:
        """Stop when model calls submit tool."""
        return "_submitted_answer" in state

from inspect_evals.humaneval import humaneval
from inspect_verifiers_bridge import load_environment
# This call triggers the entire flow above
env = load_environment(
    humaneval,              # Step 1: Introspect task
    env_type="multi_turn",  # Step 4: Create InspectSandboxEnv
    scoring_mode="live",    # Step 3: Use Inspect scorers
    sandbox_type="local",   # Step 4: Configure sandbox
    max_samples=10,         # Step 2: Limit dataset
    max_turns=5,            # Step 4: Configure turns
)

# Result:
# - env.dataset: HuggingFace Dataset with 10 samples
# - env.rubric: Verifiers Rubric wrapping humaneval's verify() scorer
# - env.oai_tools: [_bash, _submit] tools in OpenAI format
# - Each rollout gets fresh sandbox, cleaned up after scoring

# Install dev dependencies
uv sync
# Run linting
uv run ruff check .
uv run basedpyright .
# Run tests
uv run pytest tests/ -v
# Format code
uv run ruff format .