OpenRubric: Multi‑Step Rubric Evaluation

Python 3.11+ | License: MIT

OpenRubric is a multi-step rubric evaluation framework built on top of verifiers. It starts from the observation that real-world problems are typically full of branching logic, so single-step rubric evaluation is insufficient for complex problems. OpenRubric extends verifiers to support multi-step rubrics that can mix judges and verifiable rewards in single- or multi-turn scenarios.

We also develop a synthetic data generation pipeline that can generate hidden descriptions and full scenarios from any rubric, and a GUI builder that makes it easier for researchers to create and edit rubrics. Finally, we show example training runs using OpenRubric on PrimeIntellect nodes.

Most functionality is built into verifiers/rubrics/multistep/ and verifiers/envs/multistep_env.py. multistep_extras/ contains tools for developing and visualizing rubrics, generating synthetic data, example rubrics, and training on PrimeIntellect.

Features

  • Branching rubrics: Requirements connected as a DAG; paths advance on correct judgments
  • Judge‑driven progression: External judge determines correctness and reveals next steps
  • Flexible scoring: Level‑weighted, progressive, mean, sum, and more
  • GUI builder: Visual editor and YAML import/export
  • Synthetic data: Generate hidden descriptions and full scenarios; optional HF Hub push
  • Training: Example training runs using OpenRubric on PrimeIntellect

Installation

git clone https://github.com/jacobphillips99/open-rubric.git
cd open-rubric
uv sync --extra all

# Optional: GPU‑optimized attention
uv pip install flash-attn --no-build-isolation

Configure credentials as needed (OPENAI, ANTHROPIC, wandb, huggingface-cli). For remote setup, see install.sh.
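Before running anything, it can help to confirm that the relevant credentials are actually exported. A tiny sanity check, assuming keys are provided as standard environment variables (HF_TOKEN here is an example name; adjust for your setup):

import os

# Hypothetical sanity check: report which credentials have not been exported yet.
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "WANDB_API_KEY", "HF_TOKEN"):
    status = "set" if os.environ.get(key) else "missing"
    print(f"{key}: {status}")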

Core Concepts

  • MultiStepRubric: A DAG of named requirements mixing judges and verifiable rewards
  • MultiStepMultiTurnEnv: A turn-based environment that progressively reveals information as Requirements are satisfied
  • Requirement: A question used to evaluate a model response (binary, discrete, or continuous), with conditional dependencies on other requirements
  • Scenario: Prompt + model completion + ground‑truth answers with an optional hidden seed description
  • Judge: Scores a requirement given the prompt/completion/answer
  • Reward strategy: Aggregates per‑requirement results into a single score

Quick Start

Minimal branching example with two paths and an async evaluation call you can run directly.

First, set up the Requirement objects, each of which contains a question to evaluate and conditional dependencies.

from verifiers.rubrics.multistep.requirement import BinaryRequirement

# Define a tiny workflow of branching requirements
requirements = [
    BinaryRequirement(
        name="scene_safety",
        question="Does the response prioritize scene safety before approaching?",
        dependencies={
            1.0: ["patient_consciousness"],
            0.0: [],
        },
    ),
    BinaryRequirement(
        name="patient_consciousness",
        question="Does the response assess if the patient is conscious?",
        dependencies={
            1.0: ["communication"],
            0.0: ["airway_management"],
        },
    ),
    BinaryRequirement(name="communication", question="Do they communicate with the patient?"),
    BinaryRequirement(name="airway_management", question="Do they secure the airway?"),
]
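As a quick check that the branching is wired the way you expect, you can walk the dependency structure directly. This is just a sketch, assuming the name and dependencies attributes mirror the constructor arguments above:

# Sketch: enumerate the root-to-leaf paths implied by the dependencies above.
by_name = {req.name: req for req in requirements}

def walk(name: str, path: list[str]) -> None:
    req = by_name[name]
    deps = getattr(req, "dependencies", None) or {}
    if not any(deps.values()):
        # Terminal requirement: print the full path
        print(" -> ".join(path + [name]))
        return
    for judgment, next_names in deps.items():
        if not next_names:
            print(" -> ".join(path + [name]) + f" (stop on {judgment})")
        for nxt in next_names:
            walk(nxt, path + [name])

walk("scene_safety", [])
# scene_safety -> patient_consciousness -> communication
# scene_safety -> patient_consciousness -> airway_management
# scene_safety (stop on 0.0)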

Next, we create a MultiStepRubric container, which organizes the requirements into a DAG and adds judge and verifiable reward objects.

rubric = MultiStepRubric(requirements, [BinaryJudgeRewarder(JUDGE_PROMPT)])

We then create a Scenario object, which contains a prompt, model completion, and ground-truth answers.

scenario = Scenario(
    prompt="You arrive at a minor crash; the scene is safe.",
    completion="I confirm safety, speak to the patient, and proceed.",
    answers={
        "scene_safety": {"answer": 1.0},
        "patient_consciousness": {"answer": 1.0},
        "communication": {"answer": 1.0},
    },
)

Now we can create an Environment object, which contains the logic for rolling out a model completion and evaluating it against the scenario and rubric.

env = MultiStepMultiTurnEnv(multistep_rubric=rubric)
# Roll out the policy model through the branched, multi-turn environment
completion, state = env.rollout(client, model, prompt, answer, kwargs)
# Aggregate the per-requirement judgments into scores for the rollout
scores = rubric.score_rollout(prompt, completion, answers, state)

The environment queries the policy model for a completion, then evaluates it against the rubric, judging each requirement as it is revealed.
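How those per-requirement judgments collapse into one number is controlled by the reward strategy (level-weighted, progressive, mean, sum, and so on). As a rough illustration only, not the library's implementation, a level-weighted strategy might weight each evaluated requirement by its depth in the DAG:

# Illustrative only: one way a level-weighted aggregation could work.
# Depths follow the example DAG; the library's reward strategies are configurable.
depths = {"scene_safety": 0, "patient_consciousness": 1, "communication": 2, "airway_management": 2}
judgments = {"scene_safety": 1.0, "patient_consciousness": 1.0, "communication": 1.0}

weights = {name: depth + 1 for name, depth in depths.items() if name in judgments}
score = sum(judgments[name] * w for name, w in weights.items()) / sum(weights.values())
print(f"level-weighted score: {score:.2f}")  # 1.00 when every evaluated requirement is satisfied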

Examples

These tiny, runnable snippets illustrate why multistep, judge‑driven evaluation and explicit formats matter. They avoid any external API calls.

  1. Branching requirements in 12 lines

    from verifiers.rubrics.multistep.requirement import BinaryRequirement
    
    # A root requirement that branches to different next steps depending on the answer
    scene_safety = BinaryRequirement(
        name="scene_safety",
        question="Does the responder confirm scene safety first?",
        dependencies={
            1.0: ["communicate"],   # if correct → proceed to communication
            0.0: ["airway"],        # if incorrect → force airway management
        },
    )

    The rubric encodes domain logic as a DAG. The evaluation doesn’t just “score once”; it routes the workflow, unlocking different downstream checks based on the exact scenario.

  2. Explicit judge formats catch errors early

    from verifiers.rewards.judge_utils import binary_judge_response_format
    
    print(str(binary_judge_response_format).split('\n')[0])  # Shows exact JSON format required
    
    # Good: matches schema and options
    good = '{"answer": 1.0, "reasoning": "Responder confirmed safety."}'
    print(binary_judge_response_format.convert(good))  # JudgeResponse(answer=1.0, ...)
    
    # Bad: out-of-range value raises a helpful error
    bad = '{"answer": 2.0, "reasoning": "Not allowed"}'
    try:
        binary_judge_response_format.convert(bad)
    except Exception as e:
        print(type(e).__name__, str(e)[:80])

    Why it’s interesting: Judges must return structured JSON, so you get reliable, debuggable signals instead of brittle free‑form text.

  3. Generate synthetic datasets (hidden → scenarios)

    Turn rubric‑aware hidden descriptions into validated scenarios ready for analysis or Hub push.

    • Start with rubric‑consistent hidden seed descriptions
    • Expand each seed into full scenarios (prompt + per‑requirement answers), preserving branching
    • Validate against judge schemas and requirement names; drop invalid/duplicate items (see the validation sketch after this list)
    • Save two splits: hidden_descriptions.json and synthetic_scenarios.{yaml,jsonl}
  4. From synthetic → verifiers training on PrimeIntellect

    Train with GRPO using MultiStepMultiTurnEnv; the policy rolls out through branched steps, judges score requirements, and rewards aggregate to a scalar.

    • Run an inference server (e.g., vLLM) and a trainer process
    • Load your rubric and scenarios; perform rubric‑aware rollouts
    • Judges produce per‑requirement scores; a reward strategy aggregates them
    • GRPO uses rewards + logprobs to update the policy, logging metrics and checkpoints

    See verifiers/examples/first_responder_train.py for runnable commands.
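To make the validation step in example 3 concrete, here is a minimal sketch of the kind of filtering involved; the field names and exact checks are assumptions, and the real pipeline lives in multistep_extras/synthetic/:

# Sketch: keep only scenarios whose answers reference known requirements
# and use valid binary values; drop exact duplicates.
requirement_names = {"scene_safety", "patient_consciousness", "communication", "airway_management"}

def is_valid(scenario: dict) -> bool:
    answers = scenario.get("answers", {})
    return bool(answers) and all(
        name in requirement_names and entry.get("answer") in (0.0, 1.0)
        for name, entry in answers.items()
    )

def dedupe(scenarios: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for s in scenarios:
        key = (s.get("prompt"), tuple(sorted(s.get("answers", {}))))
        if is_valid(s) and key not in seen:
            seen.add(key)
            kept.append(s)
    return kept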

Workflow Builder (GUI)

Creating and editing rubrics by hand can be tedious, so we provide a GUI builder. Start it with:

streamlit run multistep_extras/builders/rubric_gui.py

The builder supports visual editing, judge configuration, reward strategy selection, and YAML import/export.

Rubric Builder

The sidebar allows you to load and save rubrics, and the main panel allows you to build judges, create requirements, compare reward strategies, and even develop new scenarios. We also visualize the paths through the rubric and examine initial, branching, and terminal states.

Multistep Rubric

Synthetic Data Generation

Generate scenarios from any rubric. See multistep_extras/synthetic/README.md for details.

Outputs include two splits: hidden (hidden descriptions) and scenarios (full scenarios). Local JSONL files are written under outputs/hf/.
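Assuming the default output locations above, loading the two splits back for a quick inspection looks roughly like this:

import json
from pathlib import Path

# File names follow the splits described above; adjust paths if you changed the output directory.
out = Path("outputs/hf")
hidden = json.loads((out / "hidden_descriptions.json").read_text())
scenarios = [json.loads(line) for line in (out / "synthetic_scenarios.jsonl").read_text().splitlines() if line]
print(len(hidden), "hidden descriptions;", len(scenarios), "scenarios")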

Training on Prime Intellect

We adapt the training examples from verifiers to train better rubric-following policy models. We focus on the first_responder rubric, which simulates a complex, real-world medical decision-making workflow. Following the instructions in verifiers, we recommend renting Prime Intellect nodes.

Tips and Tricks:

  • We use 2xH100s for setup and testing and 8xH100s for training; we recommend pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime as the base image. If you're going to run many experiments, you may want to set up a Template.
  • ssh -A to forward your local SSH agent to the remote node
  • NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 to avoid NCCL issues and for easier debugging
  • Don't forget to set OPENAI_API_KEY and WANDB_API_KEY
  • You may need to eval "$(ssh-agent -s)" && ssh-add path/to/your/ssh/key

For remote setup:

git clone https://github.com/jacobphillips99/open-rubric.git
cd open-rubric
chmod +x install.sh
./install.sh

This runs the one-shot setup: it installs uv and basic tools, configures tmux, installs project dependencies (optionally flash-attn), and installs OpenRubric. After this, simply open tmux windows to run the training commands with your arguments (one window with vf-vllm ... and another with accelerate ...).

Citation

@software{openrubric2025,
  title={OpenRubric: Multi-Step Rubric Evaluation Framework},
  author={Phillips, Jacob},
  year={2025},
  url={https://github.com/jacobphillips99/open-rubric}
}

Acknowledgments

This project was developed by Jacob Phillips as a part of the Andreessen Horowitz American Dynamism Engineering Fellows program. Special thanks to the American Dynamism team for their support and feedback on the project.

OpenRubric is built on top of verifiers by Will Brown. Thank you to Will and the rest of the PrimeIntellect team for open-sourcing their work.

If using OpenRubric in your work, please cite it to acknowledge the authors using the BibTeX entry in the Citation section above.
