
Add more evals #51

Merged: mruwnik merged 6 commits into main from more-evals, Feb 10, 2026

Conversation

@mruwnik (Collaborator) commented Jan 1, 2026

Changes

Implements the following evaluations:

- WMDP
- Bio-chem-cyber-propensity
- Policy-oriented benchmark
- XSTest

This also extracts some common generation/metrics functions and switches generation over to vLLM.

Testing

I extracted the model and reported result from each evaluation paper, then ran that eval on that model to see if the result is the same(ish). I was somewhat stricter than needed; the results are below:

| Eval | Status | Key Metric | Error | Notes |
|---|---|---|---|---|
| WMDP | ✅ PASSED | Bio: 62.55%, Cyber: 43.14%, Chem: 45.05% | <2% | Fixed answer extraction |
| MT-Bench | ✅ PASSED | Score: 7.04 (expected 7.33) | 3.95% | Fixed judge model to gpt-4-0613 |
| XSTest | ❌ FAILED | Safe: 1.6%, Unsafe: 36.0% | 0% overall | Overall rates match perfectly, but partial/full classification differs |
| PolicyEval | ❌ FAILED | Harm score: 1.02 (expected 1.05) | 3.28% score, 19.16% rate | Harm score OK, harm rate threshold slightly different |
| StrongREJECT | ❌ FAILED | Score: 0.049 (expected 0.08) | 39% | Model MORE safe than expected |

"""

judge_model: str = "gpt-4o"
judge_model: str = "gpt-4-0613" # Match reference: WMDP paper uses GPT-4-0613

To clarify: this is saying, the WMDP paper's use of MT-bench uses GPT-4-0613?

I also wonder whether we should just stick with gpt-4o: it's cheaper, and while I'm having trouble finding information on how gpt-4-0613 performs on benchmarks, I'm under the impression that gpt-4o performs similarly to gpt-4.


```python
    Args:
        eval_config: Configuration for the evaluation.
    """
```

hmm, why did the Args: and Returns: get removed from all the docstrings throughout the file? is it that the superclass already documents the params?

same question in strong_reject.py as well


note there will unfortunately be a bunch of merge conflicts to resolve — some may be from #58 but most will be from #53

"""Configuration for Policy-Oriented Safety Evaluation.

Attributes:
judge_model: Model to use for LLM-as-a-Judge scoring (default: gpt-4o).

The file header docstring says we use GPT-4, as do the class docstring on PolicyEvaluation and the compute_scores() docstring, but here we actually default to gpt-4o.

Update the headers to say 4o instead? Or just don't reference a specific GPT model in the docstrings at all, since it's configurable: say we use an LLM-as-judge, and document the specific model here in PolicyEvaluationConfig.

```python
)
content = response.choices[0].message.content or ""
match = re.search(score_pattern, content, re.IGNORECASE)
score = float(match.group(1)) if match else default_score
```
I wonder if we should log some kind of judge failure rather than silently giving the default_score. Then whoever is processing the data can decide what to do with the failures — either filter out the failures, or replace them with some default score.

Comment on lines +376 to +377:

```python
# Last resort: default to first choice
best_choice = choices[0]
```

similarly to judge failures, I wonder if we should log failures explicitly and leave it to the analysis afterwards to determine what to do with them. Though I do think defaulting to random guessing is probably reasonable

```python
    self, scores: DataFrame[ScoreSchema]
) -> DataFrame[EvaluationSchema]:
    """Aggregate scores into refusal rate metrics."""
    scores_df = scores.to_pandas()
```

is converting to pandas necessary? My impression is that polars is meant to mostly replace pandas

```python
This evaluation implements the policy-oriented safety benchmark from Qi et al. (2023),
which measures harmfulness of model outputs using GPT-4 as a judge.

The benchmark consists of 330 harmful instructions across 11 categories based on
```

I believe this is https://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHI, though I grepped for HEx-PHI in the codebase and didn't see it — where is the dataset loaded?

```python
    return ScoreSchema.validate(pl.DataFrame(scores))

# Use LLM judge
def parse_class(content: str) -> float:
```

unused

```python
config = WMDPEvaluationConfig(
    model_checkpoint=model_id,
    out_dir=out_dir,
    max_generation_length=1,
```

I wonder if we can set the default max_generation_length to be 1 in WMDPEvaluationConfig since so long as WMDP is using logprobs, we always want to set max_generation_length=1. Not sure if that's actually possible though, can a subclass set a default value to a parent class's arg? So maybe no action needed here.
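It is possible: a dataclass subclass can re-declare a parent field with a new default. A minimal sketch with hypothetical config classes (not the actual ones in this repo):

```python
from dataclasses import dataclass


@dataclass
class EvaluationConfig:
    model_checkpoint: str
    max_generation_length: int = 512


@dataclass
class WMDPEvaluationConfig(EvaluationConfig):
    # Re-declaring the field overrides the parent's default,
    # so WMDP callers no longer need to pass max_generation_length=1.
    max_generation_length: int = 1
```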

@mruwnik mruwnik merged commit ca1009d into main Feb 10, 2026
2 checks passed