
Add more evals #51

Merged: mruwnik merged 6 commits into main from more-evals, Feb 10, 2026

Conversation

@mruwnik (Collaborator) commented Jan 1, 2026

Changes

Implements the following evaluations:

- WMDP
- Bio-chem-cyber-propensity
- Policy-oriented benchmark
- XSTest

This also extracts some common generation/metrics functions and switches generation over to vLLM.

Testing

I extracted the model and reported result from each evaluation paper, then ran that eval on that model to see if the result is the same(ish). I was somewhat stricter than needed; the results are below:

| Eval | Status | Key Metric | Error | Notes |
|---|---|---|---|---|
| WMDP | ✅ PASSED | Bio: 62.55%, Cyber: 43.14%, Chem: 45.05% | <2% | Fixed answer extraction |
| MT-Bench | ✅ PASSED | Score: 7.04 (expected 7.33) | 3.95% | Fixed judge model to gpt-4-0613 |
| XSTest | ❌ FAILED | Safe: 1.6%, Unsafe: 36.0% | 0% overall | Overall rates match perfectly, but partial/full classification differs |
| PolicyEval | ❌ FAILED | Harm score: 1.02 (expected 1.05) | 3.28% score, 19.16% rate | Harm score OK, harm rate threshold slightly different |
| StrongREJECT | ❌ FAILED | Score: 0.049 (expected 0.08) | 39% | Model MORE safe than expected |

"""

judge_model: str = "gpt-4o"
judge_model: str = "gpt-4-0613" # Match reference: WMDP paper uses GPT-4-0613

To clarify: this is saying, the WMDP paper's use of MT-bench uses GPT-4-0613?

I also wonder whether we should just stick with gpt-4o: it's cheaper, and while I'm having trouble finding information on how gpt-4-0613 performs on benchmarks, I'm under the impression that gpt-4o performs similarly to gpt-4.


```python
    Args:
        eval_config: Configuration for the evaluation.
    """
```

hmm, why did the Args: and Returns: get removed from all the docstrings throughout the file? is it that the superclass already documents the params?

same question in strong_reject.py as well


note there will unfortunately be a bunch of merge conflicts to resolve — some may be from #58 but most will be from #53

"""Configuration for Policy-Oriented Safety Evaluation.

Attributes:
judge_model: Model to use for LLM-as-a-Judge scoring (default: gpt-4o).

The file header docstring says we use GPT-4, as do the class docstring on PolicyEvaluation and the compute_scores() docstring, but here we actually default to gpt-4o.

Update the headers to say 4o instead? Or just don't reference a specific GPT model in the docstrings at all, since it's configurable: say we use an LLM-as-judge, and document the specific model here in PolicyEvaluationConfig.

```python
)
content = response.choices[0].message.content or ""
match = re.search(score_pattern, content, re.IGNORECASE)
score = float(match.group(1)) if match else default_score
```
I wonder if we should log some kind of judge failure rather than silently giving the default_score. Then whoever is processing the data can decide what to do with the failures — either filter out the failures, or replace them with some default score.

Comment on lines +376 to +377:

```python
# Last resort: default to first choice
best_choice = choices[0]
```

similarly to judge failures, I wonder if we should log failures explicitly and leave it to the analysis afterwards to determine what to do with them. Though I do think defaulting to random guessing is probably reasonable

```python
    self, scores: DataFrame[ScoreSchema]
) -> DataFrame[EvaluationSchema]:
    """Aggregate scores into refusal rate metrics."""
    scores_df = scores.to_pandas()
```

is converting to pandas necessary? My impression is that polars is meant to mostly replace pandas

```python
This evaluation implements the policy-oriented safety benchmark from Qi et al. (2023),
which measures harmfulness of model outputs using GPT-4 as a judge.

The benchmark consists of 330 harmful instructions across 11 categories based on
```

I believe this is https://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHI, though I grepped for HEx-PHI in the codebase and didn't see it — where is the dataset loaded?

```python
    return ScoreSchema.validate(pl.DataFrame(scores))

# Use LLM judge
def parse_class(content: str) -> float:
```

unused

```python
config = WMDPEvaluationConfig(
    model_checkpoint=model_id,
    out_dir=out_dir,
    max_generation_length=1,
```

I wonder if we can set the default max_generation_length to be 1 in WMDPEvaluationConfig since so long as WMDP is using logprobs, we always want to set max_generation_length=1. Not sure if that's actually possible though, can a subclass set a default value to a parent class's arg? So maybe no action needed here.
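It is possible: a dataclass subclass can re-declare a parent field with a new default. A minimal sketch with hypothetical config classes (not the actual ones in this repo):

```python
from dataclasses import dataclass


@dataclass
class EvaluationConfig:
    model_checkpoint: str
    max_generation_length: int = 512


@dataclass
class WMDPEvaluationConfig(EvaluationConfig):
    # Re-declaring the field overrides the parent's default,
    # so WMDP callers no longer need to pass max_generation_length=1.
    max_generation_length: int = 1
```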

@mruwnik mruwnik merged commit ca1009d into main Feb 10, 2026
2 checks passed