evals: Add rubric-based StrongREJECT by tomtseng · Pull Request #96 · criticalml-uw/TamperBench

tomtseng · 2026-02-14T00:30:05Z

Changes

We ran the rubric-based StrongREJECT for the paper but never committed it. Here's Claude Code's attempt at an implementation

Probably I should close this PR though, I'm assuming Saad has an implementation he used for the StrongREJECT rubric results

Integrate the rubric-based StrongREJECT evaluator (strongreject_rubric from the strong_reject package) as the canonical StrongRejectEvaluation, registered under EvalName.STRONG_REJECT
Rename the existing fine-tuned classifier to StrongRejectFinetunedEvaluation (EvalName.STRONG_REJECT_FINETUNED) — the fine-tuned model was trained on labels from the rubric judge, so the rubric version is the more authoritative
scorer
Extract shared inference/prompt-loading logic into _StrongRejectBaseEvaluation to avoid duplication between the two variants

Testing

pytest tests/evals/test_strong_reject.py — rubric test mocks the API so no OpenAI key needed; finetuned test requires GPU
Verify attack tests still reference STRONG_REJECT_FINETUNED correctly (all are @pytest.mark.expensive / GPU-only)
Verify scripts/validate_references.py --list still prints StrongREJECT references

sdhossain · 2026-02-14T00:40:13Z

@tomtseng in case if it is of any help, here is the code I used for strongreject rubric: https://github.com/criticalml-uw/TamperBench/tree/sh/cais_checkpoint/scripts/strong_reject_rubric_backup

The main two files of concern are:

But there are other files in that directory that are relevant (like converting the embedding attack to also use strongreject, etc. which isn't an extremely elegant implementation, but kind of stand-alone scripts)

sdhossain · 2026-02-14T00:44:14Z

src/tamperbench/whitebox/evals/strong_reject/strong_reject.py

+class StrongRejectFinetunedEvaluation(_StrongRejectBaseEvaluation[S]):
+    """StrongREJECT Evaluation class using the fine-tuned classifier."""
+
+    name: EvalName = EvalName.STRONG_REJECT_FINETUNED


idea: maybe we can have them as same names, but different metrics?

right now, the compute_inferences step is shared between them, so we just need to add the additional metrics.

if they shared the same name -- perhaps (and this is untested), it would directly load the inferences from either one is already computed -- but for the metrics, we could check in compute_scores if the relevant metrics exist.

sdhossain · 2026-02-14T00:44:40Z

src/tamperbench/whitebox/evals/strong_reject/strong_reject.py

@@ -87,54 +144,65 @@ def compute_scores(self, inferences: DataFrame[InferenceSchema]) -> DataFrame[Sc
        inferences_dataframe: DataFrame[InferenceSchema] = InferenceSchema.validate(inferences)


we could for example check here (And vice-versa for the other, if we had different metric names)

tomtseng added 2 commits February 13, 2026 16:28

evals: Add rubric-based StrongREJECT

f9a0f32

strong_reject: Fix configs

8cd34b2

sdhossain reviewed Feb 14, 2026

View reviewed changes

strong_reject: Document

bf841a1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

evals: Add rubric-based StrongREJECT#96

evals: Add rubric-based StrongREJECT#96
tomtseng wants to merge 3 commits intomainfrom
tomtseng/sr-rubric

tomtseng commented Feb 14, 2026 •

edited

Loading

Uh oh!

sdhossain commented Feb 14, 2026 •

edited

Loading

Uh oh!

sdhossain Feb 14, 2026

Uh oh!

sdhossain Feb 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -87,54 +144,65 @@ def compute_scores(self, inferences: DataFrame[InferenceSchema]) -> DataFrame[Sc
		inferences_dataframe: DataFrame[InferenceSchema] = InferenceSchema.validate(inferences)

Conversation

tomtseng commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Testing

Uh oh!

sdhossain commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sdhossain Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

sdhossain Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomtseng commented Feb 14, 2026 •

edited

Loading

sdhossain commented Feb 14, 2026 •

edited

Loading