Skip to content

evals: Add rubric-based StrongREJECT#96

Draft
tomtseng wants to merge 3 commits intomainfrom
tomtseng/sr-rubric
Draft

evals: Add rubric-based StrongREJECT#96
tomtseng wants to merge 3 commits intomainfrom
tomtseng/sr-rubric

Conversation

@tomtseng
Copy link
Collaborator

@tomtseng tomtseng commented Feb 14, 2026

Changes

We ran the rubric-based StrongREJECT for the paper but never committed it. Here's Claude Code's attempt at an implementation

Probably I should close this PR though, I'm assuming Saad has an implementation he used for the StrongREJECT rubric results

  • Integrate the rubric-based StrongREJECT evaluator (strongreject_rubric from the strong_reject package) as the canonical StrongRejectEvaluation, registered under EvalName.STRONG_REJECT
  • Rename the existing fine-tuned classifier to StrongRejectFinetunedEvaluation (EvalName.STRONG_REJECT_FINETUNED) — the fine-tuned model was trained on labels from the rubric judge, so the rubric version is the more authoritative
    scorer
  • Extract shared inference/prompt-loading logic into _StrongRejectBaseEvaluation to avoid duplication between the two variants

Testing

  • pytest tests/evals/test_strong_reject.py — rubric test mocks the API so no OpenAI key needed; finetuned test requires GPU
  • Verify attack tests still reference STRONG_REJECT_FINETUNED correctly (all are @pytest.mark.expensive / GPU-only)
  • Verify scripts/validate_references.py --list still prints StrongREJECT references

@sdhossain
Copy link
Collaborator

sdhossain commented Feb 14, 2026

@tomtseng in case if it is of any help, here is the code I used for strongreject rubric: https://github.com/criticalml-uw/TamperBench/tree/sh/cais_checkpoint/scripts/strong_reject_rubric_backup

The main two files of concern are:

But there are other files in that directory that are relevant (like converting the embedding attack to also use strongreject, etc. which isn't an extremely elegant implementation, but kind of stand-alone scripts)

class StrongRejectFinetunedEvaluation(_StrongRejectBaseEvaluation[S]):
"""StrongREJECT Evaluation class using the fine-tuned classifier."""

name: EvalName = EvalName.STRONG_REJECT_FINETUNED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea: maybe we can have them as same names, but different metrics?

right now, the compute_inferences step is shared between them, so we just need to add the additional metrics.

if they shared the same name -- perhaps (and this is untested), it would directly load the inferences from either one is already computed -- but for the metrics, we could check in compute_scores if the relevant metrics exist.

@@ -87,54 +144,65 @@ def compute_scores(self, inferences: DataFrame[InferenceSchema]) -> DataFrame[Sc
inferences_dataframe: DataFrame[InferenceSchema] = InferenceSchema.validate(inferences)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could for example check here (And vice-versa for the other, if we had different metric names)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants