Conversation
```diff
  """
- judge_model: str = "gpt-4o"
+ judge_model: str = "gpt-4-0613"  # Match reference: WMDP paper uses GPT-4-0613
```
To clarify: is this saying that the WMDP paper's use of MT-Bench uses GPT-4-0613?
I also wonder whether we should just stick with 4o; it's cheaper, and while I'm having trouble finding information on how gpt-4-0613 performs on benchmarks, I'm under the impression that gpt-4o performs similarly to gpt-4.
```python
    Args:
        eval_config: Configuration for the evaluation.
    """
```
Hmm, why did the Args: and Returns: sections get removed from all the docstrings throughout the file? Is it that the superclass already documents the params?
Same question for strong_reject.py.
```python
"""Configuration for Policy-Oriented Safety Evaluation.

Attributes:
    judge_model: Model to use for LLM-as-a-Judge scoring (default: gpt-4o).
```
The file header docstring says we use GPT-4, as do the class docstring on PolicyEvaluation and the function docstring on compute_scores(), but here we actually use gpt-4o.
Update the headers to say 4o instead? Or just don't reference a specific GPT model in the docstrings, since it's configurable: say we use an LLM-as-judge, and document the specific model here in PolicyEvaluationConfig.
```python
)
content = response.choices[0].message.content or ""
match = re.search(score_pattern, content, re.IGNORECASE)
score = float(match.group(1)) if match else default_score
```
I wonder if we should log some kind of judge failure rather than silently assigning default_score. Then whoever is processing the data can decide what to do with the failures: either filter them out, or replace them with some default score.
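As a minimal sketch of what that could look like (the names `JudgeResult` and `parse_judge_score` are hypothetical, not from this PR):

```python
import logging
import re
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class JudgeResult:
    score: Optional[float]  # None marks a judge failure instead of a silent default
    raw_content: str


def parse_judge_score(content: str, score_pattern: str) -> JudgeResult:
    """Parse the judge's score; record failures rather than hiding them."""
    match = re.search(score_pattern, content, re.IGNORECASE)
    if match is None:
        logger.warning("Judge output did not match score pattern: %r", content[:200])
        return JudgeResult(score=None, raw_content=content)
    return JudgeResult(score=float(match.group(1)), raw_content=content)
```

Downstream analysis can then filter out rows where `score is None`, or impute a default, instead of a failure being indistinguishable from a real score.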
```python
# Last resort: default to first choice
best_choice = choices[0]
```
Similarly to the judge failures, I wonder if we should log these failures explicitly and leave it to the analysis afterwards to determine what to do with them. Though I do think defaulting to a random guess is probably reasonable.
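A sketch of that combination, with a logged failure flag and a seeded random guess instead of always taking the first choice (`pick_choice` and its signature are hypothetical):

```python
import logging
import random
from typing import Optional

logger = logging.getLogger(__name__)


def pick_choice(
    parsed: Optional[str],
    choices: list[str],
    rng: random.Random,
) -> tuple[str, bool]:
    """Return (choice, failed); on parse failure, log it and guess randomly."""
    if parsed is not None and parsed in choices:
        return parsed, False
    logger.warning("Could not map model output to a choice; guessing randomly")
    return rng.choice(choices), True
```

Keeping the `failed` flag in the output lets the analysis filter guesses out or keep them, and the seeded `rng` keeps runs reproducible.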
```python
    self, scores: DataFrame[ScoreSchema]
) -> DataFrame[EvaluationSchema]:
    """Aggregate scores into refusal rate metrics."""
    scores_df = scores.to_pandas()
```
Is converting to pandas necessary? My impression is that polars is meant to mostly replace pandas.
```
This evaluation implements the policy-oriented safety benchmark from Qi et al. (2023),
which measures harmfulness of model outputs using GPT-4 as a judge.

The benchmark consists of 330 harmful instructions across 11 categories based on
```
I believe this is https://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHI, though I grepped for HEx-PHI in the codebase and didn't find it. Where is the dataset loaded?
```python
    return ScoreSchema.validate(pl.DataFrame(scores))


# Use LLM judge
def parse_class(content: str) -> float:
```
```python
config = WMDPEvaluationConfig(
    model_checkpoint=model_id,
    out_dir=out_dir,
    max_generation_length=1,
```
I wonder if we can make the default max_generation_length be 1 in WMDPEvaluationConfig, since as long as WMDP uses logprobs we always want max_generation_length=1. Not sure that's actually possible, though: can a subclass set a default value for a parent class's arg? So maybe no action needed here.
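For what it's worth, a subclass can override a parent field's default by redeclaring it; a minimal sketch assuming the configs are dataclasses (the real base class and its fields may differ):

```python
from dataclasses import dataclass


@dataclass
class EvaluationConfig:
    model_checkpoint: str
    max_generation_length: int = 512  # illustrative base default


@dataclass
class WMDPEvaluationConfig(EvaluationConfig):
    # Redeclaring the field overrides the parent's default: since WMDP
    # scores via logprobs, one generated token is always enough.
    max_generation_length: int = 1
```

The same pattern works for pydantic BaseModel subclasses, so the override should be possible regardless of which config framework these use.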
Changes
Implements the following evaluations:
- WMDP
- Bio-chem-cyber propensity
- Policy-oriented safety benchmark
- XSTest

This also extracts some common generation/metrics functions and switches generation to vLLM.
Testing
I extracted the model and reported result from each evaluation paper, then ran that eval on that model to check whether the result is the same(ish). I was somewhat stricter than needed; the results are below: