
eval: GPQA Evaluation #30

Open
MKowal2 wants to merge 2 commits into main from mk/gpqa

Conversation

@MKowal2 (Collaborator) commented Sep 22, 2025

GPQA Eval

This PR adds the GPQA eval to the suite of evaluations and wires it into the full_parameter_finetune attack. The dataset comprises 448 difficult multiple-choice questions in biology, physics, and chemistry. A 'diamond' subset of 198 high-quality, extra-difficult questions from the main dataset is commonly used as the main benchmark. NOTE: while the authors offer few-shot prompting methods, these perform worse than the zero-shot approach (likely because models get confused between continuation mode and chat mode). Indeed, the original paper (Appendix, Table 6) also shows that models obtain equal or worse performance with few-shot prompting. The default and recommended mode is therefore zero-shot prompting; however, I have left the other prompting types in the repository in case anyone wants to use them.
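For concreteness, a minimal sketch of what a zero-shot prompt for one question might look like (the function name and instruction wording here are illustrative, not the PR's actual template):

```python
# Hypothetical zero-shot prompt builder for a single GPQA question.
# The exact instruction wording in the PR may differ; this only
# illustrates the zero-shot (no few-shot examples) format.
def format_zero_shot(question: str, choices: list) -> str:
    letters = "ABCD"
    lines = [question]
    lines += [f"({letters[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with a single letter in the format (X).")
    return "\n".join(lines)
```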

Testing

Added tests (under tests/evals/test_gpqa.py) covering 50 questions for Llama 3.1 8B. In addition, I manually ran and checked the results for Llama 3.1 8B and Qwen 2.5 32B, and they lined up with the expected results on both the main and diamond question sets.

Todo Notes

  • Added some pyright ignores that we need to fix later
  • The dataset is password protected at this link, but I'm not sure how we want to host it. The authors request that you not store the dataset directly so that web scrapers won't find it. Our pre-commit hook currently checks for large files (the dataset is 2.5 MB), so we could change that and upload the dataset directly, or point to it for manual installation (I didn't want to change the hook myself without checking). Alternatively, they host the dataset on Hugging Face, so we could also implement an automatic download. LMK what you think the best hosting method is.

@MKowal2 MKowal2 added the evaluation Adds or modifies evaluation label Sep 22, 2025
@MKowal2 MKowal2 marked this pull request as ready for review October 7, 2025 16:50
@MKowal2 MKowal2 requested review from sdhossain and tomtseng October 7, 2025 16:50
@sdhossain sdhossain changed the title GPQA Evaluation eval: GPQA Evaluation Oct 7, 2025
@tomtseng (Collaborator) left a comment


looks great! I didn't look in full detail since, as discussed in the project meeting yesterday, this may be superseded by lm-eval

if __name__ == "__main__":
    load_dotenv()  # ensure HF_TOKEN available

    # with tempfile.TemporaryDirectory() as tmpdirname:

Suggested change (delete this commented-out line):
# with tempfile.TemporaryDirectory() as tmpdirname:

r"answer \((.)\)",
r"\((.)\)",
] # note: misses cases where the answer is not in the format of
# the pattern and sometimes grabs the repeated examples instead

what does "repeated examples" mean here? does it mean that the model doesn't follow instructions and instead repeats some of the few-shot examples or the prompt?
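One way to make the extraction less fragile (a sketch, not the PR's actual code): try the strictest pattern first and take the last match, so answer letters echoed from the prompt or few-shot examples earlier in the completion are less likely to be picked up:

```python
import re
from typing import Optional

# Hypothetical patterns, ordered from most to least specific.
ANSWER_PATTERNS = [
    re.compile(r"answer is \(?([ABCD])\)?", re.IGNORECASE),
    re.compile(r"answer \(([ABCD])\)", re.IGNORECASE),
    re.compile(r"\(([ABCD])\)"),
]

def extract_answer(completion: str) -> Optional[str]:
    """Return the extracted answer letter, or None if no pattern matches."""
    for pattern in ANSWER_PATTERNS:
        matches = pattern.findall(completion)
        if matches:
            return matches[-1].upper()  # last occurrence = final answer
    return None
```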

for j, (prompt, input_id, output_ids) in enumerate(
    zip(batch_prompts, input_ids, reasoning_outputs, strict=False)
):
    # Skip question 69

there was some other special question-69 handling earlier above, i wonder if we can remove the question in such a way that we don't have to handle it in several places
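One way to centralize this (a sketch, assuming question 69 is simply malformed): filter known-bad indices once when the examples are loaded, so the downstream inference and length-check code never sees them:

```python
# Hypothetical: indices of known-bad questions, dropped once at load
# time so later code needs no special-casing.
BAD_QUESTION_INDICES = {69}

def filter_examples(examples: list) -> list:
    return [ex for i, ex in enumerate(examples) if i not in BAD_QUESTION_INDICES]
```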

).strip()

inferences["prompt"].append(prompt)
inferences["response"].append(response)

some of this code looks similar to the "zero_shot_chain_of_thought" case; I'm wondering if we can split it into a helper function to avoid repeating code
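A minimal sketch of the kind of helper the comment suggests (names are illustrative, not the PR's code): both branches decode, clean, and record a (prompt, response) pair, so that bookkeeping could live in one place:

```python
# Hypothetical shared helper for the zero-shot-CoT and other prompting
# branches: clean and record one (prompt, response) pair.
def record_inference(inferences: dict, prompt: str, response: str) -> None:
    inferences["prompt"].append(prompt)
    inferences["response"].append(response.strip())
```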

if len(self.examples) > 69:
    example_69 = self.examples[69]
    if any(
        len(choice) > 1000

why 1000? is that because we're assuming a max token length of 1024?

input_ids=final_input_ids,
attention_mask=final_attention_mask,
max_new_tokens=100,
do_sample=False, # Greedy decoding for final answer

why is only the final answer in this case greedy decoding (whereas the first step of the zero-shot process & the non-zero-shot case don't do greedy decoding)?

@tomtseng (Collaborator)

> The dataset is password protected at this link, but not sure how we want to host it.

if it's already on huggingface then downloading from huggingface seems fine!
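If we go the Hugging Face route, the download could look roughly like this (a sketch; it assumes the dataset id Idavidrein/gpqa with per-split config names, and the repo is gated, so an HF_TOKEN must be available):

```python
# Assumed mapping from our split names to Hugging Face config names.
CONFIGS = {"main": "gpqa_main", "diamond": "gpqa_diamond", "extended": "gpqa_extended"}

def config_for_split(split: str) -> str:
    return CONFIGS[split]

def load_gpqa(split: str):
    """Download the requested GPQA split from Hugging Face.

    Requires `pip install datasets` and HF_TOKEN in the environment,
    since the repo is gated to keep it away from web scrapers.
    """
    from datasets import load_dataset  # deferred import: optional dependency
    return load_dataset("Idavidrein/gpqa", config_for_split(split))["train"]
```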

@sdhossain (Collaborator) left a comment


Very nicely done!


name: EvalName = EvalName.GPQA
objective: MetricName = MetricName.GPQA_SCORE
attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE

I believe an attacker would also want to maximize performance on the evaluation

Suggested change:
- attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
+ attacker_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE


def get_dataset_path(self) -> str:
"""Get the path to the GPQA dataset CSV file based on the configured split."""
base_dir = os.path.dirname(__file__)

nit: personal preference, but I prefer pathlib over os; feel free to disregard this comment.
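For reference, the pathlib equivalent might look like this (a sketch; the per-split CSV naming scheme is assumed, and in the real module base_dir would be Path(__file__).parent):

```python
from pathlib import Path

# Hypothetical pathlib version of get_dataset_path. The base directory
# is passed in explicitly here; the filename scheme is an assumption.
def get_dataset_path(base_dir: Path, split: str) -> Path:
    """Get the path to the GPQA dataset CSV file for the given split."""
    return base_dir / f"gpqa_{split}.csv"
```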


Labels

evaluation Adds or modifies evaluation
