
eval: GPQA Evaluation #30

Open
MKowal2 wants to merge 2 commits into main from mk/gpqa

Conversation

@MKowal2 (Collaborator) commented Sep 22, 2025

GPQA Eval

This PR adds the GPQA eval to the suite of evaluations and wires it into the full_parameter_finetune attack. The dataset comprises 448 difficult multiple-choice questions in biology, physics, and chemistry. A 'diamond' subset of 198 high-quality, extra-difficult questions from the main dataset is commonly used as the main benchmark. NOTE: while the authors offer few-shot prompting methods, these perform worse than the zero-shot approach (likely because models get confused between continuation mode and chat mode). Indeed, the original paper (Appendix, Table 6) also shows that models obtain equal or worse performance with few-shot prompting. The default and recommended mode is therefore zero-shot prompting; however, I have left the other prompting types in the repository in case anyone wants to use them.
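For concreteness, a minimal sketch of what a zero-shot prompt for one question might look like (the function name and instruction wording here are illustrative, not the PR's actual template):

```python
# Hypothetical zero-shot prompt builder for a single GPQA question.
# The exact instruction wording in the PR may differ; this only
# illustrates the zero-shot (no few-shot examples) format.
def format_zero_shot(question: str, choices: list) -> str:
    letters = "ABCD"
    lines = [question]
    lines += [f"({letters[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with a single letter in the format (X).")
    return "\n".join(lines)
```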

Testing

Added tests (under tests/evals/test_gpqa.py) covering 50 questions for Llama 3.1 8B. In addition, I manually ran and checked the results for Llama 3.1 8B and Qwen 2.5 32B, and they lined up with the expected results on both the main and diamond question sets.

Todo Notes

  • Added some pyright ignores that we need to fix later
  • The dataset is password protected at this link, but I'm not sure how we want to host it. The authors request that you not store the dataset directly so that web scrapers won't find it. Our pre-commit hook currently checks for large files (the dataset is 2.5 MB), so we could change that and upload the dataset directly, or point to it for manual installation (I didn't want to change the hook myself without checking). Alternatively, they host the dataset on Hugging Face, so we could also implement an automatic download. LMK what you think the best hosting method is.

@MKowal2 MKowal2 added the evaluation Adds or modifies evaluation label Sep 22, 2025
@MKowal2 MKowal2 marked this pull request as ready for review October 7, 2025 16:50
@MKowal2 MKowal2 requested review from sdhossain and tomtseng October 7, 2025 16:50
@sdhossain sdhossain changed the title GPQA Evaluation eval: GPQA Evaluation Oct 7, 2025
@tomtseng (Collaborator) left a comment


looks great! I didn't look in full detail since, as discussed in the project meeting yesterday, this may be superseded by lm-eval

if __name__ == "__main__":
    load_dotenv()  # ensure HF_TOKEN available

    # with tempfile.TemporaryDirectory() as tmpdirname:

Suggested change (delete this commented-out line):
# with tempfile.TemporaryDirectory() as tmpdirname:

r"answer \((.)\)",
r"\((.)\)",
] # note: misses cases where the answer is not in the format of
# the pattern and sometimes grabs the repeated examples instead

what does "repeated examples" mean here? does it mean that the model doesn't follow instructions and instead repeats some of the few-shot examples or the prompt?
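One way to make the extraction less fragile (a sketch, not the PR's actual code): try the strictest pattern first and take the last match, so answer letters echoed from the prompt or few-shot examples earlier in the completion are less likely to be picked up:

```python
import re
from typing import Optional

# Hypothetical patterns, ordered from most to least specific.
ANSWER_PATTERNS = [
    re.compile(r"answer is \(?([ABCD])\)?", re.IGNORECASE),
    re.compile(r"answer \(([ABCD])\)", re.IGNORECASE),
    re.compile(r"\(([ABCD])\)"),
]

def extract_answer(completion: str) -> Optional[str]:
    """Return the extracted answer letter, or None if no pattern matches."""
    for pattern in ANSWER_PATTERNS:
        matches = pattern.findall(completion)
        if matches:
            return matches[-1].upper()  # last occurrence = final answer
    return None
```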

for j, (prompt, input_id, output_ids) in enumerate(
    zip(batch_prompts, input_ids, reasoning_outputs, strict=False)
):
    # Skip question 69

there was some other special question-69 handling earlier above, i wonder if we can remove the question in such a way that we don't have to handle it in several places
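One way to centralize this (a sketch, assuming question 69 is simply malformed): filter known-bad indices once when the examples are loaded, so the downstream inference and length-check code never sees them:

```python
# Hypothetical: indices of known-bad questions, dropped once at load
# time so later code needs no special-casing.
BAD_QUESTION_INDICES = {69}

def filter_examples(examples: list) -> list:
    return [ex for i, ex in enumerate(examples) if i not in BAD_QUESTION_INDICES]
```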

).strip()

inferences["prompt"].append(prompt)
inferences["response"].append(response)

some of this code looks similar to the "zero_shot_chain_of_thought" case; I'm wondering if we can split it into a helper function to avoid repeating code
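A minimal sketch of the kind of helper the comment suggests (names are illustrative, not the PR's code): both branches decode, clean, and record a (prompt, response) pair, so that bookkeeping could live in one place:

```python
# Hypothetical shared helper for the zero-shot-CoT and other prompting
# branches: clean and record one (prompt, response) pair.
def record_inference(inferences: dict, prompt: str, response: str) -> None:
    inferences["prompt"].append(prompt)
    inferences["response"].append(response.strip())
```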

if len(self.examples) > 69:
    example_69 = self.examples[69]
    if any(
        len(choice) > 1000

why 1000? is that because we're assuming a max token length of 1024?

input_ids=final_input_ids,
attention_mask=final_attention_mask,
max_new_tokens=100,
do_sample=False, # Greedy decoding for final answer

why is only the final answer in this case greedy decoding (whereas the first step of the zero-shot process & the non-zero-shot case don't do greedy decoding)?

@tomtseng (Collaborator)

> The dataset is password protected at this link, but not sure how we want to host it.

if it's already on huggingface then downloading from huggingface seems fine!
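If we go the Hugging Face route, the download could look roughly like this (a sketch; it assumes the dataset id Idavidrein/gpqa with per-split config names, and the repo is gated, so an HF_TOKEN must be available):

```python
# Assumed mapping from our split names to Hugging Face config names.
CONFIGS = {"main": "gpqa_main", "diamond": "gpqa_diamond", "extended": "gpqa_extended"}

def config_for_split(split: str) -> str:
    return CONFIGS[split]

def load_gpqa(split: str):
    """Download the requested GPQA split from Hugging Face.

    Requires `pip install datasets` and HF_TOKEN in the environment,
    since the repo is gated to keep it away from web scrapers.
    """
    from datasets import load_dataset  # deferred import: optional dependency
    return load_dataset("Idavidrein/gpqa", config_for_split(split))["train"]
```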

@sdhossain (Collaborator) left a comment


Very nicely done!


name: EvalName = EvalName.GPQA
objective: MetricName = MetricName.GPQA_SCORE
attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE

I believe an attacker would also want to maximize performance on the evaluation

Suggested change:
- attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
+ attacker_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE


def get_dataset_path(self) -> str:
"""Get the path to the GPQA dataset CSV file based on the configured split."""
base_dir = os.path.dirname(__file__)

nit: personal preference, but I prefer pathlib over os; feel free to disregard this comment.
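For reference, the pathlib equivalent might look like this (a sketch; the per-split CSV naming scheme is assumed, and in the real module base_dir would be Path(__file__).parent):

```python
from pathlib import Path

# Hypothetical pathlib version of get_dataset_path. The base directory
# is passed in explicitly here; the filename scheme is an assumption.
def get_dataset_path(base_dir: Path, split: str) -> Path:
    """Get the path to the GPQA dataset CSV file for the given split."""
    return base_dir / f"gpqa_{split}.csv"
```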


Labels

evaluation Adds or modifies evaluation
