Conversation
```python
if __name__ == "__main__":
    load_dotenv()  # ensure HF_TOKEN available
```
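For context on the `load_dotenv()` call above: a minimal, hypothetical sketch (not the PR's code) of the situation it guards against, since generation here needs a Hugging Face token in the environment:

```python
import os

# Hypothetical helper illustrating why load_dotenv() runs first:
# downloading gated models requires HF_TOKEN in the environment.
def require_hf_token() -> str:
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; call load_dotenv() first")
    return token
```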
```python
# with tempfile.TemporaryDirectory() as tmpdirname:
```
```python
    r"answer \((.)\)",
    r"\((.)\)",
]  # note: misses cases where the answer is not in the format of
# the pattern and sometimes grabs the repeated examples instead
```
What does "repeated examples" mean here? Does it mean the model doesn't follow instructions and instead repeats some of the few-shot examples or the prompt?
```python
for j, (prompt, input_id, output_ids) in enumerate(
    zip(batch_prompts, input_ids, reasoning_outputs, strict=False)
):
    # Skip question 69
```
There was some other special question-69 handling earlier above; I wonder if we can remove the question in a way that avoids handling it in several places.
```python
).strip()

inferences["prompt"].append(prompt)
inferences["response"].append(response)
```
Some of this code looks similar to the "zero_shot_chain_of_thought" case; I'm wondering if we can split it into a helper function to avoid repeating code.
```python
if len(self.examples) > 69:
    example_69 = self.examples[69]
    if any(
        len(choice) > 1000
```
Why 1000? Is that because we're assuming a max token length of 1024?
```python
input_ids=final_input_ids,
attention_mask=final_attention_mask,
max_new_tokens=100,
do_sample=False,  # Greedy decoding for final answer
```
Why is only the final answer in this case decoded greedily (whereas the first step of the zero-shot process and the non-zero-shot case don't use greedy decoding)?
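For readers less familiar with the distinction being asked about: `do_sample=False` makes transformers' `generate()` take the argmax token at every step (deterministic), while sampling draws from the predicted distribution. A toy single-step illustration, not the PR's code:

```python
import random

# Toy "decoder" over a next-token distribution, illustrating what
# do_sample toggles: greedy is a deterministic argmax; sampling is not.
def pick_token(probs: dict[str, float], do_sample: bool, rng=None) -> str:
    if not do_sample:
        # Greedy decoding: always the most likely token, e.g. for a
        # short final answer where determinism matters.
        return max(probs, key=probs.get)
    rng = rng or random.Random()
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

next_token_probs = {"A": 0.1, "B": 0.7, "C": 0.2}
pick_token(next_token_probs, do_sample=False)  # always "B"
```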
If it's already on Hugging Face then downloading from Hugging Face seems fine!
```python
name: EvalName = EvalName.GPQA
objective: MetricName = MetricName.GPQA_SCORE
attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
```
I believe an attacker would also want to maximize performance on the evaluation.
```diff
- attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
+ attacker_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE
```
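A minimal sketch of what flipping this direction means in practice (the enum and helper below are illustrative stand-ins; the repo's actual `OptimizationDirection` definition may differ):

```python
from enum import Enum

# Illustrative stand-in for the repo's OptimizationDirection.
class OptimizationDirection(Enum):
    MINIMIZE = "minimize"
    MAXIMIZE = "maximize"

def is_improvement(direction: OptimizationDirection, old: float, new: float) -> bool:
    # An attacker set to MAXIMIZE treats a higher eval score as progress;
    # MINIMIZE treats a lower score as progress.
    if direction is OptimizationDirection.MAXIMIZE:
        return new > old
    return new < old
```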
```python
def get_dataset_path(self) -> str:
    """Get the path to the GPQA dataset CSV file based on the configured split."""
    base_dir = os.path.dirname(__file__)
```
Nit: personal preference, but I prefer pathlib to os; feel free to disregard this comment.
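For concreteness, a pathlib version of the snippet above could look like this (the CSV filename pattern is hypothetical; only the `base_dir` computation corresponds to the diff):

```python
from pathlib import Path

# pathlib equivalent of os.path.dirname(__file__); the filename scheme
# below is illustrative, not the repo's actual layout.
def get_dataset_path(split: str) -> str:
    base_dir = Path(__file__).resolve().parent
    return str(base_dir / f"gpqa_{split}.csv")
```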
GPQA Eval
This PR adds the GPQA eval to the suite of evaluations and adds it to the full_parameter_finetune attack. The dataset comprises 448 difficult multiple-choice questions in biology, physics, and chemistry. A 'diamond' subset of 198 high-quality, extra-difficult questions from the main dataset is commonly used as the main benchmark. NOTE: While few-shot prompting methods are offered, they perform worse than the zero-shot approach (likely because models are confused by being in continuation vs. chat mode). Indeed, the original paper (appendix table 6) also shows models obtain equal or worse performance when using few-shot prompting. The default and recommended mode is therefore zero-shot prompting; however, I have left the other prompting types in the repository in case anyone wants to use them.
Testing
Added tests (under tests/evals/test_gpqa.py) for 50 questions for Llama 3.1 8B. In addition, I manually ran and checked the results for Llama 3.1-8B and Qwen 2.5 32B, and they lined up with the expected results on the main and diamond question sets.
Todo Notes