Conversation
| "--device", self.eval_config.device, | ||
| "--batch_size", str(self.eval_config.batch_size), | ||
| ] | ||
|
|
There's probably some way in this function to get the output JSON and parse the outputs into a JSON of our own.
pubmedqa_llama3_qlora_2025-10-10T15-53-15.177293.json
Example of an output JSON generated by lm_eval (attached above). We can parse this further if required - @sdhossain
In our format we'd need a bit more information, i.e. what each datapoint was, the metric score it received, etc. For example, for datapoint 1 we'd need the exact input to the model, the model's exact output, and the evaluation system's score.
I've attached an example CSV we have for another evaluation (we store it as a .parquet; see StrongREJECT as an example).
Right now, from what I saw in the JSON, I mainly saw configs and an aggregate score. This would perhaps make it difficult to do further analysis, e.g. if we wanted to check responses using an LLM judge.
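A minimal sketch of the per-datapoint format described above. Column names (`input`, `output`, `score`) and the sample values are illustrative assumptions, not the actual StrongREJECT schema; note that lm_eval's `--log_samples` flag, if the installed version supports it, writes per-sample inputs and outputs that could populate a table like this:

```python
import json

# Hypothetical per-datapoint records: for each datapoint we keep the exact
# model input, the model's exact output, and the evaluation system's score.
records = [
    {
        "input": "Question: ... Context: ... Answer:",  # exact prompt sent to the model
        "output": "yes",                                # model's exact output
        "score": 1.0,                                   # evaluation system's score
    },
    {
        "input": "Question: ... Context: ... Answer:",
        "output": "maybe",
        "score": 0.0,
    },
]

# The aggregate score (like the one in the lm_eval JSON) can be recomputed
# from the per-datapoint scores, but not the other way around.
aggregate = sum(r["score"] for r in records) / len(records)
print(json.dumps({"aggregate_score": aggregate}))
```

Keeping per-datapoint rows makes downstream analysis (e.g. the LLM-judge pass mentioned above) possible, whereas an aggregate-only JSON does not.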
MKowal2 left a comment:
LGTM! Just left a couple nits and questions I had about the implementation. Also - nice find and update to the logging!
```python
import subprocess
from dataclasses import dataclass
from typing import List, Dict, Union, TypeVar
```
nit: as of Python 3.9, you can use `dict` and `list` in type hints rather than importing `Dict` and `List`.
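For illustration, the nit amounts to this (the function name and signature are made up, not from the PR):

```python
# Before Python 3.9 this required `from typing import Dict, List`:
#     def scores_for(tasks: List[str]) -> Dict[str, float]: ...
# As of Python 3.9 (PEP 585), the built-in types are generic themselves,
# so no typing import is needed:
def scores_for(tasks: list[str]) -> dict[str, float]:
    # Toy body so the sketch runs; the real code would compute task scores.
    return {task: 0.0 for task in tasks}

print(scores_for(["hellaswag"]))
```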
```python
eval_config = LMRunEvaluationConfig(
    tasks=["hellaswag"],
    pretrained_path="/model-weights/Qwen2.5-3B",
    tokenizer_path="/model-weights/Qwen2.5-3B",
```
Just curious: when would `pretrained_path` and `tokenizer_path` be different?
```python
if __name__ == "__main__":
    load_dotenv()

    with tempfile.TemporaryDirectory() as tmpdirname:
```
Nit: Do we need the tempfile directory if we are testing print-only? Probably makes sense to keep if we incorporate the logging and test that too though, which seems like a good idea.
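A sketch of what keeping the tempdir for a logging test could look like. `run_with_logging` is a stand-in for the evaluation under test, not a function from this PR:

```python
import logging
import tempfile
from pathlib import Path

def run_with_logging(log_dir: str) -> None:
    # Stand-in for the evaluation under test: it writes a log file into log_dir.
    log_path = Path(log_dir) / "eval.log"
    logging.basicConfig(filename=log_path, level=logging.INFO, force=True)
    logging.info("evaluation finished")
    logging.shutdown()

with tempfile.TemporaryDirectory() as tmpdirname:
    run_with_logging(tmpdirname)
    # The tempdir earns its keep only if the test asserts on the logged output.
    log_text = (Path(tmpdirname) / "eval.log").read_text()
    assert "evaluation finished" in log_text
```

With print-only behavior the directory is indeed unused; asserting on the log file is what makes it worth keeping.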
```python
    Returns DataFrame response=stdout.
    """

name = "LM_EVAL_RUNNER"
```
For the other evals (e.g., strong reject) we set the name, score, and optimization direction like:

```python
from safetunebed.whitebox.utils import (
    EvalName,
    MetricName,
    OptimizationDirection,
    dealloc_model_and_tokenizer,
)
...
name: EvalName = EvalName.StrongReject
objective: MetricName = MetricName.STRONGREJECT_SCORE
attacker_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE
defender_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
```
Not sure if this should be here as well, for naming-convention consistency. Also, there are a bunch of different tasks that could have different optimization directions, although I could imagine they all share the same direction, in which case we could include this here too.
```python
    max_generation_length: int = 0  # required by base


class LMRunEvaluation(WhiteBoxEvaluation[S]):
```
I think you don't need to define `S = TypeVar(...)`; you could just do `class LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig])`. Although it looks like we do this TypeVar thing throughout the code, so maybe cleaning this up should be a separate PR.
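A sketch of the two options side by side. The class names mirror the PR, but the bodies and the `WhiteBoxEvaluation` definition are stand-ins, since the real base class isn't shown here:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

@dataclass
class LMRunEvaluationConfig:
    tasks: tuple = ("hellaswag",)

C = TypeVar("C")

class WhiteBoxEvaluation(Generic[C]):
    # Stand-in base: stores whatever config type the subclass binds.
    def __init__(self, eval_config: C) -> None:
        self.eval_config = eval_config

# Option 1 (current pattern): bind a fresh TypeVar, leaving the subclass generic.
S = TypeVar("S", bound=LMRunEvaluationConfig)

class LMRunEvaluationGeneric(WhiteBoxEvaluation[S]):
    pass

# Option 2 (suggested): parameterize with the concrete config type directly.
class LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig]):
    pass

ev = LMRunEvaluation(LMRunEvaluationConfig())
print(ev.eval_config.tasks)
```

Option 2 is simpler when the subclass only ever uses one config type; the TypeVar form only pays off if further subclasses need to narrow the config again.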
| print(f"\n=== Running task: {task} ===") | ||
| logs.append(self._run_single_task(task)) | ||
|
|
||
| return pl.DataFrame(logs) #Inference schema doesn't really make sense until now for me. |
Remove the inline comment? Or elaborate on what's confusing?

Changes
Summarize the changes in this PR and describe the context or motivation for them.
Add a title, prepending the tag [attack], [defense], [evaluation], or [infra] if appropriate.
Testing
Describe how you tested the changes in this PR. E.g., added tests, or ran command `foo` and checked the results looked good.