
eval: added lm-eval #36

Open
psyonp wants to merge 6 commits into main from psyonp/lm-eval

Conversation


@psyonp psyonp commented Oct 14, 2025

Changes

Summarize the changes in this PR and describe the context or motivation for
them.

Add a title, prepending the tag [attack], [defense], [evaluation], or [infra] if
appropriate.

Testing

Describe how you tested the changes in this PR. E.g., added tests, or ran
command foo and checked the results looked good.

@psyonp psyonp added the evaluation Adds or modifies evaluation label Oct 14, 2025
"--device", self.eval_config.device,
"--batch_size", str(self.eval_config.batch_size),
]

Collaborator

There's probably some way in this function to capture the output JSON and parse the results.

Collaborator

[Attached screenshot: Screen Shot 2025-10-21 at 11 51 50 PM]

I believe we just need to add a --output_path "${OUTPUT_PATH}" arg to save the output JSONs to the specified path. lm_eval already handles saving and caching results.
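A minimal sketch of this suggestion, assuming the runner builds its CLI arguments as a list like the hunk quoted above; the `output_path` parameter and the `build_args` helper are hypothetical, not part of the PR:

```python
# Hypothetical sketch: extend the lm_eval argument list with --output_path so
# result JSONs land in a caller-specified directory. Field names are assumed.
def build_args(device: str, batch_size: int, output_path: str) -> list[str]:
    return [
        "--device", device,
        "--batch_size", str(batch_size),
        "--output_path", output_path,  # lm_eval writes its result JSONs here
    ]
```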

Collaborator

pubmedqa_llama3_qlora_2025-10-10T15-53-15.177293.json

Example of an output JSON generated by lm_eval (attached above). We can parse this further if required - @sdhossain

Collaborator Author

Done

Collaborator

@sdhossain sdhossain Oct 27, 2025

In our format we'd need a bit more information - i.e. what each datapoint was, the metric score it received, etc.

So, for datapoint 1, for example:

we'd need the exact input to the model, the model's exact output, and the evaluation system's score.

I've attached an example csv we have for another evaluation (we store it as a .parquet - see StrongREJECT as an example)

evaluator_scores.csv

Right now, from what I saw in the JSON, I mainly saw configs and an aggregate score. That would make it difficult to do further analysis, e.g. if we wanted to check responses using an LLM judge.
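lm-eval-harness has a `--log_samples` flag that writes per-sample records alongside the aggregate results; a hedged sketch of flattening such records into the input/output/score rows described above (the field names `doc`, `resps`, and `acc` are assumptions about the sample schema, not verified against this PR):

```python
# Hedged sketch: flatten per-sample lm_eval records (as written by
# --log_samples) into rows suitable for a CSV/parquet like evaluator_scores.
def samples_to_rows(samples: list[dict]) -> list[dict]:
    """One row per datapoint: exact input, exact output, evaluator score."""
    return [
        {
            "input": s.get("doc", {}).get("query", ""),
            "output": s.get("resps", [[""]])[0][0],
            "score": s.get("acc"),
        }
        for s in samples
    ]

# Illustrative record, not real lm_eval output:
example = [{"doc": {"query": "Q1"}, "resps": [["A1"]], "acc": 1.0}]
rows = samples_to_rows(example)
```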

@MKowal2 MKowal2 self-requested a review October 23, 2025 18:41
Collaborator

@MKowal2 MKowal2 left a comment

LGTM! Just left a couple nits and questions I had about the implementation. Also - nice find and update to the logging!


import subprocess
from dataclasses import dataclass
from typing import List, Dict, Union, TypeVar
Collaborator

nit: as of Python 3.9, can use dict and list in type hints rather than importing Dict and List
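A minimal illustration of the builtin-generic style the nit refers to (PEP 585, Python 3.9+); the function name is hypothetical:

```python
# Since Python 3.9, builtin containers are generic, so typing.Dict and
# typing.List imports are unnecessary for annotations like these:
def count_tasks(results: dict[str, list[float]]) -> dict[str, int]:
    return {task: len(scores) for task, scores in results.items()}
```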

eval_config = LMRunEvaluationConfig(
tasks=["hellaswag"],
pretrained_path="/model-weights/Qwen2.5-3B",
tokenizer_path="/model-weights/Qwen2.5-3B",
Collaborator

Just wondering because I am curious: when would the pretrained_path and tokenizer_path be different?

if __name__ == "__main__":
load_dotenv()

with tempfile.TemporaryDirectory() as tmpdirname:
Collaborator

Nit: Do we need the tempfile directory if we are testing print-only? Probably makes sense to keep if we incorporate the logging and test that too though, which seems like a good idea.

Returns a DataFrame with response=stdout.
"""

name = "LM_EVAL_RUNNER"
Collaborator

For the other evals (e.g., strong reject) we set the name, score, and optimization direction like:

from safetunebed.whitebox.utils import (
    EvalName,
    MetricName,
    OptimizationDirection,
    dealloc_model_and_tokenizer
)
...
    name: EvalName = EvalName.StrongReject
    objective: MetricName = MetricName.STRONGREJECT_SCORE
    attacker_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE
    defender_direction: OptimizationDirection = OptimizationDirection.MINIMIZE

Not sure if this should be here as well, in terms of naming-convention consistency. Also, there are a bunch of different tasks that could have different optimization directions, although I could also imagine they all share the same direction, in which case we could include this here as well.
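One way to sketch the per-task variant of this idea, if directions do differ across lm-eval tasks; the enum, task names, and default are illustrative, not from the codebase:

```python
from enum import Enum

# Hypothetical stand-in for safetunebed's OptimizationDirection:
class OptimizationDirection(Enum):
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"

# Illustrative per-task mapping: accuracy-style metrics are maximized,
# perplexity-style metrics are minimized.
TASK_DIRECTIONS: dict[str, OptimizationDirection] = {
    "hellaswag": OptimizationDirection.MAXIMIZE,
    "wikitext": OptimizationDirection.MINIMIZE,
}

def direction_for(task: str) -> OptimizationDirection:
    # Assumed default: most lm-eval metrics are accuracies, so MAXIMIZE.
    return TASK_DIRECTIONS.get(task, OptimizationDirection.MAXIMIZE)
```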

max_generation_length: int = 0 # required by base


class LMRunEvaluation(WhiteBoxEvaluation[S]):
Collaborator

I think you don't need to define S = TypeVar(...) - you could just do LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig]). Although it looks like we do this TypeVar thing throughout the code, so maybe cleaning this up should be a separate PR.
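A minimal sketch of the direct parametrization the comment suggests; the base classes here are stand-ins for illustration, not the real safetunebed types:

```python
from typing import Generic, TypeVar

C = TypeVar("C")

class WhiteBoxEvaluation(Generic[C]):
    """Stand-in for the real generic base class."""

class LMRunEvaluationConfig:
    """Stand-in for the real config dataclass."""

# Suggested style: parametrize the base with the concrete config type
# directly, instead of threading an S = TypeVar(...) through the subclass.
class LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig]):
    pass
```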

print(f"\n=== Running task: {task} ===")
logs.append(self._run_single_task(task))

return pl.DataFrame(logs) #Inference schema doesn't really make sense until now for me.
Collaborator

Remove the comment, or elaborate on what's confusing?

@tomtseng tomtseng mentioned this pull request Jan 16, 2026

Labels

evaluation Adds or modifies evaluation


5 participants