feat(evaluation): add VLMMetrics #545
```diff
@@ -321,6 +321,5 @@ def question_answering_collate(
     "image_classification_collate": image_classification_collate,
     "text_generation_collate": text_generation_collate,
     "question_answering_collate": question_answering_collate,
-    "prompt_collate": prompt_collate,
     "prompt_with_auxiliaries_collate": prompt_with_auxiliaries_collate,
 }
```

**Member:** Why do we erase this here? I believe the prompt collate is still in this file. I think we should either keep this even if it's no longer used, for future prompt-only datasets, or also remove the `prompt_collate` function above.
```diff
@@ -135,7 +135,7 @@ def from_string(
     tokenizer: AutoTokenizer | None = None,
     collate_fn_args: dict = dict(),
     dataloader_args: dict = dict(),
-    seed: int = 42,
+    seed: int | None = None,
     category: str | list[str] | None = None,
     fraction: float = 1.0,
     train_sample_size: int | None = None,
```

```diff
@@ -154,8 +154,10 @@ def from_string(
         Any additional arguments for the collate function.
     dataloader_args : dict
         Any additional arguments for the dataloader.
-    seed : int
-        The seed to use.
+    seed : int | None, optional
+        Passed to dataset setup when the loader uses shuffled sampling.
+        If None, setups that require a seed default to 42; test-only benchmarks
+        omit seed so ordering stays deterministic without warnings.
     category : str | list[str] | None
         The category of the dataset.
     fraction : float
```

```diff
@@ -177,7 +179,12 @@ def from_string(
         collate_fn_args = default_collate_fn_args

     if "seed" in inspect.signature(setup_fn).parameters:
-        setup_fn = partial(setup_fn, seed=seed)
+        seed_param = inspect.signature(setup_fn).parameters["seed"]
+        has_default = seed_param.default is not inspect.Parameter.empty
+        if seed is not None:
+            setup_fn = partial(setup_fn, seed=seed)
+        elif not has_default:
+            setup_fn = partial(setup_fn, seed=42)

     if "category" in inspect.signature(setup_fn).parameters:
         setup_fn = partial(setup_fn, category=category)
```

**Member:** Just asking so I understood correctly,
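The seed-binding logic in the last hunk can be sketched as a standalone helper. Everything below is illustrative (the helper name `bind_seed` and the sample setup functions are invented for this sketch); the `inspect`/`partial` behavior mirrors the diff: an explicit seed always wins, and the hard-coded 42 is only applied when the setup function demands a seed but the caller passed None.

```python
import inspect
from functools import partial


def bind_seed(setup_fn, seed=None):
    """Bind `seed` onto setup_fn only when it accepts a seed parameter."""
    params = inspect.signature(setup_fn).parameters
    if "seed" not in params:
        return setup_fn  # setup takes no seed; leave it untouched

    has_default = params["seed"].default is not inspect.Parameter.empty
    if seed is not None:
        # Caller chose a seed explicitly: always forward it.
        return partial(setup_fn, seed=seed)
    if not has_default:
        # Setup requires a seed but none was given: fall back to 42.
        return partial(setup_fn, seed=42)
    # Setup has its own default and caller passed None: keep the default.
    return setup_fn


# Sample setup functions covering the three cases above.
def setup_requires_seed(seed):
    return seed


def setup_with_default(seed=7):
    return seed


def setup_without_seed():
    return "no-seed"
```

The practical effect is that test-only benchmarks, whose setup defines its own default (or no seed at all), no longer receive an unused `seed=42` keyword, while shuffled setups still get a deterministic fallback.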
||
**Member:** @begumcig do you feel it is a good point to start metrics specifically related to evals?

**Member:** Hmm, I am not really sure actually. Do you think we should have one big evaluation dependency, or group them with respect to the benchmarks?

**Member:** That could also be an option, but I also feel that the eval group per benchmark only makes sense if we also loosen the rest of the dependencies per algorithm. For now, we can perhaps just add them to the global overview and do the separation in v1 of Pruna?