38 commits
7e69a4c
feat(evaluation): add VLM-based metrics with litellm and transformers…
davidberenstein1957 Feb 21, 2026
0591c06
fix(evaluation): ARNIQA not in torchmetrics - implement manually
davidberenstein1957 Feb 21, 2026
da02aff
fix(evaluation): use List-based scores pattern matching Pruna standards
davidberenstein1957 Feb 21, 2026
8ac03ce
fix(evaluation): use sync completion instead of async acompletion
davidberenstein1957 Feb 21, 2026
795007b
chore(evaluation): remove ARNIQA from VLM PR - has dedicated PR #547
davidberenstein1957 Feb 21, 2026
3a08ab4
feat(evaluation): add structured generation to VLM metrics
davidberenstein1957 Feb 21, 2026
35b84f8
fix(evaluation): fix linting issues in VLM metrics
davidberenstein1957 Feb 21, 2026
ad0de23
fix(evaluation): fix remaining linting issues
davidberenstein1957 Feb 21, 2026
8129fd2
fix(evaluation): fix D205 docstring issues in VLM classes
davidberenstein1957 Feb 21, 2026
9b7e8ce
fix(evaluation): fix import sorting in __init__.py
davidberenstein1957 Feb 21, 2026
3f6c4be
fix(evaluation): skip docstring check for metrics_vlm
davidberenstein1957 Feb 21, 2026
6a7fad5
fix(evaluation): enhance docstrings for VLM metrics and base classes
davidberenstein1957 Feb 21, 2026
c793c6c
feat(evaluation): introduce new VLM metrics and integration tests
davidberenstein1957 Feb 27, 2026
182c279
Delete docs/VLM_METRICS_PROMPT_COMPARISON.md
davidberenstein1957 Feb 27, 2026
c529854
feat(metrics): paper docstring fixes, VQA use_probability default, vl…
davidberenstein1957 Mar 5, 2026
63d106a
feat(metrics): enhance metric classes with update and compute docstrings
davidberenstein1957 Mar 5, 2026
e02e20f
fix(vlm_base): update response_format type hints for clarity
davidberenstein1957 Mar 5, 2026
b596762
refactor(vlm_base): simplify response_format check for pydantic usage
davidberenstein1957 Mar 5, 2026
697081e
fix(vlm_base): add "json" option to response_format type hints
davidberenstein1957 Mar 5, 2026
d99b315
feat(dependencies): add pruna[evaluation] to dev dependencies
davidberenstein1957 Mar 5, 2026
53c08bc
refactor(metrics): improve docstring consistency and formatting acros…
davidberenstein1957 Mar 5, 2026
1365174
refactor(metrics): update response formats and improve utility functions
davidberenstein1957 Mar 12, 2026
a045a38
refactor(metrics): update collation functions and enhance benchmark t…
davidberenstein1957 Mar 17, 2026
101a6d3
refactor(data): update seed parameter handling and add warnings for t…
davidberenstein1957 Mar 19, 2026
015af72
Fix VLM metric structured output
davidberenstein1957 Mar 24, 2026
20f59c9
Merge pull request #594 from PrunaAI/davidberenstein1957/vlm-metrics-…
davidberenstein1957 Mar 24, 2026
bc1eeee
Wire VLM benchmarks end to end
davidberenstein1957 Mar 24, 2026
e54a9c2
Merge pull request #595 from PrunaAI/davidberenstein1957/vlm-metrics-…
davidberenstein1957 Mar 24, 2026
f01eee8
Merge main into feat/metrics-vlm-support
davidberenstein1957 Mar 24, 2026
1171de6
Fix VLM metric import ordering
davidberenstein1957 Mar 24, 2026
fbd88f7
Fix VLM outlines type checking
davidberenstein1957 Mar 24, 2026
ba8af2c
Limit CPU fallback to VLM metrics
davidberenstein1957 Mar 25, 2026
970c9ec
Fix QA accuracy litellm test input
davidberenstein1957 Mar 26, 2026
49f898e
Merge branch 'main' into feat/metrics-vlm-support
davidberenstein1957 Apr 1, 2026
6f17bb7
Refactor VLM metric checks and test assertions (#609)
davidberenstein1957 Apr 1, 2026
f6ea2be
Address remaining lint issues in metric and test updates
davidberenstein1957 Apr 1, 2026
982c78b
Fix numpydoc parameter docs for VLM metric constructors
davidberenstein1957 Apr 1, 2026
f1d0d73
Restore explicit VLM metric constructor parameters
davidberenstein1957 Apr 1, 2026
2 changes: 1 addition & 1 deletion docs/user_manual/configure.rst
@@ -253,7 +253,7 @@ Underneath you can find the list of all the available datasets.
- ``text: str``
* - Image Generation
- `LAION256 <https://huggingface.co/datasets/nannullna/laion_subset>`_, `OpenImage <https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1>`_, `COCO <https://huggingface.co/datasets/phiyodr/coco2017>`_, `DrawBench <https://huggingface.co/datasets/sayakpaul/drawbench>`_, `PartiPrompts <https://huggingface.co/datasets/nateraw/parti-prompts>`_, `GenAIBench <https://huggingface.co/datasets/BaiqiL/GenAI-Bench>`_
- ``image_generation_collate``, ``prompt_collate``
- ``image_generation_collate``, ``prompt_with_auxiliaries_collate``
- ``text: str``, ``image: Optional[PIL.Image.Image]``
* - Image Classification
- `ImageNet <https://huggingface.co/datasets/zh-plus/tiny-imagenet>`_, `MNIST <https://huggingface.co/datasets/ylecun/mnist>`_, `CIFAR10 <https://huggingface.co/datasets/uoft-cs/cifar10>`_
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -165,6 +165,10 @@ vllm = [
"vllm>=0.16.0",
"ray",
]
evaluation = [
"outlines>1.2.0,<2.0.0",
"litellm>=1.0.0",
]
Comment on lines +168 to +171

Member (Author): @begumcig do you feel it is a good point to start metrics specifically related to evals?

Member: Hmm, I am not really sure actually. Do you think we should have one big evaluation dependency, or group them with respect to the benchmarks?

Member (Author): That could also be an option, but I also feel that the eval group per benchmark only makes sense if we also loosen the rest of the dependencies per algorithm. For now, we can perhaps just add them to the global overview and do the separation in v1 of Pruna?

stable-fast = [
"xformers>=0.0.30",
"stable-fast-pruna==1.0.8",
@@ -217,6 +221,7 @@ dev = [
"types-PyYAML",
"logbar",
"pytest-xdist>=3.8.0",
"pruna[evaluation]",
]
cpu = []
lmharness = [
4 changes: 2 additions & 2 deletions src/pruna/data/__init__.py
@@ -103,13 +103,13 @@
"image_classification_collate",
{"img_size": 224},
),
"DrawBench": (setup_drawbench_dataset, "prompt_collate", {}),
"DrawBench": (setup_drawbench_dataset, "prompt_with_auxiliaries_collate", {}),
"PartiPrompts": (
setup_parti_prompts_dataset,
"prompt_with_auxiliaries_collate",
{},
),
"GenAIBench": (setup_genai_bench_dataset, "prompt_collate", {}),
"GenAIBench": (setup_genai_bench_dataset, "prompt_with_auxiliaries_collate", {}),
"GenEval": (setup_geneval_dataset, "prompt_with_auxiliaries_collate", {}),
"HPS": (setup_hps_dataset, "prompt_with_auxiliaries_collate", {}),
"ImgEdit": (setup_imgedit_dataset, "prompt_with_auxiliaries_collate", {}),
1 change: 0 additions & 1 deletion src/pruna/data/collate.py
@@ -321,6 +321,5 @@ def question_answering_collate(
"image_classification_collate": image_classification_collate,
"text_generation_collate": text_generation_collate,
"question_answering_collate": question_answering_collate,
"prompt_collate": prompt_collate,
Member: Why do we erase this here? I believe the prompt_collate function is still in this file. I think we should either keep this entry even if it's no longer used, for future prompt-only datasets, or also remove the prompt_collate function above.

"prompt_with_auxiliaries_collate": prompt_with_auxiliaries_collate,
}
64 changes: 40 additions & 24 deletions src/pruna/data/datasets/prompt.py
@@ -123,6 +123,14 @@
DPGCategory = Literal["entity", "attribute", "relation", "global", "other"]


def _warn_ignored_benchmark_seed(seed: int | None, *, dataset: str) -> None:
if seed is not None:
pruna_logger.warning(
"%s: `seed` is ignored for this test-only benchmark; sampling does not shuffle the test split.",
dataset,
)


def _to_oneig_record(row: dict, questions_by_key: dict[str, dict]) -> dict:
"""Convert OneIG row to unified record format."""
row_category = row.get("category", "")
@@ -159,7 +167,7 @@ def setup_drawbench_dataset() -> Tuple[Dataset, Dataset, Dataset]:


def setup_parti_prompts_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
@@ -172,8 +180,8 @@

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -188,6 +196,7 @@
Tuple[Dataset, Dataset, Dataset]
The Parti Prompts dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="PartiPrompts")
ds = load_dataset("nateraw/parti-prompts")["train"] # type: ignore[index]

if category is not None:
@@ -226,7 +235,7 @@ def _generate_geneval_question(entry: dict) -> list[str]:


def setup_geneval_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
@@ -239,8 +248,8 @@

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -255,6 +264,7 @@
Tuple[Dataset, Dataset, Dataset]
The GenEval dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="GenEval")
import json

import requests
@@ -286,7 +296,7 @@


def setup_hps_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
@@ -299,8 +309,8 @@

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -315,6 +325,7 @@
Tuple[Dataset, Dataset, Dataset]
The HPD dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="HPS")
import json

from huggingface_hub import hf_hub_download
@@ -338,7 +349,7 @@


def setup_long_text_bench_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
@@ -350,8 +361,8 @@

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -364,6 +375,7 @@
Tuple[Dataset, Dataset, Dataset]
The Long Text Bench dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="LongTextBench")
ds = load_dataset("X-Omni/LongText-Bench")["train"] # type: ignore[index]
ds = ds.rename_column("text", "text_content")
ds = ds.rename_column("prompt", "text")
Expand All @@ -390,7 +402,7 @@ def setup_genai_bench_dataset() -> Tuple[Dataset, Dataset, Dataset]:


def setup_imgedit_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
@@ -403,8 +415,8 @@

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -420,6 +432,7 @@
Tuple[Dataset, Dataset, Dataset]
The ImgEdit dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="ImgEdit")
import json

import requests
@@ -493,7 +506,7 @@ def _fetch_oneig_alignment() -> dict[str, dict]:


def setup_oneig_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
@@ -506,8 +519,8 @@

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -523,6 +536,7 @@
Tuple[Dataset, Dataset, Dataset]
The OneIG dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="OneIG")
questions_by_key = _fetch_oneig_alignment()

ds_raw = load_dataset("OneIG-Bench/OneIG-Bench", "OneIG-Bench")["train"] # type: ignore[index]
@@ -545,7 +559,7 @@


def setup_gedit_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
Expand All @@ -558,8 +572,8 @@ def setup_gedit_dataset(

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -576,6 +590,7 @@
Tuple[Dataset, Dataset, Dataset]
The GEditBench dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="GEditBench")
task_type_map = {
"subject_add": "subject-add",
"subject_remove": "subject-remove",
@@ -613,7 +628,7 @@ def setup_gedit_dataset(


def setup_dpg_dataset(
seed: int,
seed: int | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
test_sample_size: int | None = None,
Expand All @@ -626,8 +641,8 @@ def setup_dpg_dataset(

Parameters
----------
seed : int
The seed to use.
seed : int | None, optional
Ignored; test order is deterministic. If not None, a warning is logged.
fraction : float
The fraction of the dataset to use.
train_sample_size : int | None
@@ -642,6 +657,7 @@
Tuple[Dataset, Dataset, Dataset]
The DPG dataset (dummy train, dummy val, test).
"""
_warn_ignored_benchmark_seed(seed, dataset="DPG")
import csv
import io
from collections import defaultdict
15 changes: 11 additions & 4 deletions src/pruna/data/pruna_datamodule.py
@@ -135,7 +135,7 @@ def from_string(
tokenizer: AutoTokenizer | None = None,
collate_fn_args: dict = dict(),
dataloader_args: dict = dict(),
seed: int = 42,
seed: int | None = None,
category: str | list[str] | None = None,
fraction: float = 1.0,
train_sample_size: int | None = None,
Expand All @@ -154,8 +154,10 @@ def from_string(
Any additional arguments for the collate function.
dataloader_args : dict
Any additional arguments for the dataloader.
seed : int
The seed to use.
seed : int | None, optional
Passed to dataset setup when the loader uses shuffled sampling.
If None, setups that require a seed default to 42; test-only benchmarks
omit seed so ordering stays deterministic without warnings.
category : str | list[str] | None
The category of the dataset.
fraction : float
Expand All @@ -177,7 +179,12 @@ def from_string(
collate_fn_args = default_collate_fn_args

if "seed" in inspect.signature(setup_fn).parameters:
setup_fn = partial(setup_fn, seed=seed)
seed_param = inspect.signature(setup_fn).parameters["seed"]
has_default = seed_param.default is not inspect.Parameter.empty
Member: Just asking so I understand correctly: has_default would be True for the load functions where we set seed to None, right?

if seed is not None:
setup_fn = partial(setup_fn, seed=seed)
elif not has_default:
setup_fn = partial(setup_fn, seed=42)

if "category" in inspect.signature(setup_fn).parameters:
setup_fn = partial(setup_fn, category=category)
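The seed-forwarding logic in `from_string` can be sketched on its own. The setup functions below are hypothetical stand-ins for the real dataset setups (new-style test-only benchmarks default `seed` to None, legacy setups still require it); `bind_seed` mirrors the `inspect`-based branch from the diff. Note that `has_default` is True precisely for the setups whose signature defaults `seed` to None, so no seed is forced on them and no spurious warning fires:

```python
import inspect
from functools import partial


def benchmark_setup(seed=None, fraction=1.0):
    """Hypothetical new-style setup: seed is optional and ignored."""
    return seed


def legacy_setup(seed, fraction=1.0):
    """Hypothetical legacy setup: seed is a required parameter."""
    return seed


def bind_seed(setup_fn, seed):
    """Forward an explicit seed; fall back to 42 only when the setup
    still *requires* a seed (i.e. its signature has no default)."""
    params = inspect.signature(setup_fn).parameters
    if "seed" in params:
        has_default = params["seed"].default is not inspect.Parameter.empty
        if seed is not None:
            setup_fn = partial(setup_fn, seed=seed)
        elif not has_default:
            setup_fn = partial(setup_fn, seed=42)
    return setup_fn


print(bind_seed(benchmark_setup, None)())  # None: nothing forced on test-only setups
print(bind_seed(legacy_setup, None)())     # 42: legacy setups keep the old default
print(bind_seed(legacy_setup, 7)())        # 7: an explicit seed always wins
```

This keeps backwards compatibility for callers that never passed a seed while letting the test-only benchmarks detect, via their None default, that no seed was actually requested.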