eval kit support v2 #1295

Open
Jorjeous wants to merge 2 commits into `main` from `eval-kit-support-v2`
Conversation

Member

@Jorjeous Jorjeous commented Mar 7, 2026

Added the ability to run Eval Kit natively in NeMo Skills: Eval Kit benchmarks can be run in NeMo Skills via their Eval Kit path, and NeMo Skills benchmarks can be run through the Eval Kit path.

Summary by CodeRabbit

  • New Features

    • VLMEvalKit integration added with two inference modes: in-process (mcore) and vLLM client; supports self-contained in-process evaluation workflows and merged multi-GPU output.
    • Support for consuming pre-computed VLMEvalKit metrics so aggregated results can be imported and reported.
  • Documentation

    • New user guide and index entry describing setup, modes, benchmarks, sample commands, config snippets, results interpretation, and troubleshooting.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Contributor

coderabbitai bot commented Mar 7, 2026

📝 Walkthrough

Integrates VLMEvalKit into NeMo Skills: adds docs, dataset scaffolding, EvalKitGenerationTask (mcore and vllm), Megatron in-process task, EvalKitMetrics, pipeline and utility wiring, and packaging/venv/runtime adjustments to run VLMEvalKit benchmarks and consume precomputed metrics.

Changes

- **Documentation** (`docs/evaluation/eval-kit.md`, `docs/evaluation/index.md`): Adds VLMEvalKit documentation and an index entry describing modes, prerequisites, commands, cluster config, outputs, and troubleshooting.
- **Dataset integration** (`nemo_skills/dataset/eval_kit/__init__.py`, `nemo_skills/dataset/utils.py`): Adds eval_kit dataset module constants and `get_extra_generation_args()`; updates the dataset resolver to special-case `eval_kit.*`.
- **Generation tasks (vLLM / EvalKit)** (`nemo_skills/inference/eval/eval_kit.py`): New `EvalKitGenerationTask` and `EvalKitConfig`: supports mcore (in-process) and vllm modes, dataset build, async JSONL writer, inference and evaluation phases, output/metrics consolidation, and a Hydra entrypoint.
- **Generation tasks (Megatron/MCore)** (`nemo_skills/inference/mcore_skills.py`): New `MegatronMCoreConfig` and `MegatronMCoreGenerationTask`: in-process MultiModalMCore integration, prompt handling, sharded per-rank IO, generation and WER evaluation, metrics file and `.done` marker; config store registration and Hydra main.
- **Generation framework hooks** (`nemo_skills/inference/generate.py`, `nemo_skills/inference/factory.py`): Adds declarative task attributes/methods (`CONTAINER_KEY`, `USE_TORCHRUN`, `is_self_contained`, `get_env_prefix`, `get_extra_package_dirs`) and a new GenerationType `mcore_skills` mapping.
- **Pipeline wiring & utils** (`nemo_skills/pipeline/eval.py`, `nemo_skills/pipeline/utils/eval.py`, `nemo_skills/pipeline/utils/generation.py`): Adds `_apply_task_overrides` for env/torchrun/container selection, resolves `GENERATION_TASK_CLASS`, extends `BenchmarkArgs` (`self_contained_task`, `num_gpus`, `generation_task_class`), adjusts batching/extra args for self-contained tasks, conditional venv path using uv, and optional `input_file` handling.
- **Metrics and evaluator** (`nemo_skills/evaluation/metrics/eval_kit_metrics.py`, `nemo_skills/evaluation/metrics/map_metrics.py`, `nemo_skills/evaluation/metrics/translation_metrics.py`, `nemo_skills/evaluation/evaluator/audio.py`): Adds `EvalKitMetrics` to load precomputed `eval_kit_metrics.json` and register it; lazy-import tweak for `corpus_bleu`; audio evaluator extended for ST-* translation normalization, ASR/ASR_LEADERBOARD per-field WER, MathQA support, and robust missing-generation handling.
- **Runtime requirements placeholder** (`requirements/eval-kit.txt`): Adds a placeholder requirements file documenting that VLMEvalKit is installed at job start via installation commands (no in-venv deps specified).

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Pipeline as NeMo Skills Pipeline
    participant Task as EvalKitGenerationTask
    participant VLMDataset as VLMEvalKit (dataset build)
    participant Model as MultiModalMCore / VLLMLocal
    participant AsyncIO as Async JSONL Writer
    participant Evaluator as VLMEvalKit Evaluator
    participant Output as Result Files

    Pipeline->>Task: initialize(cfg)
    Task->>VLMDataset: build_dataset(vlm_dataset)
    VLMDataset-->>Task: dataset

    Task->>Task: setup_work_directories()
    Task->>AsyncIO: start_writer()

    loop for each sample
        Task->>Model: generate(sample)
        Model-->>Task: generation
        Task->>AsyncIO: write_result(jsonl)
    end

    Task->>Evaluator: evaluate_results()
    Evaluator-->>Task: eval_kit_metrics.json
    Task->>Output: write_metrics_and_done()
    Output-->>Pipeline: results_complete
```
```mermaid
sequenceDiagram
    participant Pipeline as NeMo Skills Pipeline
    participant Task as MegatronMCoreGenerationTask
    participant Prompt as PromptFiller
    participant Data as DataLoader
    participant MCore as MultiModalMCore
    participant IO as Per-rank JSONL IO
    participant Eval as ASR/WER Evaluator

    Pipeline->>Task: initialize(cfg)
    Task->>MCore: _make_mcore_model(config)
    Task->>Data: load_and_shard_data()
    opt prompt_config
        Task->>Prompt: build_prompt()
        Prompt-->>Task: prompt_template
    end

    loop each sample (per rank)
        Task->>MCore: _generate_for_sample(messages/prompt)
        MCore-->>Task: raw_output
        Task->>IO: write_rank_output()
    end

    Task->>Task: merge_rank_outputs()
    Task->>Eval: _evaluate_results()
    Eval-->>Task: wer_metrics
    Task->>IO: write_eval_kit_metrics.json + .done
    IO-->>Pipeline: completed_with_metrics
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

  • PR #1239: Directly related—implements the same VLMEvalKit integration (eval_kit modules, EvalKitGenerationTask, Megatron mcore task, metrics, pipeline wiring).
  • PR #1294: Directly related—reverts/removes the eval_kit integration added by this change (deletes docs, modules, and wiring).
  • PR #1133: Related—overlaps in generation task declarative hooks and pipeline orchestration changes relevant to self-contained/in-process tasks.

Suggested reviewers

  • melllinia
  • gwarmstrong
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

- **Docstring Coverage**: ⚠️ Warning. Docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
- **Title check**: ❓ Inconclusive. The title "eval kit support v2" is vague and generic, lacking clarity about the specific changes or scope of the pull request. Resolution: consider a more descriptive title, such as "Add VLMEvalKit integration for native benchmark execution".

✅ Passed checks (1 passed)

- **Description Check**: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/eval-kit.md`:
- Around line 238-245: The fenced code block in docs/evaluation/eval-kit.md
showing the results directory tree is missing a language label; change the
opening fence from ``` to ```text so markdownlint recognizes it as a plain-text
block (keep the closing ``` unchanged) to fix the lint error.
- Around line 76-95: Add a concrete, known-good metric example to the eval_kit
docs: insert a short example block showing the exact command (using the existing
run example with server_type=megatron, ++model_type=mcore, ++model_config and
++load_dir) and append an "Expected result" line that names the benchmark
(eval_kit.LibriSpeech_test_clean) and a validated metric (e.g., WER=12.3% on
that checkpoint). Place this example near the Mode 1: Megatron in-process
(mcore) section (around the existing eval command shown) and mirror similar
expected-result additions where the docs reference benchmarks (the areas around
the other noted sections for 118-133 and 247-253) so readers have one concrete
score to verify bring-up.

In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 523-535: The early-return for missing generations only returns
global wer/bleu/cer fields, which skips per-field leaderboard metrics when
reference_fields is used; update the missing-generation branches in the
evaluator (where task_type, generation are checked) to detect reference_fields
and populate per-field defaults (e.g. for each suffix in reference_fields emit
wer_<suffix>=1.0 and is_correct_<suffix>=False, and for ASR-PC also populate
wer_c_<suffix>, wer_pc_<suffix>, per_<suffix>=1.0) in addition to the global
fields; make the same change in the other missing-generation block near the
later duplicated section so empty generations count as worst-case failures for
per-field ASR_LEADERBOARD metrics.

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 79-86: The flattener in get_metrics()/summarize_results currently
only keeps numeric scalars from eval_kit_results and drops ASR Eval Kit payloads
where the score is inside a string field (e.g., value["result"] is a JSON
string); update the logic that iterates eval_kit_results (the loop over key,
value and nested sub_key, sub_value that writes into agg_dict) to detect when a
value or sub_value is a string containing JSON (specifically the ASR payload),
parse that JSON and extract concrete numeric metrics (e.g., score, wer,
num_entries) and insert them into agg_dict using normalized keys like
"{key}_result_score" or "{key}_{sub_key}_score" so summarize_results and the
written eval_kit_metrics.json preserve the actual numeric Eval Kit scores
instead of only num_entries.

In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 540-546: The current write-out in eval_kit.py turns non-dict
tabular eval_result into {"result": str(eval_result)}, losing structured
metrics; update the block handling eval_result in the function that writes
eval_kit_metrics.json so that when eval_result is a pandas.DataFrame or any
DataFrame-like (e.g., pyarrow table) you convert it to a JSON-serializable
structure (e.g., DataFrame.to_dict(orient="records") or to_dict() for a
single-row metrics map) and for numpy/scalar types normalize to native Python
types before json.dump; keep the output key as a dict (e.g., metrics_data) and
only fall back to str(eval_result) for truly non-serializable objects so
EvalKitMetrics can consume machine-readable aggregates.
- Around line 95-97: The skip_filled parameter is accepted but ignored; either
enforce it or implement resume behavior: update _start_async_writer() to check
the skip_filled flag and, when True, avoid deleting or truncating an existing
output_file and instead open it for appending/skip already-processed entries
(implement any required logic to detect/skippable records), or if you choose not
to support resume for this task, validate skip_filled early (e.g., in VLMEvalKit
initializer or the config validation path) and raise a clear exception
(ValueError) when skip_filled is True; apply the same change/validation to the
other occurrence referenced around the block at lines 328-331 so the flag is not
silently ignored.
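The DataFrame normalization suggested for `eval_kit.py` can be sketched without importing pandas at all, by duck-typing on `to_dict` (a real pandas `DataFrame` supports `to_dict(orient="records")`); the helper name `normalize_eval_result` is hypothetical, not the repo's actual function:

```python
import json


def normalize_eval_result(eval_result):
    """Best-effort conversion of an eval result to a JSON-serializable dict."""
    # DataFrame-like objects (e.g., pandas) expose to_dict(orient="records").
    to_dict = getattr(eval_result, "to_dict", None)
    if callable(to_dict):
        try:
            return {"records": eval_result.to_dict(orient="records")}
        except TypeError:
            return to_dict()  # plain dict-like .to_dict() without orient support
    if isinstance(eval_result, dict):
        return eval_result
    # numpy scalars expose .item() to yield a native Python value.
    item = getattr(eval_result, "item", None)
    if callable(item):
        return {"result": item()}
    # Last resort: keep the old stringified fallback for non-serializable objects.
    try:
        json.dumps(eval_result)
        return {"result": eval_result}
    except TypeError:
        return {"result": str(eval_result)}
```

The result is always a dict, so the existing `json.dump(...)` call into `eval_kit_metrics.json` stays unchanged.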

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 130-135: METRICS_TYPE_OVERRIDE currently forces summarize to
expect eval_kit_metrics.json but _evaluate_results() only computes ASR WER via
asr_wer() and writes {"wer":...}, and generate() swallows exceptions so non-ASR
benchmarks produce missing/incorrect metrics silently; update
_evaluate_results() to branch on the benchmark/type (or a metric_type field) and
compute/write the appropriate metric payload (not only wer) for each supported
eval_kit scenario, ensure asr_wer() is only invoked when the benchmark is ASR,
and remove or rework the broad exception handling in generate()/the 526-527
catch so unexpected failures propagate instead of creating a .done marker; refer
to METRICS_TYPE_OVERRIDE, _evaluate_results(), asr_wer(), and generate() when
making the changes.
- Around line 267-330: The _build_mcore_messages function currently buffers all
text in text_parts and appends one combined text at the end, which reorders and
collapses interleaved media/text; change it to preserve inline ordering by
emitting entries into mcore as you iterate messages and content (instead of
accumulating into text_parts): whenever you encounter a text string or a content
item with type "text", append a {"type":"text","value":...} to mcore immediately
(optionally merging only consecutive text fragments), and for image/audio items
resolve paths via _resolve_path and append their {"type":"image"/"sound",...}
entries in the same spot; remove the final combined_text join/append and ensure
use of the existing symbols messages, mcore, text_parts (or drop text_parts) and
_resolve_path to locate the changes.

In `@nemo_skills/pipeline/eval.py`:
- Around line 448-459: The code currently calls generation task classes'
get_extra_package_dirs() with no environment context, so env-only cluster YAML
vars (from pipeline_utils.get_env_variables(cluster_config) stored in env_vars)
aren't visible; modify the loop over benchmarks_dict to pass env_vars into
get_extra_package_dirs when supported: import inspect, check the signature of
task_cls.get_extra_package_dirs and if it accepts a parameter call
task_cls.get_extra_package_dirs(env_vars), otherwise call it with no args as a
fallback, preserve the seen_pkg_dirs logic and LOG.info usage, and ensure
extra_pkg_dirs still becomes None when empty; reference generation_task_class,
get_extra_package_dirs, pipeline_utils.get_env_variables, env_vars and
benchmarks_dict in your change.

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 448-452: The current batching only forces separate jobs for
self-contained tasks (has_self_contained) but still allows mixing different
generation task classes (e.g., eval_kit.* in vllm mode vs plain GenerationTask)
into the same job which causes _apply_task_overrides() to pick one env/container
for the whole batch; update the logic that sets num_jobs (or the batching step)
to either (a) enforce one job per distinct generation task class present in the
batch (treat each unique task class as its own group) or (b) add a validation
step before batching that inspects the task classes in each proposed batch and
raises/adjusts when they do not all agree on runtime overrides (so
_apply_task_overrides() will not be applied to a mixed-class batch). Reference
has_self_contained, num_jobs, total_evals and
nemo_skills/pipeline/eval.py::_apply_task_overrides() when making the change.
- Around line 406-410: When detecting a self-contained task (inside the block
that checks task_cls and hasattr(..., "is_self_contained")), fail fast if
server_parameters["server_gpus"] is falsy instead of leaving ba.num_gpus unset:
after setting ba.self_contained_task = True, check
server_parameters["server_gpus"] and if it is None/0/False raise a clear
exception (e.g., ValueError) describing that a self-contained task requires a
GPU allocation; otherwise assign ba.num_gpus = server_parameters["server_gpus"]
so num_gpus is always set for self-contained tasks.
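The "one job per distinct generation task class" option could be sketched as a grouping pass before batching; `group_benchmarks_by_task_class` is a hypothetical helper and only the `generation_task_class` attribute name is taken from the description above:

```python
from collections import defaultdict


def group_benchmarks_by_task_class(benchmarks: dict) -> dict:
    """Group benchmark args so each job contains only one generation task class.

    Keys are task classes (None for the default task); values are dicts of
    benchmark-name -> args that are safe to batch into the same job, so
    per-class runtime overrides apply uniformly within each group.
    """
    groups = defaultdict(dict)
    for name, ba in benchmarks.items():
        groups[getattr(ba, "generation_task_class", None)][name] = ba
    return dict(groups)
```

Each group can then be batched independently, so `_apply_task_overrides()` never sees a mixed-class batch.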

In `@nemo_skills/pipeline/utils/generation.py`:
- Around line 465-467: The code currently omits ++input_file when input_file is
None which can silently allow callers to produce generation commands with
neither input_file nor input_dir; update the generation command builder (the
block that sets common_args and uses input_file/input_dir) to keep the
conditional inclusion of ++input_file but add explicit validation: if the job is
not self-contained (e.g., a parameter/self_contained flag) and both input_file
and input_dir are None, raise a clear ValueError/ConfigurationError; reference
the variables common_args, input_file, input_dir and the surrounding generation
function so the check runs before building common_args.
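The requested validation might look like the following; the function name and parameter order are assumptions based on the comment, not the actual signature in `generation.py`:

```python
def validate_generation_inputs(input_file, input_dir, self_contained: bool) -> None:
    """Fail fast when a non-self-contained job has no input source at all.

    Self-contained tasks (e.g., eval_kit in-process mode) build their own
    dataset, so they legitimately run without input_file/input_dir.
    """
    if not self_contained and input_file is None and input_dir is None:
        raise ValueError(
            "Either input_file or input_dir must be provided for tasks that "
            "are not self-contained; got neither."
        )
```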

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a9e490c0-062e-44ab-80b0-e7bb8dcfc02c

📥 Commits

Reviewing files that changed from the base of the PR and between a5da597 and 2116473.

📒 Files selected for processing (16)
  • docs/evaluation/eval-kit.md
  • docs/evaluation/index.md
  • nemo_skills/dataset/eval_kit/__init__.py
  • nemo_skills/dataset/utils.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/metrics/eval_kit_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/translation_metrics.py
  • nemo_skills/inference/eval/eval_kit.py
  • nemo_skills/inference/factory.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/mcore_skills.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • requirements/eval-kit.txt

Comment on lines +76 to +95
## Running eval_kit Benchmarks

### Mode 1: Megatron in-process (mcore)

This is the primary mode. The model runs directly inside the `torchrun` process — no separate server.

```bash
export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore

ns eval \
--cluster=my_cluster \
--output_dir=/path/to/results \
--benchmarks=eval_kit.LibriSpeech_test_clean \
--server_type=megatron \
--server_gpus=8 \
--server_container=/path/to/eval-kit-nemo-skills.sqsh \
++model_type=mcore \
++model_config=/path/to/config.yaml \
++load_dir=/path/to/checkpoint/TP_1/
```


⚠️ Potential issue | 🟡 Minor

Add one known-good metric example.

The page shows how to launch eval_kit and what files get written, but it still doesn't give readers a concrete score to expect from a validated model/benchmark pair. A single “we tested X on Y and saw Z” example would make bring-up much easier.

Based on learnings: Applies to /{benchmarks,docs}//*.{md,py} : When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models

Also applies to: 118-133, 247-253
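A possible shape for that docs addition (the metric value is deliberately a placeholder to be filled in from a validated run, not a measured result):

```markdown
**Expected result:** on the reference checkpoint, `eval_kit.LibriSpeech_test_clean`
reports `WER = <your validated value>` in `eval_kit_metrics.json`.
```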


Comment on lines +238 to +245
```
<output_dir>/
└── eval-results/
    └── eval_kit.LibriSpeech_test_clean/
        ├── output.jsonl              # Per-sample results (generation + expected_answer)
        ├── eval_kit_metrics.json     # Aggregate metrics from VLMEvalKit
        └── metrics.json              # NeMo Skills summary
```

⚠️ Potential issue | 🟡 Minor

Add a language to this fenced block.

The results-tree block is unlabeled, so markdownlint will flag it.

🛠️ Suggested fix
-```
+```text
 <output_dir>/
 └── eval-results/
     └── eval_kit.LibriSpeech_test_clean/
         ├── output.jsonl              # Per-sample results (generation + expected_answer)
         ├── eval_kit_metrics.json     # Aggregate metrics from VLMEvalKit
         └── metrics.json              # NeMo Skills summary
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 238-238: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +523 to +535
    if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation:
        base = {
            "is_correct": False,
            "error": "missing_generation",
        }
        if task_type in _TRANSLATION_TYPES:
            return {**base, "bleu": 0.0}
        if task_type == "CER":
            return {**base, "cer": 1.0}
        if task_type == "ASR-PC":
            return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
        # ASR / ASR-ZH / ASR_LEADERBOARD
        return {**base, "wer": 1.0}

⚠️ Potential issue | 🟠 Major

Populate per-field WER defaults for missing leaderboard outputs.

When reference_fields is set, the normal ASR_LEADERBOARD path emits wer_<suffix> and is_correct_<suffix>, but the missing-generation early return only emits wer. Empty generations then stop contributing to the per-field leaderboard metrics instead of counting as worst-case failures.

🛠️ Suggested fix
     if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation:
         base = {
             "is_correct": False,
             "error": "missing_generation",
         }
         if task_type in _TRANSLATION_TYPES:
             return {**base, "bleu": 0.0}
         if task_type == "CER":
             return {**base, "cer": 1.0}
         if task_type == "ASR-PC":
             return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
+        if task_type == "ASR_LEADERBOARD":
+            updates = {**base, "wer": 1.0}
+            if config.reference_fields:
+                for ref_field in config.reference_fields:
+                    metric_suffix = ref_field.replace("text_", "") if ref_field.startswith("text_") else ref_field
+                    updates[f"wer_{metric_suffix}"] = 1.0
+                    updates[f"is_correct_{metric_suffix}"] = False
+            return updates
         # ASR / ASR-ZH / ASR_LEADERBOARD
         return {**base, "wer": 1.0}

Also applies to: 553-567


Comment on lines +79 to +86
for key, value in eval_kit_results.items():
    if isinstance(value, dict):
        # Nested results (e.g., per-category scores)
        for sub_key, sub_value in value.items():
            if isinstance(sub_value, (int, float)):
                agg_dict[f"{key}_{sub_key}"] = sub_value
    elif isinstance(value, (int, float)):
        agg_dict[key] = value

⚠️ Potential issue | 🟠 Major

This drops the documented ASR Eval Kit payload.

The flattener only keeps numeric scalars, but the new doc shows eval_kit_metrics.json can contain ASR results under a string result field. In that case get_metrics() returns only num_entries, so summarize_results loses the actual Eval Kit score. Please normalize the file to concrete numeric keys before writing it, or parse the result payload here.
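One way to extend the flattener so JSON-encoded string payloads are parsed rather than dropped; the nested `result` layout is taken from the comment above, and exact field names in real Eval Kit output are an assumption:

```python
import json


def _flatten_value(agg_dict: dict, key: str, value) -> None:
    """Flatten numeric scalars, nested dicts, and JSON-encoded string payloads."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        agg_dict[key] = value
    elif isinstance(value, dict):
        for sub_key, sub_value in value.items():
            _flatten_value(agg_dict, f"{key}_{sub_key}", sub_value)
    elif isinstance(value, str):
        # ASR payloads may arrive as a JSON string; try to parse them.
        try:
            parsed = json.loads(value)
        except ValueError:
            return  # plain string, nothing numeric to keep
        if isinstance(parsed, (dict, int, float)):
            _flatten_value(agg_dict, key, parsed)


def flatten_eval_kit_results(eval_kit_results: dict) -> dict:
    agg_dict: dict = {}
    for key, value in eval_kit_results.items():
        _flatten_value(agg_dict, key, value)
    return agg_dict
```

With this, a payload like `{"asr": {"result": "{\"score\": 0.9}"}}` yields `asr_result_score` instead of being silently discarded.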


Comment on lines +95 to +97
    work_dir: str = "./outputs"
    output_file: str = ""
    skip_filled: bool = False  # Accepted from pipeline but unused (VLMEvalKit has its own resume)

⚠️ Potential issue | 🟠 Major

skip_filled is a silent no-op here.

The config accepts skip_filled, but this task never branches on it and _start_async_writer() always removes the existing output file. Users can ask for resume/skip semantics and still get a rewrite. Either implement the flag end-to-end or reject it explicitly for this task. As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 328-331
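If resume is not going to be supported, the explicit rejection could live in the config's `__post_init__`; this is a plain-dataclass sketch (the real config class is Hydra-managed, so the validation may need to sit in the task's setup path instead, and the class name here is hypothetical):

```python
from dataclasses import dataclass


@dataclass
class EvalKitConfigSketch:
    work_dir: str = "./outputs"
    output_file: str = ""
    skip_filled: bool = False

    def __post_init__(self):
        # Fail fast instead of silently rewriting the output file.
        if self.skip_filled:
            raise ValueError(
                "skip_filled is not supported by the eval_kit task: "
                "VLMEvalKit manages its own resume logic and the async "
                "writer always recreates the output file."
            )
```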


Comment on lines +267 to +330
    def _build_mcore_messages(self, data_point: dict) -> list | None:
        """Convert a NeMo Skills data point into MultiModalMCore message list.

        If the data point has a 'messages' field (OpenAI format), converts it to
        list[dict] with types: "text", "image", "sound".

        Only user/assistant message text is included — system messages are skipped
        because MultiModalMCore's generate_inner() builds its own prompt template
        with system/user roles internally.

        If no 'messages' field, returns None (caller should use fill_prompt for text-only).
        """
        messages = data_point.get("messages")
        if not messages:
            return None

        mcore: list[dict] = []
        text_parts: list[str] = []

        for msg in messages:
            if not isinstance(msg, dict):
                continue
            role = msg.get("role", "")
            content = msg.get("content", "")

            # Skip system messages — generate_inner builds its own system prompt.
            if role == "system":
                continue

            # Audio: single or multiple
            if "audio" in msg:
                audio = msg["audio"]
                if isinstance(audio, dict) and "path" in audio:
                    path = self._resolve_path(audio["path"])
                    mcore.append({"type": "sound", "value": path, "sample_rate": 16000})
            if "audios" in msg:
                for audio in msg["audios"]:
                    if isinstance(audio, dict) and "path" in audio:
                        path = self._resolve_path(audio["path"])
                        mcore.append({"type": "sound", "value": path, "sample_rate": 16000})

            # Content: str or list of content items (text, image_url)
            if isinstance(content, str):
                if content.strip():
                    text_parts.append(content.strip())
            elif isinstance(content, list):
                for item in content:
                    if isinstance(item, dict):
                        if item.get("type") == "text" and "text" in item:
                            text_parts.append(item["text"].strip())
                        elif item.get("type") == "image_url":
                            image_url = item.get("image_url") or {}
                            url = image_url.get("url", "")
                            if url.startswith("file://"):
                                path = url[7:]
                            else:
                                path = url
                            if path:
                                path = self._resolve_path(path)
                                mcore.append({"type": "image", "value": path})

        combined_text = "\n".join(t for t in text_parts if t)
        if combined_text:
            mcore.append({"type": "text", "value": combined_text})

⚠️ Potential issue | 🟠 Major

Preserve multimodal content order when building MCore inputs.

This code buffers every text fragment in text_parts and appends one text item at the end, so [text, image] becomes [image, text] and multi-turn text gets collapsed into a single block. That changes prompt semantics for benchmarks with interleaved media or dialogue.

One way to preserve inline ordering
-        mcore: list[dict] = []
-        text_parts: list[str] = []
+        mcore: list[dict] = []

         for msg in messages:
             if not isinstance(msg, dict):
                 continue
@@
             if isinstance(content, str):
                 if content.strip():
-                    text_parts.append(content.strip())
+                    mcore.append({"type": "text", "value": content.strip()})
             elif isinstance(content, list):
                 for item in content:
                     if isinstance(item, dict):
                         if item.get("type") == "text" and "text" in item:
-                            text_parts.append(item["text"].strip())
+                            text = item["text"].strip()
+                            if text:
+                                mcore.append({"type": "text", "value": text})
                         elif item.get("type") == "image_url":
                             image_url = item.get("image_url") or {}
                             url = image_url.get("url", "")
@@
-        combined_text = "\n".join(t for t in text_parts if t)
-        if combined_text:
-            mcore.append({"type": "text", "value": combined_text})
-
         if not mcore:
             return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 267 - 330, The
_build_mcore_messages function currently buffers all text in text_parts and
appends one combined text at the end, which reorders and collapses interleaved
media/text; change it to preserve inline ordering by emitting entries into mcore
as you iterate messages and content (instead of accumulating into text_parts):
whenever you encounter a text string or a content item with type "text", append
a {"type":"text","value":...} to mcore immediately (optionally merging only
consecutive text fragments), and for image/audio items resolve paths via
_resolve_path and append their {"type":"image"/"sound",...} entries in the same
spot; remove the final combined_text join/append and ensure use of the existing
symbols messages, mcore, text_parts (or drop text_parts) and _resolve_path to
locate the changes.

Comment on lines +448 to +459
    # Build extra_package_dirs: include any dirs declared by generation task classes
    extra_pkg_dirs = []
    seen_pkg_dirs = set()
    for ba in benchmarks_dict.values():
        task_cls = ba.generation_task_class
        if task_cls is not None and hasattr(task_cls, "get_extra_package_dirs"):
            for pkg_dir in task_cls.get_extra_package_dirs():
                if pkg_dir not in seen_pkg_dirs:
                    seen_pkg_dirs.add(pkg_dir)
                    extra_pkg_dirs.append(pkg_dir)
                    LOG.info("Packaging extra dir from %s: %s", task_cls.__name__, pkg_dir)
    extra_pkg_dirs = extra_pkg_dirs or None

⚠️ Potential issue | 🟠 Major

Resolve extra_package_dirs from cluster_config env vars, not just the submitter’s shell.

env_vars = pipeline_utils.get_env_variables(cluster_config) is already available above, but get_extra_package_dirs() is called without that context. The eval-kit task classes currently read NEMO_SKILLS_VLMEVALKIT_PATH from os.environ, so a path defined only in the cluster YAML is invisible during packaging and vlmeval/ never gets shipped with the job.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 448 - 459, The code currently
calls generation task classes' get_extra_package_dirs() with no environment
context, so env-only cluster YAML vars (from
pipeline_utils.get_env_variables(cluster_config) stored in env_vars) aren't
visible; modify the loop over benchmarks_dict to pass env_vars into
get_extra_package_dirs when supported: import inspect, check the signature of
task_cls.get_extra_package_dirs and if it accepts a parameter call
task_cls.get_extra_package_dirs(env_vars), otherwise call it with no args as a
fallback, preserve the seen_pkg_dirs logic and LOG.info usage, and ensure
extra_pkg_dirs still becomes None when empty; reference generation_task_class,
get_extra_package_dirs, pipeline_utils.get_env_variables, env_vars and
benchmarks_dict in your change.
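The signature-aware fallback the prompt describes could be sketched as follows. Note that `collect_extra_package_dirs` and the two toy task classes below are hypothetical illustrations of the pattern, not the actual NeMo Skills code:

```python
import inspect


def collect_extra_package_dirs(task_classes, env_vars):
    """Gather extra package dirs, passing env_vars to hooks whose signature accepts it."""
    dirs, seen = [], set()
    for task_cls in task_classes:
        hook = getattr(task_cls, "get_extra_package_dirs", None)
        if hook is None:
            continue
        # Bound classmethods hide `cls`, so any remaining parameter can take env_vars.
        accepts_env = len(inspect.signature(hook).parameters) >= 1
        for pkg_dir in hook(env_vars) if accepts_env else hook():
            if pkg_dir not in seen:
                seen.add(pkg_dir)
                dirs.append(pkg_dir)
    return dirs or None


# Toy stand-ins for generation task classes (illustrative only).
class EnvAwareTask:
    @classmethod
    def get_extra_package_dirs(cls, env_vars):
        return [env_vars["NEMO_SKILLS_VLMEVALKIT_PATH"]]


class LegacyTask:
    @classmethod
    def get_extra_package_dirs(cls):
        return ["/opt/fixed"]
```

With this shape, a path defined only in the cluster YAML (surfaced through `env_vars`) reaches packaging, while older hooks that take no arguments keep working unchanged.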

Comment on lines +406 to +410
        if task_cls is not None and hasattr(task_cls, "is_self_contained"):
            if task_cls.is_self_contained(extra_arguments):
                ba.self_contained_task = True
                if server_parameters["server_gpus"]:
                    ba.num_gpus = server_parameters["server_gpus"]

⚠️ Potential issue | 🟠 Major

Fail fast when a self-contained task has no GPU allocation.

ba.self_contained_task = True can be set while server_gpus is still None, which leaves ba.num_gpus unset. nemo_skills/pipeline/eval.py then submits the in-process mcore job with num_gpus=None, so it only fails much later inside the container instead of at argument-validation time.

Suggested guard
         if task_cls is not None and hasattr(task_cls, "is_self_contained"):
             if task_cls.is_self_contained(extra_arguments):
                 ba.self_contained_task = True
-                if server_parameters["server_gpus"]:
-                    ba.num_gpus = server_parameters["server_gpus"]
+                if not server_parameters["server_gpus"]:
+                    raise ValueError(
+                        f"Benchmark {ba.name} uses a self-contained generation task and requires --server_gpus."
+                    )
+                ba.num_gpus = server_parameters["server_gpus"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 406 - 410, When detecting a
self-contained task (inside the block that checks task_cls and hasattr(...,
"is_self_contained")), fail fast if server_parameters["server_gpus"] is falsy
instead of leaving ba.num_gpus unset: after setting ba.self_contained_task =
True, check server_parameters["server_gpus"] and if it is None/0/False raise a
clear exception (e.g., ValueError) describing that a self-contained task
requires a GPU allocation; otherwise assign ba.num_gpus =
server_parameters["server_gpus"] so num_gpus is always set for self-contained
tasks.

Comment on lines +448 to +452
    # Self-contained tasks (e.g., eval_kit mcore mode) bypass the server/client split
    # and manage their own GPU allocation, so each benchmark must get its own job (no grouping).
    if has_self_contained and num_jobs != total_evals:
        LOG.info("Self-contained tasks detected, forcing num_jobs = total_evals (no job grouping).")
        num_jobs = total_evals

⚠️ Potential issue | 🟠 Major

Don't batch benchmarks with different runtime hooks into one job.

Only self-contained tasks force num_jobs = total_evals. An eval_kit.* benchmark in vllm mode can still share a job with a plain GenerationTask benchmark, and nemo_skills/pipeline/eval.py::_apply_task_overrides() will then choose a single env prefix/container for both. That makes the runtime depend on whichever benchmark happens to be first in the batch. Please enforce one generation task class per job, or validate that all task classes in a batch agree on these overrides.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 448 - 452, The current
batching only forces separate jobs for self-contained tasks (has_self_contained)
but still allows mixing different generation task classes (e.g., eval_kit.* in
vllm mode vs plain GenerationTask) into the same job which causes
_apply_task_overrides() to pick one env/container for the whole batch; update
the logic that sets num_jobs (or the batching step) to either (a) enforce one
job per distinct generation task class present in the batch (treat each unique
task class as its own group) or (b) add a validation step before batching that
inspects the task classes in each proposed batch and raises/adjusts when they do
not all agree on runtime overrides (so _apply_task_overrides() will not be
applied to a mixed-class batch). Reference has_self_contained, num_jobs,
total_evals and nemo_skills/pipeline/eval.py::_apply_task_overrides() when
making the change.
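Option (a) — one job per distinct generation task class — could be sketched as a small grouping helper. The helper and the two stand-in task classes below are hypothetical, intended only to show the grouping shape:

```python
from collections import defaultdict


def group_by_task_class(benchmark_entries):
    """Split (name, task_class) pairs into per-class groups so one job never
    mixes generation task classes with different runtime overrides."""
    groups = defaultdict(list)
    for name, task_cls in benchmark_entries:
        groups[task_cls].append(name)
    return list(groups.values())


# Illustrative stand-ins for the real task classes.
class GenerationTask:
    pass


class EvalKitGenerationTask:
    pass
```

Batching would then operate on each group independently, so `_apply_task_overrides()` only ever sees a homogeneous set of task classes.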

Comment on lines +465 to +467
    common_args = f"++skip_filled=True ++output_file={output_file}"
    if input_file is not None:
        common_args += f" ++input_file={input_file}"

⚠️ Potential issue | 🟠 Major

Keep missing input as an explicit opt-in.

Making ++input_file conditional is necessary for self-contained Eval Kit jobs, but changing the default here also lets ordinary callers emit a generation command with neither input_file nor input_dir. That turns a clear configuration error into a much later failure path for every non-self-contained benchmark.

🛠️ Suggested fix
 def get_generation_cmd(
     output_dir,
     input_file=None,
     input_dir=None,
+    allow_missing_input: bool = False,
     extra_arguments="",
     random_seed=None,
@@
     if input_file is not None and input_dir is not None:
         raise ValueError("Please provide either input_file or input_dir, not both.")
+    if input_file is None and input_dir is None and not allow_missing_input:
+        raise ValueError("Please provide either input_file or input_dir.")
@@
     common_args = f"++skip_filled=True ++output_file={output_file}"
     if input_file is not None:
         common_args += f" ++input_file={input_file}"
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/generation.py` around lines 465 - 467, The code
currently omits ++input_file when input_file is None which can silently allow
callers to produce generation commands with neither input_file nor input_dir;
update the generation command builder (the block that sets common_args and uses
input_file/input_dir) to keep the conditional inclusion of ++input_file but add
explicit validation: if the job is not self-contained (e.g., a
parameter/self_contained flag) and both input_file and input_dir are None, raise
a clear ValueError/ConfigurationError; reference the variables common_args,
input_file, input_dir and the surrounding generation function so the check runs
before building common_args.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)

448-459: ⚠️ Potential issue | 🟠 Major

Resolve extra package dirs from cluster-config env vars too.

env_vars = pipeline_utils.get_env_variables(cluster_config) is already available above, but get_extra_package_dirs() is still called without that context. If NEMO_SKILLS_VLMEVALKIT_PATH is defined only in the cluster YAML, packaging will miss the Eval Kit sources and the submitted job won't ship them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 448 - 459, The loop that builds
extra_pkg_dirs calls task_cls.get_extra_package_dirs() without passing env_vars
(from pipeline_utils.get_env_variables(cluster_config)), which misses dirs
declared via cluster-config env vars; update the code in the extra_pkg_dirs
construction to call get_extra_package_dirs with env_vars when available:
attempt task_cls.get_extra_package_dirs(env_vars) and fall back to
task_cls.get_extra_package_dirs() (e.g., catch TypeError) so benchmarks_dict ->
ba.generation_task_class and its get_extra_package_dirs can resolve
cluster-config paths like NEMO_SKILLS_VLMEVALKIT_PATH.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 51-54: The current logic only swaps the first "python -m " to
"torchrun ..." because combined_cmd.replace(..., 1) targets a single occurrence,
leaving later module launches unwrapped; update the transformation so every
"python -m " in combined_cmd is prefixed with "torchrun --nproc_per_node
{job_num_gpus} -m" when any task class has USE_TORCHRUN and job_num_gpus > 1
(i.e., remove the count-limited replace and perform a global replacement or
regex substitution of the token), ensuring this change is applied where
combined_cmd is built and referencing combined_cmd, task_classes, USE_TORCHRUN,
and job_num_gpus.
- Around line 57-61: The current loop silently falls back to the default
container when a task class sets CONTAINER_KEY that is not present in
cluster_config["containers"]; change the logic so that if a task class defines
key = getattr(tc, "CONTAINER_KEY", None) and key is truthy you must index
cluster_config["containers"] directly and raise a clear error when the key is
missing (e.g., if key not in cluster_config["containers"]: raise KeyError or
ValueError with a message including the missing key and available container
keys), otherwise set container = cluster_config["containers"][key]; this removes
the silent fallback to "nemo-skills" and surfaces misconfiguration.

---

Duplicate comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 448-459: The loop that builds extra_pkg_dirs calls
task_cls.get_extra_package_dirs() without passing env_vars (from
pipeline_utils.get_env_variables(cluster_config)), which misses dirs declared
via cluster-config env vars; update the code in the extra_pkg_dirs construction
to call get_extra_package_dirs with env_vars when available: attempt
task_cls.get_extra_package_dirs(env_vars) and fall back to
task_cls.get_extra_package_dirs() (e.g., catch TypeError) so benchmarks_dict ->
ba.generation_task_class and its get_extra_package_dirs can resolve
cluster-config paths like NEMO_SKILLS_VLMEVALKIT_PATH.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b1d043dc-51e8-4595-b471-cbdf663baccf

📥 Commits

Reviewing files that changed from the base of the PR and between 2116473 and 469eba6.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/eval.py

Comment on lines +51 to +54
    # Torchrun for multi-GPU data-parallel inference
    if any(getattr(tc, "USE_TORCHRUN", False) for tc in task_classes):
        if job_num_gpus and int(job_num_gpus) > 1:
            combined_cmd = combined_cmd.replace("python -m ", f"torchrun --nproc_per_node {job_num_gpus} -m ", 1)

⚠️ Potential issue | 🟠 Major

Apply torchrun before combining commands.

combined_cmd.replace("python -m ", ..., 1) only rewrites the first module launch. In sequential/parallel single-node batches, later benchmark commands in the same task still run as plain python -m, so only the first one gets the multi-GPU launcher.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 51 - 54, The current logic only
swaps the first "python -m " to "torchrun ..." because combined_cmd.replace(...,
1) targets a single occurrence, leaving later module launches unwrapped; update
the transformation so every "python -m " in combined_cmd is prefixed with
"torchrun --nproc_per_node {job_num_gpus} -m" when any task class has
USE_TORCHRUN and job_num_gpus > 1 (i.e., remove the count-limited replace and
perform a global replacement or regex substitution of the token), ensuring this
change is applied where combined_cmd is built and referencing combined_cmd,
task_classes, USE_TORCHRUN, and job_num_gpus.

Comment on lines +57 to +61
    container = cluster_config["containers"]["nemo-skills"]
    for tc in task_classes:
        key = getattr(tc, "CONTAINER_KEY", None)
        if key and key in cluster_config.get("containers", {}):
            container = cluster_config["containers"][key]

⚠️ Potential issue | 🟠 Major

Fail fast when a task class asks for an unknown container.

If CONTAINER_KEY is set but missing from the cluster config, this silently falls back to nemo-skills and the job runs in the wrong image. Since containers is already required here, use direct indexing and raise a clear error instead of masking the misconfiguration.
As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 57 - 61, The current loop silently
falls back to the default container when a task class sets CONTAINER_KEY that is
not present in cluster_config["containers"]; change the logic so that if a task
class defines key = getattr(tc, "CONTAINER_KEY", None) and key is truthy you
must index cluster_config["containers"] directly and raise a clear error when
the key is missing (e.g., if key not in cluster_config["containers"]: raise
KeyError or ValueError with a message including the missing key and available
container keys), otherwise set container = cluster_config["containers"][key];
this removes the silent fallback to "nemo-skills" and surfaces misconfiguration.
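The fail-fast resolution could look like the sketch below (`resolve_container` and the toy task classes are hypothetical names used for illustration): a truthy `CONTAINER_KEY` must resolve via direct indexing, and a missing key raises immediately with the available options.

```python
def resolve_container(cluster_config: dict, task_classes) -> str:
    """Resolve the container image, failing fast on an unknown CONTAINER_KEY."""
    containers = cluster_config["containers"]
    container = containers["nemo-skills"]
    for tc in task_classes:
        key = getattr(tc, "CONTAINER_KEY", None)
        if key:
            if key not in containers:
                raise ValueError(
                    f"Task class {tc.__name__} requested container '{key}', "
                    f"but only {sorted(containers)} are defined in the cluster config."
                )
            container = containers[key]
    return container


# Illustrative stand-ins.
class PlainTask:
    pass


class EvalKitTask:
    CONTAINER_KEY = "vlmevalkit"
```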
