27 changes: 27 additions & 0 deletions nemo_retriever/README.md
@@ -156,6 +156,11 @@ retriever local stage6 run --input-dir <dir>
- `recall_adapter: page_plus_one` (convert zero-indexed `page` CSVs to `pdf_page`)
- `recall_adapter: financebench_json` (convert FinanceBench JSON to `query,expected_pdf`)
- `recall_match_mode: pdf_page|pdf_only` controls recall matching mode.
- Harness presets support `extends:` inheritance so common tuning can live in one base preset.
- Worker-count tuning fields can be set to `auto` in a preset to defer those counts to the batch
runtime heuristics.
- The built-in `auto_workers` preset keeps the conservative `single_gpu` batch sizes while letting
worker/task counts scale from the detected Ray GPU inventory.
- Dataset presets configured under `/datasets/nv-ingest/...` will fall back to `/raid/$USER/...` when the dataset is not present in `/datasets`.
- Relative `query_csv` entries in harness YAML resolve from the config file directory first, then fall back to the repo root.
- The default `financebench` dataset preset now points at `data/financebench_train.json` and enables recall out of the box.
@@ -259,6 +264,9 @@ Preset dataset name
```bash
# Dataset preset from test_configs.yaml (recall-required example)
retriever harness run --dataset jp20 --preset single_gpu

# Keep the same single-GPU-style batch sizes but let worker counts auto-scale
retriever harness run --dataset jp20 --preset auto_workers
```
or

@@ -346,6 +354,25 @@ These fields use best-effort discovery and fall back to `null` or `"unknown"` ra

Sweep/nightly sessions additionally write:

6. Presets and heuristic sizing

Use fixed presets such as `single_gpu` and `dgx_8gpu` when you want repeatable, pinned worker counts.
Use `auto_workers` when you want the harness to preserve the same high-level tuning shape but let the
batch runtime discover worker/task counts from the currently available GPUs.

Today, `auto` is intentionally limited to worker/task count fields in
`nemo_retriever/harness/test_configs.yaml`, such as:

- `pdf_extract_workers`
- `page_elements_workers`
- `ocr_workers`
- `embed_workers`

When one of those fields is set to `auto`, the harness serializes the batch CLI sentinel value (`0`),
and `nemo_retriever.examples.batch_pipeline` falls back to its existing `resolve_requested_plan()`
heuristics at runtime. Batch sizes, CPU-per-task values, and GPU shares remain explicit, so the
preset behavior stays easy to reason about for new users.
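The `auto` handoff described above can be sketched as a small mapping step. The helper below is a simplified illustration, not the harness's actual API: it turns `auto` (or an unset value) into the batch CLI's `0` sentinel, and passes pinned counts through verbatim.

```python
def worker_count_cli_value(value):
    """Map a preset worker-count field to a batch CLI argument string.

    `auto` (case-insensitive) or None becomes the sentinel "0", which asks
    the batch runtime to size the stage from its own GPU-inventory
    heuristics; a pinned integer is passed through unchanged.
    """
    if value is None or str(value).strip().lower() == "auto":
        return "0"
    return str(int(value))


# Pinned preset -> explicit count; auto preset -> sentinel.
pinned = worker_count_cli_value(8)    # "8"
auto = worker_count_cli_value("auto")  # "0"
```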

The `runtime_metrics/` directory contains:

When Slack posting is enabled, the nightly summary is built from `session_summary.json` plus each
8 changes: 6 additions & 2 deletions nemo_retriever/harness/HANDOFF.md
@@ -35,9 +35,11 @@ It captures what exists now, what was intentionally chosen, and what to iterate

- Active default dataset: `jp20` (recall-required workflow).
- `bo20` remains ingestion-only (`query_csv: null`, `recall_required: false`).
- Two main presets are available:
- Presets now support `extends:` inheritance and selective `auto` tuning values.
- Commonly used presets are:
- `single_gpu`
- `dgx_8gpu`
- `auto_workers` (keeps conservative batch sizes, defers worker counts to runtime heuristics)
- Adapter-capable datasets:
- `earnings` uses `recall_adapter: page_plus_one` (`page` -> `pdf_page` conversion).
- `bo10k` wiring is included (adapter + mode), with recall disabled by default until query path is set.
@@ -147,6 +149,9 @@ Notes:
- Relative `query_csv` paths are stable across cwd/worktree runs.
- Dataset presets can resolve from `/raid/$USER/...` when `/datasets/nv-ingest/...` is unavailable.
- `financebench` now defaults to `data/financebench_train.json` with recall enabled.
- Preset usability improvements:
- Harness presets can inherit from a shared base preset.
- Worker-count fields can use `auto` so batch mode sizes them from Ray GPU inventory.
- Session UX improvements:
- Runs, sweeps, and nightly sessions accept repeatable `--tag` values persisted into artifacts.
- `retriever harness summary` prints a compact table from `session_summary.json`.
@@ -225,7 +230,6 @@ shim for that newer upstream behavior.

### P1

- Add preset inheritance or scaling helper to reduce duplicated numeric tuning in YAML.
- Add an artifact retention helper (manual command) to prune old sessions by age/size.
- Consider a single `recall_profile` abstraction (mapping to adapter + match mode) to reduce config coupling
and avoid invalid combinations.
36 changes: 19 additions & 17 deletions nemo_retriever/harness/test_configs.yaml
@@ -16,16 +16,11 @@ active:
write_detection_file: false

presets:
single_gpu:
pdf_extract_workers: 8
fixed_common:
pdf_extract_num_cpus: 2.0
pdf_extract_batch_size: 4
pdf_split_batch_size: 1
page_elements_batch_size: 4
page_elements_workers: 3
ocr_workers: 3
ocr_batch_size: 16
embed_workers: 3
embed_batch_size: 256
page_elements_cpus_per_actor: 1.0
ocr_cpus_per_actor: 1.0
@@ -34,23 +29,30 @@ presets:
gpu_ocr: 0.1
gpu_embed: 0.25

single_gpu:
extends: fixed_common
pdf_extract_workers: 8
page_elements_batch_size: 4
page_elements_workers: 3
ocr_workers: 3
embed_workers: 3

dgx_8gpu:
extends: single_gpu
pdf_extract_workers: 64
pdf_extract_num_cpus: 2.0
pdf_extract_batch_size: 4
pdf_split_batch_size: 1
page_elements_batch_size: 32
page_elements_workers: 24
ocr_workers: 24
ocr_batch_size: 16
embed_workers: 24
embed_batch_size: 256
page_elements_cpus_per_actor: 1.0
ocr_cpus_per_actor: 1.0
embed_cpus_per_actor: 1.0
gpu_page_elements: 0.1
gpu_ocr: 0.1
gpu_embed: 0.25

# Keep the same conservative single-GPU batch sizes, but let batch mode
# discover actor/task counts from the current Ray GPU inventory.
auto_workers:
extends: single_gpu
pdf_extract_workers: auto
page_elements_workers: auto
ocr_workers: auto
embed_workers: auto

datasets:
bo20:
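The `extends:` chains in the preset YAML above resolve parent-first, child-last: inherited keys are filled in from the parent, and the child's own keys win. The sketch below is a minimal stand-in for the harness's `_resolve_preset_values` (simplified preset data, hypothetical function name):

```python
def resolve_preset(presets, name, seen=()):
    """Resolve a preset mapping, applying `extends:` parents first.

    Child keys override inherited ones; a repeated name in the
    resolution chain is reported as an inheritance cycle.
    """
    if name in seen:
        raise ValueError(f"preset inheritance cycle: {' -> '.join((*seen, name))}")
    values = dict(presets[name])
    parent = values.pop("extends", None)
    if parent is None:
        return values
    merged = resolve_preset(presets, parent, (*seen, name))
    merged.update(values)  # the child's explicit keys win
    return merged


# Simplified mirror of the fixed_common -> single_gpu -> auto_workers chain.
presets = {
    "fixed_common": {"pdf_extract_batch_size": 4, "embed_batch_size": 256},
    "single_gpu": {"extends": "fixed_common", "pdf_extract_workers": 8},
    "auto_workers": {"extends": "single_gpu", "pdf_extract_workers": "auto"},
}
resolved = resolve_preset(presets, "auto_workers")
# Inherits the conservative batch sizes, overrides only the worker count.
```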
86 changes: 80 additions & 6 deletions nemo_retriever/src/nemo_retriever/harness/config.py
@@ -16,6 +16,12 @@
DEFAULT_TEST_CONFIG_PATH = NEMO_RETRIEVER_ROOT / "harness" / "test_configs.yaml"
DEFAULT_NIGHTLY_CONFIG_PATH = NEMO_RETRIEVER_ROOT / "harness" / "nightly_config.yaml"
VALID_RECALL_ADAPTERS = {"none", "page_plus_one", "financebench_json"}
AUTO_WORKER_FIELDS = {
"pdf_extract_workers",
"page_elements_workers",
"ocr_workers",
"embed_workers",
}
DEFAULT_NIGHTLY_SLACK_METRIC_KEYS = [
"pages",
"ingest_secs",
@@ -62,15 +68,15 @@ class HarnessConfig:
embed_model_name: str = "nvidia/llama-nemotron-embed-1b-v2"
write_detection_file: bool = False

pdf_extract_workers: int = 8
pdf_extract_workers: int | None = 8
pdf_extract_num_cpus: float = 2.0
pdf_extract_batch_size: int = 4
pdf_split_batch_size: int = 1
page_elements_batch_size: int = 4
page_elements_workers: int = 3
ocr_workers: int = 3
page_elements_workers: int | None = 3
ocr_workers: int | None = 3
ocr_batch_size: int = 16
embed_workers: int = 3
embed_workers: int | None = 3
embed_batch_size: int = 256
page_elements_cpus_per_actor: float = 1.0
ocr_cpus_per_actor: float = 1.0
@@ -103,10 +109,16 @@ def validate(self) -> list[str]:

for name in TUNING_FIELDS:
val = getattr(self, name)
if val is None and name in AUTO_WORKER_FIELDS:
continue
if name.startswith("gpu_") and float(val) < 0.0:
errors.append(f"{name} must be >= 0.0")
elif name.endswith("_workers") and int(val) < 1:
errors.append(f"{name} must be >= 1")
elif name.endswith("_batch_size") and int(val) < 1:
errors.append(f"{name} must be >= 1")
elif name.endswith("_num_cpus") and float(val) <= 0.0:
errors.append(f"{name} must be > 0.0")

return errors

@@ -121,6 +133,12 @@ def _parse_number(value: str) -> int | float:
return int(value)


def _parse_auto_worker_value(value: str) -> int | None:
if str(value).strip().lower() == "auto":
return None
return int(value)


def _resolve_config_path(config_file: str | None, default_path: Path) -> Path:
if config_file is None:
return default_path
@@ -141,6 +159,39 @@ def _read_yaml_mapping(path: Path) -> dict[str, Any]:
return data


def _resolve_preset_values(
presets: dict[str, Any], preset_name: str, *, resolution_stack: tuple[str, ...] = ()
) -> dict[str, Any]:
preset_key = str(preset_name)
if preset_key not in presets:
return {}

raw_preset = presets[preset_key]
if not isinstance(raw_preset, dict):
raise ValueError(f"Preset '{preset_key}' must be a mapping")

if preset_key in resolution_stack:
cycle = " -> ".join((*resolution_stack, preset_key))
raise ValueError(f"Preset inheritance cycle detected: {cycle}")

preset_values = dict(raw_preset)
parent_name = preset_values.pop("extends", None)
if parent_name is None:
return preset_values
if not isinstance(parent_name, str) or not parent_name.strip():
raise ValueError(f"Preset '{preset_key}' has invalid extends value: {parent_name!r}")
if parent_name not in presets:
raise ValueError(f"Preset '{preset_key}' extends unknown preset '{parent_name}'")

resolved_parent = _resolve_preset_values(
presets,
parent_name,
resolution_stack=(*resolution_stack, preset_key),
)
resolved_parent.update(preset_values)
return resolved_parent


def _resolve_path_like(value: str | None, base_path: Path = REPO_ROOT) -> str | None:
if value is None:
return None
@@ -210,7 +261,8 @@ def _apply_env_overrides(config_dict: dict[str, Any]) -> None:
}

for key in TUNING_FIELDS:
env_map[f"HARNESS_{key.upper()}"] = (key, _parse_number)
parser = _parse_auto_worker_value if key in AUTO_WORKER_FIELDS else _parse_number
env_map[f"HARNESS_{key.upper()}"] = (key, parser)

for env_key, (cfg_key, parser) in env_map.items():
raw = os.getenv(env_key)
@@ -241,6 +293,27 @@ def _parse_cli_overrides(overrides: list[str] | None) -> dict[str, Any]:
return parsed


def _normalize_tuning_values(config_dict: dict[str, Any]) -> None:
for name in AUTO_WORKER_FIELDS:
if name not in config_dict:
continue

value = config_dict[name]
if value is None:
config_dict[name] = None
continue

if isinstance(value, str):
raw = value.strip()
if raw.lower() == "auto" or raw == "":
config_dict[name] = None
continue
value = int(raw)

normalized = int(value)
config_dict[name] = None if normalized == 0 else normalized


def load_harness_config(
*,
config_file: str | None = None,
@@ -292,13 +365,14 @@ def load_harness_config(
dataset_label = Path(str(dataset_ref)).name
merged["dataset_dir"] = str(dataset_ref)

preset_values = dict(presets.get(str(preset_ref), {}))
preset_values = _resolve_preset_values(presets, str(preset_ref))
merged.update(preset_values)
merged.update({k: v for k, v in sweep_data.items() if k not in {"dataset", "preset"}})
merged.update(cli_override_map)
if cli_recall_required is not None:
merged["recall_required"] = cli_recall_required
_apply_env_overrides(merged)
_normalize_tuning_values(merged)

dataset_dir = merged.get("dataset_dir")
if dataset_dir is None:
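The normalization added in `_normalize_tuning_values` treats `auto`, an empty string, and `0` as equivalent "defer to runtime" markers. The sketch below mirrors that behavior under simplified assumptions (standalone function, plain dict input) rather than reproducing the harness module:

```python
AUTO_WORKER_FIELDS = {
    "pdf_extract_workers",
    "page_elements_workers",
    "ocr_workers",
    "embed_workers",
}


def normalize_worker_fields(config):
    """Normalize worker-count fields in place.

    `auto` (any case), "", None, and 0 all become None, meaning the
    batch runtime should size the stage itself; any other value must
    coerce to a positive int and is kept.
    """
    for name in AUTO_WORKER_FIELDS:
        if name not in config:
            continue
        value = config[name]
        if value is None:
            continue
        if isinstance(value, str):
            raw = value.strip().lower()
            if raw in ("auto", ""):
                config[name] = None
                continue
            value = int(raw)
        config[name] = None if int(value) == 0 else int(value)


cfg = {"pdf_extract_workers": "auto", "ocr_workers": "4", "embed_workers": 0}
normalize_worker_fields(cfg)
# "auto" and 0 are deferred (None); "4" is pinned as the int 4.
```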
17 changes: 12 additions & 5 deletions nemo_retriever/src/nemo_retriever/harness/run.py
@@ -188,6 +188,12 @@ def _resolve_lancedb_uri(cfg: HarnessConfig, artifact_dir: Path) -> str:
return str(p)


def _command_worker_count_value(value: int | None) -> str:
if value is not None:
return str(value)
return "0"


def _build_command(cfg: HarnessConfig, artifact_dir: Path, run_id: str) -> tuple[list[str], Path, Path, Path]:
runtime_dir = artifact_dir / "runtime_metrics"
runtime_dir.mkdir(parents=True, exist_ok=True)
@@ -215,7 +221,7 @@ def _build_command(cfg: HarnessConfig, artifact_dir: Path, run_id: str) -> tuple
cfg.recall_match_mode,
"--no-recall-details",
"--pdf-extract-tasks",
str(cfg.pdf_extract_workers),
_command_worker_count_value(cfg.pdf_extract_workers),
"--pdf-extract-cpus-per-task",
str(cfg.pdf_extract_num_cpus),
"--pdf-extract-batch-size",
@@ -225,13 +231,13 @@ def _build_command(cfg: HarnessConfig, artifact_dir: Path, run_id: str) -> tuple
"--page-elements-batch-size",
str(cfg.page_elements_batch_size),
"--page-elements-actors",
str(cfg.page_elements_workers),
_command_worker_count_value(cfg.page_elements_workers),
"--ocr-actors",
str(cfg.ocr_workers),
_command_worker_count_value(cfg.ocr_workers),
"--ocr-batch-size",
str(cfg.ocr_batch_size),
"--embed-actors",
str(cfg.embed_workers),
_command_worker_count_value(cfg.embed_workers),
"--embed-batch-size",
str(cfg.embed_batch_size),
"--page-elements-cpus-per-actor",
@@ -554,9 +560,10 @@ def sweep_command(
typer.echo("Sweep dry run:")
for idx, run in enumerate(runs):
tag_text = f" tags={normalized_tags}" if normalized_tags else ""
effective_preset = preset if preset is not None else run.get("preset")
plan_line = (
f" {idx + 1:03d}: name={run.get('name')} "
f"dataset={run.get('dataset')} preset={run.get('preset')}{tag_text}"
f"dataset={run.get('dataset')} preset={effective_preset}{tag_text}"
)
typer.echo(plan_line)
raise typer.Exit(code=0)