Conversation
Signed-off-by: Kapil Arya <kapila@nvidia.com>
…ile_csv. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds new TOML configs and SLURM/K8s scenarios; expands AI Dynamo public models (WorkerConfig, WorkerBaseArgs, LMCache, GenAIPerf); refactors Slurm/Kubernetes command generation and shell orchestration; replaces CSV parsing with pandas-based extraction; adds genai-perf helper script; updates tests and sbatch reference data. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryExpanded AI Dynamo workload to support multiple backends (vllm, sglang, sglang_dsr1) with disaggregated prefill/decode workers. Added LMCache integration, GenAI-Perf workload support, constraint validation for DSE, and improved container mounting strategy. Key changes:
Issues found:
Confidence Score: 3/5
Important Files Changed
|
| if td.cmd_args.hf_home_path: | ||
| result.append(f"{td.cmd_args.hf_home_path}:{td.cmd_args.hf_home_path}") |
There was a problem hiding this comment.
Mount path mismatch: test expects system.hf_home_path -> workspace_path/hf_home, but implementation uses cmd_args.hf_home_path -> cmd_args.hf_home_path. Check test_command_gen_strategy_slurm.py:115
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Actionable comments posted: 43
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cloudai/systems/kubernetes/kubernetes_system.py (1)
361-371:⚠️ Potential issue | 🟠 MajorSame
shell=Truepattern for copying results.Line 370 also uses
shell=Truewith a string command built via" ".join(...). Apply the same list-based approach for consistency and safety.🔒 Proposed fix
- cp_logs_cmd = " ".join( - [ - "kubectl", - "cp", - f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", - str(job.test_run.output_path / "genai-perf"), - ] - ) + cp_logs_cmd = [ + "kubectl", + "cp", + f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", + str(job.test_run.output_path / "genai-perf"), + ] logging.debug(f"Copying genai-perf results with command: {cp_logs_cmd}") - p = subprocess.run(cp_logs_cmd, shell=True, capture_output=True, text=True) + p = subprocess.run(cp_logs_cmd, capture_output=True, text=True) logging.debug(f"Returned code {p.returncode}, stdout: {p.stdout}, stderr: {p.stderr}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/systems/kubernetes/kubernetes_system.py` around lines 361 - 371, The code builds cp_logs_cmd as a single string and calls subprocess.run with shell=True; change this to pass an argument list to subprocess.run to avoid shell=True: construct a list like ["kubectl","cp", f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", str(job.test_run.output_path / "genai-perf")] and call subprocess.run with that list (remove shell=True) while keeping capture_output=True and text=True; update the debug log to show the list or join it for readability and keep logging the returncode/stdout/stderr from the subprocess result.src/cloudai/workloads/ai_dynamo/ai_dynamo.py (1)
120-129: 🧹 Nitpick | 🔵 TrivialRemove unused classes:
PrefillWorkerArgsandDecodeWorkerArgsare dead code.These empty subclasses are exported in the public API but are never instantiated or referenced anywhere in the codebase. The migration to
WorkerConfig+WorkerBaseArgshas made them obsolete.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py` around lines 120 - 129, Remove the dead empty subclasses PrefillWorkerArgs and DecodeWorkerArgs from src/cloudai/workloads/ai_dynamo/ai_dynamo.py: delete their class definitions (they only subclass WorkerBaseArgs and contain no members) and remove them from any module exports or __all__ if present; ensure any external references are migrated to the canonical types (WorkerConfig or WorkerBaseArgs) and run tests/type-check to confirm nothing imports these names anymore.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml`:
- Around line 17-31: Scenario metadata fields are inconsistent: the top-level
name ("dynamo_sglang_kvbm"), the test entry test_name ("sglang-Qwen3-0.6B"), and
the model under [Tests.cmd_args.dynamo]
("deepseek-ai/DeepSeek-R1-Distill-Llama-8B") must match; update either the model
to Qwen3-0.6B to match test_name, or rename test_name and/or the top-level name
to reflect DeepSeek-Llama-8B and remove the "kvbm" suffix if KVBM is not used so
that name, test_name, and the model string are consistent.
In `@conf/experimental/ai_dynamo/test_scenario/vllm_kvbm_slurm.toml`:
- Around line 84-85: The two environment entries DYN_KVBM_LEADER_ZMQ_PUB_PORT
and DYN_KVBM_LEADER_ZMQ_ACK_PORT are formatted as
DYN_KVBM_LEADER_ZMQ_PUB_PORT="57001" and DYN_KVBM_LEADER_ZMQ_ACK_PORT="57002"
which is inconsistent with the rest of the [Tests.extra_env_vars] section;
update these two lines to use spaces around the equals sign (e.g.,
DYN_KVBM_LEADER_ZMQ_PUB_PORT = "57001") so they match the existing style and
improve readability.
In `@conf/experimental/ai_dynamo/test/sglang.toml`:
- Line 23: The hf_home_path entry currently contains a hardcoded,
user/cluster-specific path which will break for others; replace the literal
"/lustre/fsw/coreai_tritoninference_triton3/kapila/huggingface" with a generic
placeholder or a shared default (e.g., %HF_HOME% or /opt/shared/huggingface) in
the hf_home_path setting so it matches the vllm.toml convention and is portable
across environments; update any documentation or comments nearby to indicate the
expected placeholder semantics.
- Line 28: The backend field in the TOML is incorrectly set to "vllm" causing
the sglang-specific configuration to be skipped; update backend = "vllm" to
backend = "sglang" in the conf/experimental/ai_dynamo/test/sglang.toml so the
code path that checks if backend == "sglang" will include the sglang arguments
(deepep.json path, sgl_http_server.py path) and align with the docker image and
worker commands for sglang.
In `@conf/experimental/ai_dynamo/test/vllm.toml`:
- Line 35: The etcd-cmd configuration value contains a trailing space causing
the command string to include an unintended space; remove the trailing
whitespace from the etcd-cmd value (the key name: etcd-cmd) in the vllm.toml
(and similarly adjust sglang.toml Line 35) so the string ends with "/tmp/etcd"
without the extra space.
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 307-309: The current subprocess.run call builds a shell command
string (cp_wrapper_cmd) using f-strings with wrapper_script_path, frontend_pod,
pod_wrapper_path and self.default_namespace and calls it with shell=True (in
kubernetes_system.KubernetesSystem or relevant method) which is a command
injection risk; change it to pass the command as a list of arguments to
subprocess.run (e.g., ["kubectl","cp", wrapper_script_path,
f"{self.default_namespace}/{frontend_pod}:{pod_wrapper_path}"]) and remove
shell=True while keeping capture_output=True, text=True, and check=True so
arguments are not interpreted by a shell. Ensure you update references to
cp_wrapper_cmd and the subprocess.run invocation accordingly.
- Around line 311-319: A NameError will occur because chmod_cmd is not defined
before being used to build kubectl_exec_cmd; define chmod_cmd as the kubectl
chmod command array (e.g. ["chmod", "+x", "<wrapper_target_path>"]) immediately
before constructing kubectl_exec_cmd, using the same pod destination path
variable used earlier when the wrapper was copied into the pod (e.g.
wrapper_dest_path or the variable that holds the target path). Ensure chmod_cmd
is in the same scope as kubectl_exec_cmd (inside the same method), and keep the
existing logging and subprocess.run usage unchanged so the exec attempt logs
stdout/stderr and handles exceptions.
- Around line 322-331: The code is reading tdef.cmd_args.genai_perf.args (which
is empty) instead of the GenAIPerf workload's extra fields; change the logic
that builds genai_perf_cmd_parts to pull fields directly from
tdef.cmd_args.genai_perf (e.g., call model_dump()/dict() on that object and
iterate its items) while excluding standard Workload keys like name, cmd,
script, args, and any other base fields—append each remaining key/value as
--key=value just like the SLURM strategy does; keep the existing handling for
genai_perf.extra_args (string vs list) afterwards using the same extra_args
logic already present.
In `@src/cloudai/systems/slurm/slurm_command_gen_strategy.py`:
- Around line 52-53: Move the two fixed strings currently set as instance
attributes in __init__ (container_install_path and container_results_path) to
class-level constants (e.g., CONTAINER_INSTALL_PATH and CONTAINER_RESULTS_PATH)
on the SlurmCommandGenStrategy class, then have __init__ assign the instance
attributes from those constants so per-instance overrides still work; update any
references to use the instance attributes
(container_install_path/container_results_path) so behavior is unchanged while
the defaults are visible at the class level.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 264-268: Replace the parameterless methods success_marker() and
failure_marker() with class-level constants (e.g., SUCCESS_MARKER =
"success-marker.txt" and FAILURE_MARKER = "failure-marker.txt") on the same
class that defines those methods; remove the old methods and update all call
sites that use instance.success_marker() or instance.failure_marker() to
reference the constants (ClassName.SUCCESS_MARKER or instance.SUCCESS_MARKER) to
preserve behavior and avoid call overhead. Ensure any imports or references in
the module still resolve after renaming and run tests to confirm no breakage.
- Around line 309-317: The commented-out self.hf_model in the list returned by
the method that builds the installable result is a leftover debug artifact;
either re-enable it by restoring self.hf_model into the returned list (alongside
self.docker_image, self.script, self.genai_perf_script, *self.git_repos) if the
HuggingFace model should be downloaded, or remove the commented line entirely
and, if leaving it out intentionally, add a one-line comment explaining why
self.hf_model is excluded so future readers understand the decision.
- Line 327: Locate the invalid `# noqa` directive in the same block that returns
JobStatusResult(...) (the line using failure_marker.read_text()); either remove
the `# noqa` entirely if it’s unnecessary or replace it with an explicit,
comma-separated list of flake8 codes (e.g., `# noqa: E501`) that correctly
suppress the specific lint errors for that expression, then re-run the linter to
confirm the directive is valid.
- Around line 332-346: The loop currently uses a membership test that treats
falsy values as missing and an assert that can be disabled; change the check
from "if not workload_map.get(workload)" to an explicit key membership test "if
workload not in workload_map" to correctly detect missing keys even when values
are falsy, and replace "assert report_name is not None" with a defensive guard
that logs/raises a clear error or marks the job as failed (e.g., log an error
mentioning the workload and set result = False) before using report_name; keep
using workload_map, report_name, cmd_args.workloads and return
JobStatusResult(result) behavior.
- Around line 200-205: The GitRepo instance assigned to variable "repo" uses the
SSH URL "git@github.com:LMCache/LMCache.git"; change this to the HTTPS URL
"https://github.com/LMCache/LMCache.git" in the GitRepo(...) invocation so it
matches other GitRepo usages (e.g., the dynamo repo) and avoids SSH auth
issues—leave the other fields (commit, mount_as) unchanged.
- Around line 159-165: The prefill_worker field in ai_dynamo.py must be Optional
so aggregated mode remains falsy; change the declaration of prefill_worker from
WorkerConfig = Field(default_factory=WorkerConfig) to Optional[WorkerConfig]
with default None (use Field(default=None) and add typing.Optional import) so
downstream checks like constraint_check (prefill_nodes =
prefill_worker.num_nodes if prefill_worker else 0), test_gen_prefill, and the if
prefill_worker: branches in slurm_command_gen_strategy.py and
kubernetes_json_gen_strategy.py continue to work correctly as configured vs not
configured.
- Around line 348-366: In constraint_check, avoid TypeError when
tensor_parallel_size or pipeline_parallel_size are None by coercing those fields
to safe defaults (e.g., 1) before comparing or multiplying; update the reads for
prefill_tp, decode_tp, prefill_pp, decode_pp (derived from
prefill_worker.args.tensor_parallel_size and .pipeline_parallel_size, and
decode_worker.args.*) to use a fallback like "value if value is not None else 1"
(or a common helper) so comparisons in constraint_check and the tp*pp
multiplications never operate on None.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`:
- Around line 8-9: The script currently hardcodes DONE_MARKER and
FATAL_ERROR_MARKER which can drift from Python's
success_marker()/failure_marker() values; change ai_dynamo.sh to read the marker
filenames from the existing CLI flags (--success-marker and --failure-marker)
instead of the hardcoded DONE_MARKER="./success-marker.txt" and
FATAL_ERROR_MARKER="./failure-marker.txt", and use those variables wherever
DONE_MARKER or FATAL_ERROR_MARKER are referenced so the shell and Python share a
single source of truth.
- Around line 880-896: The function _query_frontend is defined but never used
(SC2329); either remove it if it's only a leftover debugging helper, or
integrate it into the call flow by invoking _query_frontend where chat
completions are produced (for example from the main request/dispatch function
that uses dynamo_args and RESULTS_DIR), and/or document it as a utility with a
clear comment showing intended usage and expected parameters (model in
dynamo_args, RESULTS_DIR path); ensure any invocation passes the content and
max_tokens arguments and that curl calls use the same configuration as the rest
of the module.
- Around line 960-963: The setup currently runs a runtime install with "uv pip
install -e $lmcache_path" (see the setup_lmcache section) which is unsafe and
unquoted; instead, remove the runtime install and require LMCache be
pre-installed during image build or the script's install phase (or add an
explicit installation step run at image build time), and if you must keep
runtime fallback, at minimum quote the path variable ("$lmcache_path"), check
for writability and existing installation before attempting install, and emit a
clear error instructing operators to rebuild the image if installation fails;
update the log and the call site (uv pip install -e $lmcache_path) accordingly.
- Around line 269-287: The assignments to ENABLE_LMCACHE and ENABLE_KVBM in
_apply_connector_settings are unused; remove those two variable assignments (the
ENABLE_LMCACHE=1 and ENABLE_KVBM=1 lines) from the _apply_connector_settings
function, or alternatively if you intend them to be used by subprocesses, export
them (export ENABLE_LMCACHE and export ENABLE_KVBM) or wire them into downstream
logic—update the _apply_connector_settings function (which calls _has_connector)
accordingly so there are no unused variables left.
- Around line 151-156: Remove the duplicate case arm handling the `--workloads)`
pattern inside the case that populates the `dynamo_args` associative array:
there are two `--workloads)` branches; delete the second occurrence so
`dynamo_args["workloads"]="$2"` is only set once and no unreachable branch
remains. Ensure remaining case arms (e.g., `--dynamo-*`) and any
`shift`/argument consumption logic still align after removing the duplicate.
- Around line 1077-1080: The script does an unguarded cd $RESULTS_DIR which can
silently fail; update the cd invocation in ai_dynamo.sh to use a quoted variable
and handle failure (e.g., cd "$RESULTS_DIR" || exit 1) so the script stops if
the directory change fails; apply the same quoting/guard pattern to other cd
uses around RESULTS_DIR and workspace-path (e.g., cd
"${dynamo_args["workspace-path"]}" || exit 1) to prevent continuing in the wrong
directory.
- Around line 456-462: The unquoted use of kill $(jobs -p) can mis-handle empty
output or whitespace; capture the PIDs from jobs -p into a shell array or
variable, check whether it is non-empty, and only then call kill with the PID
list using proper array expansion and the safe option delimiter (i.e., use kill
-- with the expanded PID array/variable), preserving the 2>/dev/null || true
behavior and leaving perform_exit 1 unchanged; locate the existing kill
invocation (and surrounding cleanup block) to implement this guarded-kill
pattern using jobs -p as the source.
- Around line 829-844: In launch_prefill the array and string expansions are
unquoted which breaks arguments containing spaces; update expansions to use
proper quoted forms: replace ${args_arr[*]} and ${args_arr[@]} with the quoted
array expansion "${args_arr[@]}", and quote any config string expansions like
${prefill_config["extra-args"]} and the command expansion
${prefill_config["cmd"]} as "${prefill_config["cmd"]}" and
"${prefill_config["extra-args"]}" so each argument is preserved; locate these in
the launch_prefill block around the log and the CUDA_VISIBLE_DEVICES command and
apply the quoted-expansion fix.
- Around line 774-789: The array construction and expansions are unquoted which
causes word-splitting and corrupts multi-word arguments: change args_arr+=($key
$(replace_placeholders "${decode_args[$key]}")) to use quoted elements (e.g.,
args_arr+=("$key" "$(replace_placeholders "${decode_args[$key]}")")) and always
expand the array as "${args_arr[@]}"; also quote command and extra-args when
invoking the process (use "${decode_config["cmd"]}" and
"${decode_config["extra-args"]}" or split extra-args into an array) so
paths/commands with spaces aren't broken; apply the same quoting fixes to the
prefill launch path where args_arr, replace_placeholders, decode_config["cmd"]
and decode_config["extra-args"] are used, and keep redirection to "$log_file"
quoted as well.
In `@src/cloudai/workloads/ai_dynamo/genai_perf.sh`:
- Around line 93-103: The log call in genai_perf.sh uses Python-style triple
quotes ("""...""") which is invalid/fragile in bash; replace that invocation of
log with a proper bash multiline string (for example use a heredoc: log <<'EOF'
... EOF) or build the message via printf/echo and pass it to log, preserving the
same interpolations (model, url, port, result_dir, install_dir, report_name,
cmd, extra_args and the genai_perf_args loop) so the output remains identical;
update the log invocation that references genai_perf_args and the variables
listed to use the chosen heredoc or printf approach.
- Around line 84-86: The case arm matching `--*)` currently does `shift 2` which
assumes every unknown long flag has a value and can misalign parsing if the flag
is boolean; update the `--*)` handling in the case statement in genai_perf.sh to
inspect the next token (e.g., `$2`/peek at `$1` after shift) and only consume
the next argument when it does not start with `-` (otherwise only `shift 1`), so
boolean unknown flags consume a single token and two-part flags consume two.
- Around line 112-123: The grep results used to populate
output_tokens_per_second, request_throughput, and request_count can be empty and
cause silent wrong arithmetic and bad report lines; update the block that sets
profile_path/output_tokens_per_second/output_tokens_per_second_per_gpu/request_throughput/request_count
to validate each grep result (e.g., check the variable is non-empty after each
grep) before using it, and if any are empty either log an error and skip writing
the derived lines or set a safe default and mark the report as invalid;
specifically add guards around the assignments to output_tokens_per_second,
request_throughput, and request_count and only compute
output_tokens_per_second_per_gpu (division by total_gpus) when
output_tokens_per_second is numeric, otherwise handle the error path (logging
and/or early return) so echo lines appending to "$result_dir/$report_name" do
not write garbage.
- Around line 146-152: The script currently builds a concatenated string
full_cmd="$cmd $cmdline_args" and executes it with eval "$full_cmd", which is a
command injection risk when genai_perf_args/extra_args come from user TOML;
replace this by constructing a safe argv array (e.g., start with the command
variable cmd and append each argument as separate elements from
cmdline_args/extra_args without word-splitting) and then execute that array
directly (no eval) so shell metacharacters in JSON or other inputs are passed as
literal arguments; update the launch site (the log "Launching genai-perf..." and
the eval call) to use the array execution and ensure any JSON like
{"temperature": 0.7, "max_tokens": 128} is treated as a single quoted argument
when appended.
- Around line 6-13: Remove the unused shell variables repo, endpoint, and
report_file to fix SC2034: delete their declarations (repo at top, endpoint
assignment around where --endpoint is parsed, and report_file at the end) and
also remove the corresponding --endpoint case in process_args so there is no
leftover parsing for endpoint; if endpoint functionality is intended instead,
wire the parsed value into the actual usage instead of leaving it unused (look
for variables named repo, endpoint, report_file and the process_args --endpoint
branch to update).
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Around line 34-44: The extract_metric_from_csv function currently assumes a
"Metric" column exists and can raise KeyError; add a validation that "Metric" is
in df.columns (similar to the metric_type check) before accessing df["Metric"],
log a clear message and return METRIC_ERROR if missing, and then proceed with
the existing metric_type and metric_name checks so all column accesses are
guarded.
- Around line 46-53: The get_metric method currently does unsafe unpacking with
metric.split(":"), which raises ValueError on malformed input; update get_metric
to call metric.split(":", 2) (or split with maxsplit=2), then defensively check
the resulting list length and assign benchmark_name, metric_name, metric_type
from the parts with sensible defaults (e.g., keep "genai_perf" and "avg" if
parts missing) and/or log and raise a clear error for invalid formats; reference
the get_metric function and the benchmark_name/metric_name/metric_type variables
to locate where to add the split(maxsplit=2) and the length checks.
- Around line 27-28: Remove the no-op TYPE_CHECKING block: delete the "if
TYPE_CHECKING: pass" construct in report_generation_strategy.py so the module no
longer contains an empty type-check guard; ensure no other code relied on that
block (search for TYPE_CHECKING references) and run tests/type-checks to confirm
nothing breaks.
- Line 23: Remove the module-level "import pandas as pd" and instead use the
project's lazy importer from src/cloudai/util/lazy_imports.py (accessed as
lazy.pd) where pandas is needed; update places that call pd.read_csv(...) (and
any other pd.* uses) to call lazy.pd.read_csv(...) or assign local pd = lazy.pd
inside the function/method that needs it (e.g., the report generation function
in report_generation_strategy.py) to satisfy the TID253 lint rule.
In `@src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py`:
- Around line 85-88: td.cmd_args.storage_cache_dir may be a list[str] but the
code appends it as a single string to args, producing a malformed shell
argument; update the Slurm command generation (the args list in
slurm_command_gen_strategy.py where td.cmd_args.storage_cache_dir is used) to
detect if storage_cache_dir is iterable and, if so, append a separate
"--storage-cache-dir <path>" entry for each item in the list, otherwise append
the single "--storage-cache-dir <path>" entry — ensure you reference
td.cmd_args.storage_cache_dir and the args list when making the change.
- Around line 67-73: The hasattr(base_model, "args") check in
_get_nested_toml_args is redundant because getattr(base_model, "args", None)
will return None if the attribute is missing; remove the hasattr(...) condition
and simply use nested_args = getattr(base_model, "args", None) with a single
null check (if nested_args is not None) before calling
self._get_toml_args(nested_args, prefix + "args-") so the function behavior is
unchanged but the guard is simplified.
- Around line 39-41: td.cmd_args.storage_cache_dir can be a list, so update the
handling in slurm_command_gen_strategy: where you currently do
result.append(f"{td.cmd_args.storage_cache_dir}:{td.cmd_args.storage_cache_dir}")
(container mounts), check if storage_cache_dir is a list and for each entry
append f"{dir}:{dir}" to result; and where you build the CLI flag (the
"--storage-cache-dir ..." piece), if storage_cache_dir is a list emit one
"--storage-cache-dir <dir>" per item (or otherwise join into multiple flags)
instead of interpolating the list as a string; reference
td.cmd_args.storage_cache_dir and the code paths that append mounts and CLI args
and replace single f-string usage with list-aware logic.
In `@tests/ref_data/ai-dynamo.sbatch`:
- Around line 17-26: The srun command is getting a duplicate --export=ALL
because self.gen_srun_prefix() already includes that flag; remove the explicit
'--export=ALL' token appended on the line that follows the call to
self.gen_srun_prefix() in the Slurm command generation method (the code that
calls self.gen_srun_prefix() and then adds '--export=ALL'). Update the method in
src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py to rely on
self.gen_srun_prefix() only (do not append another '--export=ALL') so the
generated srun arguments are not duplicated.
- Line 70: The `_get_toml_args` implementation is wrapping TOML values with
unescaped double quotes which breaks JSON arguments (e.g. genai_perf extra
inputs); update `_get_toml_args` in slurm_command_gen_strategy.py to detect
JSON-like values (strings that start with "{" and end with "}") and either wrap
them with single quotes instead of double quotes or escape inner double quotes
before surrounding with double quotes so the produced flag becomes
--genai_perf-extra-inputs '{"temperature": 0.7, "max_tokens": 128}' (or an
escaped equivalent); ensure other non-JSON values keep the original quoting
behavior.
In `@tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py`:
- Around line 109-116: The test fails because test_container_mounts expects an
hf_home mount derived from strategy.system.hf_home_path while
AIDynamoCmdArgs.hf_home_path in the fixture is None and _container_mounts
currently only adds a mount when td.cmd_args.hf_home_path is truthy; update
_container_mounts (in slurm_command_gen_strategy.py) to use
td.cmd_args.hf_home_path if provided, otherwise fall back to
strategy.system.hf_home_path when building the hf_home mount so the test’s
expected mount is produced.
In `@tests/workloads/ai_dynamo/test_report_gen_strategy.py`:
- Around line 84-89: The test uses a trivial assert True and doesn't verify
behavior of AIDynamoReportGenerationStrategy.generate_report; update
test_ai_dynamo_generate_report to assert a concrete postcondition: call
strategy.generate_report() and then verify either (a) no unexpected files were
created by checking the test temp directory (use pytest tmp_path or inspect the
working dir) or (b) expected state changes on the ai_dynamo_tr/TestRun object
(e.g., a report_generated flag, report_path, or metadata updated) or (c)
expected interactions with slurm_system (spy/monkeypatch methods and assert they
were called). Pick the side effect that matches your implementation (inspect
AIDynamoReportGenerationStrategy.generate_report, TestRun fields, and
SlurmSystem methods) and replace assert True with an assertion that checks that
side effect.
- Around line 115-117: The test writes an empty CSV to the wrong filename (uses
genai_perf-report.csv with a dash) so it doesn't exercise the error path in
strategy.get_metric which reads genai_perf_report.csv (underscore); change the
write call in tests/workloads/ai_dynamo/test_report_gen_strategy.py to write to
(ai_dynamo_tr.output_path / "genai_perf_report.csv") before asserting
strategy.get_metric("invalid-metric") == METRIC_ERROR so the empty-file
error-handling in report_generation_strategy.py and the METRIC_ERROR check are
actually exercised.
---
Outside diff comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 361-371: The code builds cp_logs_cmd as a single string and calls
subprocess.run with shell=True; change this to pass an argument list to
subprocess.run to avoid shell=True: construct a list like ["kubectl","cp",
f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}",
str(job.test_run.output_path / "genai-perf")] and call subprocess.run with that
list (remove shell=True) while keeping capture_output=True and text=True; update
the debug log to show the list or join it for readability and keep logging the
returncode/stdout/stderr from the subprocess result.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 120-129: Remove the dead empty subclasses PrefillWorkerArgs and
DecodeWorkerArgs from src/cloudai/workloads/ai_dynamo/ai_dynamo.py: delete their
class definitions (they only subclass WorkerBaseArgs and contain no members) and
remove them from any module exports or __all__ if present; ensure any external
references are migrated to the canonical types (WorkerConfig or WorkerBaseArgs)
and run tests/type-check to confirm nothing imports these names anymore.
| name = "dynamo_sglang_kvbm" | ||
|
|
||
| [[Tests]] | ||
| id = "sglang-Qwen3-0.6B" | ||
| test_name = "sglang-Qwen3-0.6B" | ||
| time_limit = "00:20:00" | ||
|
|
||
| extra_container_mounts = ["/run/udev:/run/udev", "/tmp:/tmp"] | ||
|
|
||
| [Tests.cmd_args] | ||
| num_nodes = 2 # 1 prefill node + 1 decode node | ||
| workloads = "genai_perf.sh" | ||
|
|
||
| [Tests.cmd_args.dynamo] | ||
| model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" #Qwen/Qwen3-0.6B" |
There was a problem hiding this comment.
Mismatches between scenario name, test name, and model.
Three inconsistencies:
name = "dynamo_sglang_kvbm"(Line 17) references "kvbm", but no KVBM configuration is present.test_name = "sglang-Qwen3-0.6B"(Line 21) references Qwen3-0.6B, but the actual model on Line 31 is"deepseek-ai/DeepSeek-R1-Distill-Llama-8B".
These mismatches will confuse users when selecting and interpreting test scenarios.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml` around lines 17
- 31, Scenario metadata fields are inconsistent: the top-level name
("dynamo_sglang_kvbm"), the test entry test_name ("sglang-Qwen3-0.6B"), and the
model under [Tests.cmd_args.dynamo] ("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
must match; update either the model to Qwen3-0.6B to match test_name, or rename
test_name and/or the top-level name to reflect DeepSeek-Llama-8B and remove the
"kvbm" suffix if KVBM is not used so that name, test_name, and the model string
are consistent.
tests/ref_data/ai-dynamo.sbatch
Outdated
| srun \ | ||
| --export=ALL \ | ||
| --mpi=pmix \ | ||
| -N2 \ | ||
| --container-image=nvcr.io/nvidia/ai-dynamo:24.09 \ | ||
| --container-mounts=__OUTPUT_DIR__/output:/cloudai_run_results,__OUTPUT_DIR__/install:/cloudai_install,__OUTPUT_DIR__/output,__INSTALL_DIR__:__INSTALL_DIR__,__INSTALL_DIR__/huggingface:/root/.cache/huggingface,__CLOUDAI_DIR__/src/cloudai/workloads/ai_dynamo/ai_dynamo.sh:__CLOUDAI_DIR__/src/cloudai/workloads/ai_dynamo/ai_dynamo.sh \ | ||
| --container-mounts=__OUTPUT_DIR__/output:/cloudai_run_results,__OUTPUT_DIR__/install:/cloudai_install,__OUTPUT_DIR__/output,__INSTALL_DIR__:/git/dynamo,__INSTALL_DIR__/LMCache__ab8530993992db873869ba882320953582d94309:/git/LMCache \ | ||
| --nodes=2 \ | ||
| --ntasks=2 \ | ||
| --ntasks-per-node=1 \ | ||
| --export=ALL,DYNAMO_FATAL_ERROR_FILE=fatal_error.marker \ | ||
| --export=ALL \ |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the ai-dynamo.sbatch file to confirm the duplicate --export=ALL
cat -n tests/ref_data/ai-dynamo.sbatch | head -35Repository: NVIDIA/cloudai
Length of output: 2351
🏁 Script executed:
#!/bin/bash
# Find code that generates sbatch files or handles --export flags
find . -type f -name "*.py" | head -20
rg -l "sbatch|--export" --type=py | head -15Repository: NVIDIA/cloudai
Length of output: 1525
🏁 Script executed:
#!/bin/bash
# Look for the code generator strategy mentioned in the comment
rg "export=ALL" --type=py -B5 -A5Repository: NVIDIA/cloudai
Length of output: 25166
🏁 Script executed:
cat -n src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.pyRepository: NVIDIA/cloudai
Length of output: 11653
🏁 Script executed:
# Look at base class to understand method interactions
head -100 src/cloudai/systems/slurm/slurm_command_gen_strategy.pyRepository: NVIDIA/cloudai
Length of output: 3938
🏁 Script executed:
grep -n "gen_srun_prefix" src/cloudai/systems/slurm/slurm_command_gen_strategy.py -A10 | head -20Repository: NVIDIA/cloudai
Length of output: 1278
🏁 Script executed:
# Check if there are any tests that might be expecting the duplicate
rg "export=ALL.*export=ALL" --type pyRepository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
# Check git history to see if this was intentional or accidental
cd src/cloudai/workloads/ai_dynamo && git log -p --all -S '--export=ALL' -- slurm_command_gen_strategy.py 2>/dev/null | head -100Repository: NVIDIA/cloudai
Length of output: 40
Remove duplicate --export=ALL flag from ai_dynamo srun command generation.
Line 137 in src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py adds --export=ALL explicitly, but this flag is already included in the result of self.gen_srun_prefix() called on line 130. Remove the duplicate from line 137.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/ref_data/ai-dynamo.sbatch` around lines 17 - 26, The srun command is
getting a duplicate --export=ALL because self.gen_srun_prefix() already includes
that flag; remove the explicit '--export=ALL' token appended on the line that
follows the call to self.gen_srun_prefix() in the Slurm command generation
method (the code that calls self.gen_srun_prefix() and then adds
'--export=ALL'). Update the method in
src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py to rely on
self.gen_srun_prefix() only (do not append another '--export=ALL') so the
generated srun arguments are not duplicated.
| --genai_perf-script "genai_perf.sh" \ | ||
| --genai_perf-report-name "genai_perf_report.csv" \ | ||
| --genai_perf-streaming "True" \ | ||
| --genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}" \ |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's check the actual file content
echo "=== Checking line 70 of the file ==="
sed -n '65,75p' tests/ref_data/ai-dynamo.sbatch
echo -e "\n=== Searching for extra.inputs / extra_inputs command generation ==="
rg -n 'extra.?inputs' --type=py -C5 -iRepository: NVIDIA/cloudai
Length of output: 4219
🏁 Script executed:
#!/bin/bash
# Find ai-dynamo command generation code
echo "=== Searching for ai-dynamo command generation strategy ==="
find . -name "*ai*dynamo*" -type f | grep -E '\.(py|sh)$'
echo -e "\n=== Searching for GenAIPerf command building code ==="
rg -n 'genai_perf.*command|GenAIPerf' --type=py -B2 -A5 | head -100
echo -e "\n=== Searching for extra-inputs parameter handling ==="
rg -n 'extra-inputs|extra_inputs' --type=py -B3 -A3Repository: NVIDIA/cloudai
Length of output: 9835
🏁 Script executed:
#!/bin/bash
# Find AI Dynamo command generation strategy
echo "=== Finding AI Dynamo SLURM command generation ==="
find . -path "./src/cloudai/workloads/ai_dynamo*" -name "*.py" -type f | xargs ls -la
echo -e "\n=== Searching for slurm_command_gen_strategy in ai_dynamo ==="
find . -path "*/ai_dynamo/*" -name "*slurm*command*" -type f
echo -e "\n=== Looking at ai_dynamo.py for command generation ==="
grep -n "def.*generate\|extra.inputs\|to_command" src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 828
🏁 Script executed:
#!/bin/bash
# Read the AI Dynamo SLURM command generation strategy
echo "=== AI Dynamo SLURM Command Generation Strategy ==="
cat -n src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.pyRepository: NVIDIA/cloudai
Length of output: 11705
🏁 Script executed:
#!/bin/bash
# Check the GenAIPerf class definition
echo "=== GenAIPerf class in ai_dynamo.py ==="
sed -n '215,240p' src/cloudai/workloads/ai_dynamo/ai_dynamo.py
echo -e "\n=== Test to see what happens with the quoting ==="
python3 << 'EOF'
# Simulate what the _get_toml_args method does with the extra-inputs value
value = '{"temperature": 0.7, "max_tokens": 128}'
prefix = "--genai_perf-"
k = "extra-inputs"
# Current (buggy) behavior
buggy = f'{prefix}{k} "{value}"'
print("Current (buggy) output:")
print(f" {buggy}")
print()
# This is what gets written to sbatch and then executed
print("When bash tries to parse this line:")
print(f" srun ... {buggy} \\")
print()
print("Bash will see:")
print(' --genai_perf-extra-inputs "{\"temperature\": 0.7, \"max_tokens\": 128}"')
print()
print("The issue: inner double quotes break the quoted string")
EOF
echo -e "\n=== Checking if shell parsing would fail ==="
bash -n -c 'echo --genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}"' 2>&1 || echo "Shell syntax error confirmed!"Repository: NVIDIA/cloudai
Length of output: 1433
🏁 Script executed:
#!/bin/bash
# More direct shell syntax test
echo "=== Direct shell syntax test ==="
# Test 1: Try to execute the actual problematic line format
bash -c 'set -x; echo --genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}"' 2>&1 | head -20
echo ""
echo "=== Test if the sbatch line itself has valid syntax ==="
# Extract and test the exact line from sbatch
bash -n <<'TESTSCRIPT'
--genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}" \
TESTSCRIPT
echo "Exit code: $?"Repository: NVIDIA/cloudai
Length of output: 293
🏁 Script executed:
#!/bin/bash
# Let's test what actually gets passed to the command
echo "=== Test what argument value is actually received ==="
# Create a test script that prints its arguments
cat > /tmp/test_args.sh << 'EOF'
#!/bin/bash
echo "Argument count: $#"
for i in "$@"; do
echo " Arg: [$i]"
done
EOF
chmod +x /tmp/test_args.sh
echo "Test 1: With double quotes containing unescaped inner quotes (current sbatch format)"
bash -c '/tmp/test_args.sh --genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}"'
echo ""
echo "Test 2: With single quotes (what test_acceptance.py uses)"
bash -c '/tmp/test_args.sh --genai_perf-extra-inputs '"'"'{"temperature": 0.7, "max_tokens": 128}'"'"''
echo ""
echo "Test 3: With properly escaped double quotes"
bash -c '/tmp/test_args.sh --genai_perf-extra-inputs "{\"temperature\": 0.7, \"max_tokens\": 128}"'Repository: NVIDIA/cloudai
Length of output: 602
Fix JSON parameter quoting in genai_perf command generation.
The --genai_perf-extra-inputs value on line 70 of the sbatch is shell-unsafe. When bash parses --genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}", the inner double quotes break the quoted string and the JSON is corrupted to {temperature: 0.7, max_tokens: 128} (quotes stripped, content mangled).
The bug is in src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py at lines 61 and 63 in the _get_toml_args method, which wraps values in unescaped double quotes. For JSON values, use single quotes instead: --genai_perf-extra-inputs '{"temperature": 0.7, "max_tokens": 128}' or escape the inner quotes: --genai_perf-extra-inputs "{\"temperature\": 0.7, \"max_tokens\": 128}".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/ref_data/ai-dynamo.sbatch` at line 70, The `_get_toml_args`
implementation is wrapping TOML values with unescaped double quotes which breaks
JSON arguments (e.g. genai_perf extra inputs); update `_get_toml_args` in
slurm_command_gen_strategy.py to detect JSON-like values (strings that start
with "{" and end with "}") and either wrap them with single quotes instead of
double quotes or escape inner double quotes before surrounding with double
quotes so the produced flag becomes --genai_perf-extra-inputs '{"temperature":
0.7, "max_tokens": 128}' (or an escaped equivalent); ensure other non-JSON
values keep the original quoting behavior.
| def test_container_mounts(strategy: AIDynamoSlurmCommandGenStrategy, test_run: TestRun) -> None: | ||
| mounts = strategy._container_mounts() | ||
| td = cast(AIDynamoTestDefinition, test_run.test) | ||
| dynamo_repo_path = td.dynamo_repo.installed_path | ||
| assert dynamo_repo_path is not None, "dynamo_repo_path should be set in the test fixture" | ||
|
|
||
| # _container_mounts returns custom mounts including scripts and HF home (git repos are handled by base class) | ||
| assert mounts == [ | ||
| f"{dynamo_repo_path!s}:{dynamo_repo_path!s}", | ||
| f"{strategy.system.hf_home_path.absolute()!s}:{td.cmd_args.huggingface_home_container_path!s}", | ||
| f"{td.script.installed_path.absolute()!s}:{td.script.installed_path.absolute()!s}", | ||
| f"{strategy.system.hf_home_path.absolute()!s}:{td.cmd_args.dynamo.workspace_path}/hf_home", | ||
| ] |
There was a problem hiding this comment.
Pipeline failure: test_container_mounts expects hf_home_path mount but cmd_args fixture leaves it None.
The cmd_args fixture (Line 38) does not set hf_home_path on AIDynamoCmdArgs (defaults to None). Meanwhile, _container_mounts() in slurm_command_gen_strategy.py Line 36 only adds the mount when td.cmd_args.hf_home_path is truthy. The test assertion at Line 115 expects a mount derived from strategy.system.hf_home_path, which is a different source.
Either the fixture needs to supply hf_home_path, or the implementation should fall back to system.hf_home_path.
Option A: Fix the fixture to supply hf_home_path
def cmd_args() -> AIDynamoCmdArgs:
return AIDynamoCmdArgs(
docker_image_url="url",
+ hf_home_path="/some/hf/home",
dynamo=AIDynamoArgs(Option B: Update the test expectation to match the new behavior (empty mounts)
- # _container_mounts returns custom mounts including scripts and HF home (git repos are handled by base class)
- assert mounts == [
- f"{strategy.system.hf_home_path.absolute()!s}:{td.cmd_args.dynamo.workspace_path}/hf_home",
- ]
+ # hf_home_path is not set in cmd_args, so no custom mounts are expected
+ assert mounts == []📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_container_mounts(strategy: AIDynamoSlurmCommandGenStrategy, test_run: TestRun) -> None: | |
| mounts = strategy._container_mounts() | |
| td = cast(AIDynamoTestDefinition, test_run.test) | |
| dynamo_repo_path = td.dynamo_repo.installed_path | |
| assert dynamo_repo_path is not None, "dynamo_repo_path should be set in the test fixture" | |
| # _container_mounts returns custom mounts including scripts and HF home (git repos are handled by base class) | |
| assert mounts == [ | |
| f"{dynamo_repo_path!s}:{dynamo_repo_path!s}", | |
| f"{strategy.system.hf_home_path.absolute()!s}:{td.cmd_args.huggingface_home_container_path!s}", | |
| f"{td.script.installed_path.absolute()!s}:{td.script.installed_path.absolute()!s}", | |
| f"{strategy.system.hf_home_path.absolute()!s}:{td.cmd_args.dynamo.workspace_path}/hf_home", | |
| ] | |
| def test_container_mounts(strategy: AIDynamoSlurmCommandGenStrategy, test_run: TestRun) -> None: | |
| mounts = strategy._container_mounts() | |
| td = cast(AIDynamoTestDefinition, test_run.test) | |
| # hf_home_path is not set in cmd_args, so no custom mounts are expected | |
| assert mounts == [] |
🧰 Tools
🪛 GitHub Actions: CI
[error] 110-114: Test container mounts failed. Expected a non-empty container mount for hf_home_path in AIDynamoSlurmCommandGenStrategy._container_mounts, but the method returned []. Also see log indicating hf_home_path: None and storage_cache_dir: None.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py` around lines
109 - 116, The test fails because test_container_mounts expects an hf_home mount
derived from strategy.system.hf_home_path while AIDynamoCmdArgs.hf_home_path in
the fixture is None and _container_mounts currently only adds a mount when
td.cmd_args.hf_home_path is truthy; update _container_mounts (in
slurm_command_gen_strategy.py) to use td.cmd_args.hf_home_path if provided,
otherwise fall back to strategy.system.hf_home_path when building the hf_home
mount so the test’s expected mount is produced.
| def test_ai_dynamo_generate_report(slurm_system: SlurmSystem, ai_dynamo_tr: TestRun, csv_content: str) -> None: | ||
| strategy = AIDynamoReportGenerationStrategy(slurm_system, ai_dynamo_tr) | ||
| # The new implementation does not generate a report file | ||
| strategy.generate_report() | ||
|
|
||
| report_file = ai_dynamo_tr.output_path / "report.csv" | ||
| assert report_file.is_file(), "Report CSV was not generated." | ||
|
|
||
| report_content = report_file.read_text() | ||
|
|
||
| def split_into_sections(content: str) -> list[str]: | ||
| sections = content.split("\n\n") | ||
| return [s.strip() for s in sections if s.strip()] | ||
|
|
||
| def normalize_csv_section(section: str) -> str: | ||
| return section.replace('"', "").strip() | ||
|
|
||
| actual_sections = [normalize_csv_section(s) for s in split_into_sections(report_content)] | ||
| expected_sections = [normalize_csv_section(s) for s in split_into_sections(csv_content)] | ||
|
|
||
| # First section should match after normalization | ||
| assert actual_sections[0] == expected_sections[0], "First section (metrics) does not match" | ||
|
|
||
| # Second section should have our additional metric | ||
| second_section_lines = actual_sections[1].split("\n") | ||
| assert second_section_lines[0] == "Metric,Value", "Second section header does not match" | ||
| assert second_section_lines[1] == "Output Token Throughput (tokens/sec),24", "Throughput line does not match" | ||
| assert second_section_lines[2] == "Overall Output Tokens per Second per GPU,1.0", "Added metric line is incorrect" | ||
| assert second_section_lines[3:] == ["Request Throughput (per sec),1.23", "Request Count (count),40.00"], ( | ||
| "Remaining lines do not match" | ||
| ) | ||
|
|
||
| # Third section (GPU metrics) should be identical | ||
| assert actual_sections[2] == expected_sections[2], "Third section (GPU metrics) does not match" | ||
| # Just verify the method runs without error | ||
| assert True |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Weak test: assert True provides no behavioral verification.
After calling generate_report(), the test only asserts True, which is trivially true. Consider verifying a concrete postcondition (e.g., that no unexpected files are created, or that specific state is set) rather than relying solely on "no exception raised."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/workloads/ai_dynamo/test_report_gen_strategy.py` around lines 84 - 89,
The test uses a trivial assert True and doesn't verify behavior of
AIDynamoReportGenerationStrategy.generate_report; update
test_ai_dynamo_generate_report to assert a concrete postcondition: call
strategy.generate_report() and then verify either (a) no unexpected files were
created by checking the test temp directory (use pytest tmp_path or inspect the
working dir) or (b) expected state changes on the ai_dynamo_tr/TestRun object
(e.g., a report_generated flag, report_path, or metadata updated) or (c)
expected interactions with slurm_system (spy/monkeypatch methods and assert they
were called). Pick the side effect that matches your implementation (inspect
AIDynamoReportGenerationStrategy.generate_report, TestRun fields, and
SlurmSystem methods) and replace assert True with an assertion that checks that
side effect.
| # Empty the CSV file to test error handling | ||
| (ai_dynamo_tr.output_path / "genai_perf-report.csv").write_text("") | ||
| assert strategy.get_metric("invalid-metric") == METRIC_ERROR |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
rg -n 'genai_perf.*report' --type=py -A 3 -B 3Repository: NVIDIA/cloudai
Length of output: 2159
🏁 Script executed:
# Also check the specific test file to see the exact filenames
cat -n tests/workloads/ai_dynamo/test_report_gen_strategy.py | sed -n '60,120p'Repository: NVIDIA/cloudai
Length of output: 3028
🏁 Script executed:
# Find the strategy implementation to see which filename it reads
find . -type f -name "*.py" -path "*/ai_dynamo/*" | grep -v test | head -20Repository: NVIDIA/cloudai
Length of output: 344
🏁 Script executed:
cat -n src/cloudai/workloads/ai_dynamo/report_generation_strategy.py | head -150Repository: NVIDIA/cloudai
Length of output: 2972
🏁 Script executed:
# Search for get_metric implementation and CSV filename references
rg -n 'get_metric|genai_perf' src/cloudai/workloads/ai_dynamo/report_generation_strategy.py -A 3 -B 1Repository: NVIDIA/cloudai
Length of output: 273
Fix filename in line 116: use genai_perf_report.csv (underscore) not genai_perf-report.csv (dash).
The strategy's get_metric() method reads from genai_perf_report.csv (line 55 of report_generation_strategy.py), but line 116 writes an empty file to genai_perf-report.csv with a dash. This filename mismatch means the empty-file write targets a different file and the test never actually tests the empty-CSV error-handling path. Line 117 passes only because "invalid-metric" doesn't exist in the original CSV; the test is ineffective.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/workloads/ai_dynamo/test_report_gen_strategy.py` around lines 115 -
117, The test writes an empty CSV to the wrong filename (uses
genai_perf-report.csv with a dash) so it doesn't exercise the error path in
strategy.get_metric which reads genai_perf_report.csv (underscore); change the
write call in tests/workloads/ai_dynamo/test_report_gen_strategy.py to write to
(ai_dynamo_tr.output_path / "genai_perf_report.csv") before asserting
strategy.get_metric("invalid-metric") == METRIC_ERROR so the empty-file
error-handling in report_generation_strategy.py and the METRIC_ERROR check are
actually exercised.
| decode_worker: WorkerConfig = Field(default_factory=WorkerConfig) | ||
| decode_cmd: str = Field( | ||
| default="python3 -m dynamo.vllm", | ||
| serialization_alias="decode-cmd", | ||
| validation_alias=AliasChoices("decode-cmd", "decode_cmd"), | ||
| ) | ||
| prefill_worker: PrefillWorkerArgs | None = None | ||
| prefill_worker: WorkerConfig = Field(default_factory=WorkerConfig) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find all Python files that might contain prefill_worker usages
find . -name "*.py" -type f | head -20Repository: NVIDIA/cloudai
Length of output: 766
🏁 Script executed:
# Search for prefill_worker usages in the codebase
rg -n 'prefill_worker' --type=py -B 2 -A 2Repository: NVIDIA/cloudai
Length of output: 12303
🏁 Script executed:
# Find test_gen_prefill test
rg -n 'test_gen_prefill|def test.*prefill' --type=py -B 1 -A 10Repository: NVIDIA/cloudai
Length of output: 9665
🏁 Script executed:
# Look at test fixture setup to understand how agg/disagg is configured
rg -n '@pytest.fixture|def.*fixture|"agg"|"disagg"' tests/workloads/ai_dynamo/test_json_gen_strategy_kubernetes.py -B 2 -A 5Repository: NVIDIA/cloudai
Length of output: 873
🏁 Script executed:
# Check the constraint_check method in detail
sed -n '347,365p' src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 1112
🏁 Script executed:
# Find how test_gen_prefill fixture is set up
rg -n 'json_gen|AIDynamoKubernetesJsonGenStrategy' tests/workloads/ai_dynamo/test_json_gen_strategy_kubernetes.py -B 3 -A 3 | head -50Repository: NVIDIA/cloudai
Length of output: 1898
🏁 Script executed:
# Check conftest to see test fixtures
cat tests/workloads/ai_dynamo/conftest.pyRepository: NVIDIA/cloudai
Length of output: 127
🏁 Script executed:
# Find conftest files
find tests -name conftest.py -type fRepository: NVIDIA/cloudai
Length of output: 75
🏁 Script executed:
# Look at the conftest in the ai_dynamo test directory
cat tests/workloads/ai_dynamo/conftest.py 2>/dev/null || find tests/workloads -name conftest.py -type fRepository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
# Check fixture definition more carefully - dynamo fixture shows the agg mode doesn't set prefill_worker
# Let's see what happens in the "agg" case
rg -n 'if request.param|@pytest.fixture.*params' tests/workloads/ai_dynamo/ -B 5 -A 15Repository: NVIDIA/cloudai
Length of output: 3966
prefill_worker default change breaks aggregated/disaggregated mode distinction.
The field now always defaults to WorkerConfig() instead of None/Optional, making it always truthy. This breaks all downstream logic that uses truthiness to distinguish modes:
- Line 135 in
test_gen_prefillexpects falsyprefill_workerfor aggregated mode; test will fail - Line 356 in
constraint_check:prefill_nodes = prefill_worker.num_nodes if prefill_worker else 0will incorrectly report 1 node instead of 0 for agg - Lines 115, 185 in
slurm_command_gen_strategy.pyand lines 53, 66, 99 inkubernetes_json_gen_strategy.pyall haveif prefill_worker:checks that must now distinguish between "not configured" vs "configured with defaults"
Aggregated mode must be detectable as falsy; use Optional[WorkerConfig] with default=None instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py` around lines 159 - 165, The
prefill_worker field in ai_dynamo.py must be Optional so aggregated mode remains
falsy; change the declaration of prefill_worker from WorkerConfig =
Field(default_factory=WorkerConfig) to Optional[WorkerConfig] with default None
(use Field(default=None) and add typing.Optional import) so downstream checks
like constraint_check (prefill_nodes = prefill_worker.num_nodes if
prefill_worker else 0), test_gen_prefill, and the if prefill_worker: branches in
slurm_command_gen_strategy.py and kubernetes_json_gen_strategy.py continue to
work correctly as configured vs not configured.
| def constraint_check(self, tr: TestRun) -> bool: | ||
| prefill_worker = tr.test.cmd_args.dynamo.prefill_worker | ||
| decode_worker = tr.test.cmd_args.dynamo.decode_worker | ||
|
|
||
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | ||
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | ||
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | ||
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 | ||
| prefill_nodes = prefill_worker.num_nodes if prefill_worker else 0 | ||
| decode_nodes = decode_worker.num_nodes if decode_worker else 1 | ||
|
|
||
| if self.constraints.prefill_tp_le_decode_tp and prefill_tp > decode_tp: | ||
| logging.info("constraint_check failed for: prefill_tp_le_decode_tp") | ||
| return False | ||
| logging.info("constraint_check passed for: prefill_tp_le_decode_tp") | ||
|
|
||
| gpus_per_node = tr.test.cmd_args.gpus_per_node | ||
| if self.constraints.tp_times_pp_le_gpus_per_node and ( | ||
| prefill_tp * prefill_pp > gpus_per_node or decode_tp * decode_pp > gpus_per_node |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find WorkerBaseArgs definition
rg -n 'class WorkerBaseArgs' --type=py -A 15Repository: NVIDIA/cloudai
Length of output: 1480
🏁 Script executed:
#!/bin/bash
# Find where constraint_check is called
rg -n 'constraint_check' --type=py -B 5 -A 2Repository: NVIDIA/cloudai
Length of output: 10397
🏁 Script executed:
#!/bin/bash
# Look at the ai_dynamo.py file around constraint_check method
fd -e py -path "*/ai_dynamo/ai_dynamo.py" -exec head -400 {} \; | tail -100Repository: NVIDIA/cloudai
Length of output: 226
🏁 Script executed:
#!/bin/bash
# Find WorkerConfig definition and factory
rg -n 'class WorkerConfig|default_factory' --type=py -B 2 -A 5 | head -60Repository: NVIDIA/cloudai
Length of output: 4871
🏁 Script executed:
#!/bin/bash
# Find AIDynamoCmdArgs definition
rg -n 'class AIDynamoCmdArgs' --type=py -A 30Repository: NVIDIA/cloudai
Length of output: 2785
🏁 Script executed:
#!/bin/bash
# Look for list unrolling or field resolution logic
rg -n 'unroll|resolve|list.*int' --type=py | grep -i "ai_dynamo\|dynamo" | head -20Repository: NVIDIA/cloudai
Length of output: 625
🏁 Script executed:
#!/bin/bash
# Check test files for how ai_dynamo tests are created
find . -name "*test*ai_dynamo*" -o -name "*ai_dynamo*test*" | head -10Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Look at how cmd_args are populated/passed to constraint_check
rg -n 'apply_params_set' --type=py -B 2 -A 5 | head -50Repository: NVIDIA/cloudai
Length of output: 3893
🏁 Script executed:
#!/bin/bash
# Find test files for ai_dynamo that might test constraint_check
find . -type f -name "*.py" -path "*test*" | xargs grep -l "ai_dynamo\|AIDynamo" 2>/dev/nullRepository: NVIDIA/cloudai
Length of output: 346
🏁 Script executed:
#!/bin/bash
# Look at the full WorkerBaseArgs definition to see all field defaults
sed -n '74,130p' src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 1885
🏁 Script executed:
#!/bin/bash
# Check how apply_params_set works - look at TestRun class
rg -n 'def apply_params_set' --type=py -A 20Repository: NVIDIA/cloudai
Length of output: 1621
🏁 Script executed:
#!/antml:bash
# Look for actual test cases or examples where AIDynamoCmdArgs is instantiated
rg -n 'AIDynamoCmdArgs\|AIDynamoTestDefinition' --type=py -B 2 -A 5 | head -60Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Look at ai_dynamo tests to see how constraint_check is tested or called
cat tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py | head -100Repository: NVIDIA/cloudai
Length of output: 3232
🏁 Script executed:
#!/bin/bash
# Look for where TP/PP fields might be set in actual test cases
rg -n 'tensor_parallel_size|pipeline_parallel_size' --type=py tests/workloads/ai_dynamo/ -B 2 -A 2Repository: NVIDIA/cloudai
Length of output: 1169
🏁 Script executed:
#!/bin/bash
# Check what the actual slurm strategy does - does it validate these fields?
cat src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py | head -150Repository: NVIDIA/cloudai
Length of output: 6282
🏁 Script executed:
#!/bin/bash
# Check if there's any validation enforcing non-None values in WorkerBaseArgs or WorkerConfig
rg -n 'validate|field_validator|model_validator' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -A 5Repository: NVIDIA/cloudai
Length of output: 856
🏁 Script executed:
#!/bin/bash
# Check where AIDynamoArgs is initialized and if there's any default handling
sed -n '130,180p' src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 1671
🏁 Script executed:
#!/antml:bash
# Check the entire constraint_check method and see the full context
sed -n '348,380p' src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 1828
🏁 Script executed:
#!/bin/bash
# Check if there's any validation that enforces non-None TP/PP values
rg -n 'field_validator|model_validator' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -B 2 -A 10 | grep -A 10 "tensor_parallel_size\|pipeline_parallel_size"Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Look for any defaults or post-init logic that sets TP/PP fields
rg -n '__post_init__|model_post_init' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -A 10Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/antml:bash
# Test if None values actually make it through - create a minimal test case
python3 << 'EOF'
# Test what happens with None > None
try:
result = None > None
print(f"None > None = {result}")
except TypeError as e:
print(f"TypeError: {e}")
# Test what happens with None * int
try:
result = None * 8
print(f"None * 8 = {result}")
except TypeError as e:
print(f"TypeError: {e}")
# Test what happens with (None or 1)
result = None or 1
print(f"(None or 1) = {result}")
EOFRepository: NVIDIA/cloudai
Length of output: 216
🏁 Script executed:
#!/bin/bash
# Look for actual test scenario files to see how real tests are configured
find tests -name "*.yaml" -o -name "*.yml" | head -10Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Check if there's any example or default configuration that shows typical usage
find . -name "*.yaml" -o -name "*.yml" | xargs grep -l "ai_dynamo\|ai-dynamo" 2>/dev/nullRepository: NVIDIA/cloudai
Length of output: 42
🏁 Script executed:
#!/antml:bash
# Check if constraint_check is actually called anywhere in the codebase and if there's guard logic
rg -n 'constraint_check.*ai_dynamo\|AIDynamo.*constraint_check' --type=py -B 5 -A 5Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/bin/bash
# Check if there's any loader or parser that enforces tensor_parallel_size to be non-None
rg -n 'tensor_parallel_size\|pipeline_parallel_size' src/cloudai/workloads/ai_dynamo/ --type=py | grep -v "test" | head -20Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/antml:bash
# See if WorkerBaseArgs or WorkerConfig have any post-init validation
cat src/cloudai/workloads/ai_dynamo/ai_dynamo.py | grep -A 30 "class WorkerBaseArgs"Repository: NVIDIA/cloudai
Length of output: 1249
🏁 Script executed:
#!/bin/bash
# Check if there's any code that would populate or validate tensor_parallel_size before constraint_check
rg -n 'tensor_parallel_size.*=' src/cloudai/workloads/ai_dynamo/ --type=py -B 2 -A 2 | head -40Repository: NVIDIA/cloudai
Length of output: 526
🏁 Script executed:
#!/bin/bash
# Check if there are any tests that exercise constraint_check with default/None TP/PP values
rg -n 'def test.*constraint' tests/workloads/ai_dynamo/ --type=py -A 20 | head -100Repository: NVIDIA/cloudai
Length of output: 40
🏁 Script executed:
#!/antml:bash
# Check if there are integration tests that would catch this issue
cat tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.pyRepository: NVIDIA/cloudai
Length of output: 4700
🏁 Script executed:
#!/antml:bash
# Check if there's any validation enforcing TP/PP when WorkerConfig is created
rg -n '@field_validator|@model_validator' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -B 2 -A 15 | grep -A 15 "WorkerConfig\|WorkerBaseArgs"Repository: NVIDIA/cloudai
Length of output: 700
🏁 Script executed:
#!/antml:bash
# Check if there's any schema or validator that would enforce non-None TP/PP before constraint_check
python3 << 'EOF'
import sys
sys.path.insert(0, '/root')
# Try to create a WorkerConfig with defaults and see if it validates
try:
from cloudai.workloads.ai_dynamo.ai_dynamo import WorkerConfig, WorkerBaseArgs
# Create with defaults
base_args = WorkerBaseArgs()
worker = WorkerConfig(args=base_args)
print(f"WorkerConfig created successfully")
print(f"tensor_parallel_size: {worker.args.tensor_parallel_size}")
print(f"pipeline_parallel_size: {worker.args.pipeline_parallel_size}")
# Now test constraint_check logic
prefill_tp = worker.args.tensor_parallel_size if worker else 1
decode_tp = worker.args.tensor_parallel_size if worker else 1
print(f"prefill_tp: {prefill_tp}")
print(f"decode_tp: {decode_tp}")
# Try the comparison
try:
result = prefill_tp > decode_tp
print(f"Comparison result: {result}")
except TypeError as e:
print(f"TypeError during comparison: {e}")
except Exception as e:
print(f"Error: {type(e).__name__}: {e}")
import traceback
traceback.print_exc()
EOFRepository: NVIDIA/cloudai
Length of output: 248
constraint_check will raise TypeError when worker TP/PP fields are None.
WorkerConfig uses default_factory, so prefill_worker and decode_worker are always non-falsy objects. When tensor_parallel_size or pipeline_parallel_size fields are None (the defaults in WorkerBaseArgs), the comparisons at line 359 (None > None) and multiplications at line 366 (None * None) will raise TypeError.
🐛 Proposed fix: default None to safe values
- prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1
- decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1
- prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1
- decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1
+ prefill_tp = (prefill_worker.args.tensor_parallel_size or 1) if prefill_worker else 1
+ decode_tp = (decode_worker.args.tensor_parallel_size or 1) if decode_worker else 1
+ prefill_pp = (prefill_worker.args.pipeline_parallel_size or 1) if prefill_worker else 1
+ decode_pp = (decode_worker.args.pipeline_parallel_size or 1) if decode_worker else 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py` around lines 348 - 366, In
constraint_check, avoid TypeError when tensor_parallel_size or
pipeline_parallel_size are None by coercing those fields to safe defaults (e.g.,
1) before comparing or multiplying; update the reads for prefill_tp, decode_tp,
prefill_pp, decode_pp (derived from prefill_worker.args.tensor_parallel_size and
.pipeline_parallel_size, and decode_worker.args.*) to use a fallback like "value
if value is not None else 1" (or a common helper) so comparisons in
constraint_check and the tp*pp multiplications never operate on None.
| local -a args_arr=() | ||
| for key in "${!decode_args[@]}"; do | ||
| args_arr+=($key $(replace_placeholders "${decode_args[$key]}")) | ||
| done | ||
|
|
||
| log "Launching decode worker $i on GPUs $gpu_list (NIXL port: $nixl_port, KV event port: $kv_event_port, KVBM pub/ack: $kvbm_pub_port/$kvbm_ack_port)" | ||
| log "Decode cmd: ${decode_config["cmd"]} ${args_arr[*]} ${decode_config["extra-args"]}" | ||
| CUDA_VISIBLE_DEVICES=$gpu_list \ | ||
| VLLM_NIXL_SIDE_CHANNEL_HOST=$(hostname -I | awk '{print $1}') \ | ||
| VLLM_NIXL_SIDE_CHANNEL_PORT=$nixl_port \ | ||
| DYN_VLLM_KV_EVENT_PORT=$kv_event_port \ | ||
| ${dynamo_args["decode-cmd"]} \ | ||
| $(array_to_args decode_args) ${decode_args["--extra-args"]} > $log_file 2>&1 & | ||
| DYN_KVBM_LEADER_ZMQ_PUB_PORT=$kvbm_pub_port \ | ||
| DYN_KVBM_LEADER_ZMQ_ACK_PORT=$kvbm_ack_port \ | ||
| ${decode_config["cmd"]} \ | ||
| ${args_arr[@]} \ | ||
| ${decode_config["extra-args"]} > $log_file 2>&1 & |
There was a problem hiding this comment.
Unquoted array construction and expansion corrupts multi-word argument values.
Line 776: args_arr+=($key $(replace_placeholders ...)) undergoes word splitting, so a value like "genai-perf profile" becomes two separate array elements. Then at line 788, ${args_arr[@]} (unquoted, SC2068 error) re-splits again. Both the construction and the expansion should be quoted.
Similarly, ${decode_config["extra-args"]} and ${decode_config["cmd"]} at lines 787-789 are unquoted (SC2086), which will break on paths/commands with spaces.
Proposed fix
local -a args_arr=()
for key in "${!decode_args[@]}"; do
- args_arr+=($key $(replace_placeholders "${decode_args[$key]}"))
+ args_arr+=("$key" "$(replace_placeholders "${decode_args[$key]}")")
done
...
- ${decode_config["cmd"]} \
- ${args_arr[@]} \
- ${decode_config["extra-args"]} > $log_file 2>&1 &
+ "${decode_config["cmd"]}" \
+ "${args_arr[@]}" \
+ ${decode_config["extra-args"]} > "$log_file" 2>&1 &The same issue exists in the prefill launch path at lines 829-844.
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 776-776: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 776-776: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
[error] 788-788: Double quote array expansions to avoid re-splitting elements.
(SC2068)
[info] 789-789: Double quote to prevent globbing and word splitting.
(SC2086)
[info] 789-789: Double quote to prevent globbing and word splitting.
(SC2086)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh` around lines 774 - 789, The
array construction and expansions are unquoted which causes word-splitting and
corrupts multi-word arguments: change args_arr+=($key $(replace_placeholders
"${decode_args[$key]}")) to use quoted elements (e.g., args_arr+=("$key"
"$(replace_placeholders "${decode_args[$key]}")")) and always expand the array
as "${args_arr[@]}"; also quote command and extra-args when invoking the process
(use "${decode_config["cmd"]}" and "${decode_config["extra-args"]}" or split
extra-args into an array) so paths/commands with spaces aren't broken; apply the
same quoting fixes to the prefill launch path where args_arr,
replace_placeholders, decode_config["cmd"] and decode_config["extra-args"] are
used, and keep redirection to "$log_file" quoted as well.
| workloads = "genai_perf.sh" | ||
|
|
||
| [cmd_args.dynamo] | ||
| backend = "vllm" |
There was a problem hiding this comment.
backend set to "vllm" but should be "sglang" for sglang test config
| backend = "vllm" | |
| backend = "sglang" |
| # launch genai-perf | ||
| log "Launching genai-perf with args: $full_cmd" | ||
|
|
||
| eval "$full_cmd" |
There was a problem hiding this comment.
eval with user-controlled input creates command injection risk if genai_perf_args contains shell metacharacters
| local IFS=',' arr i | ||
| read -ra arr <<< "$list" | ||
| for i in "${!arr[@]}"; do | ||
| if [[ "${arr[$i]}" == "$name" || "${arr[$i]}" == *"$name"* || "$name" == *"${arr[$i]}"* ]]; then | ||
| if [[ "${arr[$i]}" == "$name" ]]; then | ||
| echo "$i"; return 0 | ||
| fi | ||
| done | ||
| echo "-1" | ||
| } |
There was a problem hiding this comment.
exact match logic changed from fuzzy matching (*"$name"*) to strict equality, may break existing node specifications that relied on substring matching
| if td.cmd_args.hf_home_path: | ||
| result.append(f"{td.cmd_args.hf_home_path}:{td.cmd_args.hf_home_path}") |
There was a problem hiding this comment.
mount maps hf_home_path to itself, but test expects system.hf_home_path mapped to workspace_path/hf_home (see test_command_gen_strategy_slurm.py:115)
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/systems/kubernetes/kubernetes_system.py (1)
362-371:⚠️ Potential issue | 🟠 Major
shell=Truewith string-joined command for result copy — same injection risk as line 309.
cp_logs_cmdis built via" ".join(...)and passed tosubprocess.run(..., shell=True). If any component (namespace, pod name, output path) contains shell metacharacters, this is exploitable. Use a list-based command to avoid shell interpretation.Proposed fix
- cp_logs_cmd = " ".join( - [ - "kubectl", - "cp", - f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", - str(job.test_run.output_path / "genai-perf"), - ] - ) - logging.debug(f"Copying genai-perf results with command: {cp_logs_cmd}") - p = subprocess.run(cp_logs_cmd, shell=True, capture_output=True, text=True) + cp_logs_cmd = [ + "kubectl", + "cp", + f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", + str(job.test_run.output_path / "genai-perf"), + ] + logging.debug(f"Copying genai-perf results with command: {cp_logs_cmd}") + p = subprocess.run(cp_logs_cmd, capture_output=True, text=True) logging.debug(f"Returned code {p.returncode}, stdout: {p.stdout}, stderr: {p.stderr}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/systems/kubernetes/kubernetes_system.py` around lines 362 - 371, The code builds cp_logs_cmd from components (cp_logs_cmd variable) and calls subprocess.run(..., shell=True) which risks shell injection if self.default_namespace, frontend_pod, genai_perf_results_path or job.test_run.output_path contain metacharacters; change to pass a list argv to subprocess.run (e.g., ["kubectl","cp", f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", str(job.test_run.output_path / "genai-perf")]) and remove shell=True, keep capture_output=True and text=True to preserve behavior and logging using logging.debug of the joined command for readability.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 316-320: The try/except around subprocess.run(kubectl_exec_cmd,
...) swallows failures for making the wrapper executable which can mask setup
errors; update the block in kubernetes_system.py (the subprocess.run call that
builds kubectl_exec_cmd and uses frontend_pod) to check result.returncode and,
on non-zero, log at warning/error and raise or return a failure, and change the
broad except Exception to either re-raise after logging or log at warning/error
level with the exception details so execution does not continue when chmod fails
(ensure the log includes kubectl_exec_cmd, frontend_pod and the
exception/result.returncode).
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 284-289: The validation loop for workloads uses
self.cmd_args.workloads.split(",") without trimming whitespace, causing
mismatches; update the parsing to strip each entry (e.g., when building
workloads and at the other call site referenced in the review) so entries like
"genai_perf.sh, other.sh" become ["genai_perf.sh","other.sh"]; modify the code
that constructs workloads (and the second call site mentioned) to apply .strip()
to each split token before comparing against
self.cmd_args.genai_perf.script.src.name and before passing to
validate_test_definition/was_run_successful.
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Around line 28-29: Remove the inline import of pandas inside
extract_metric_from_csv and instead use the project's lazy import helper by
referencing lazy.pd (from lazy_imports) so pandas is accessed lazily and
consistently; update the function to use lazy.pd for any DataFrame operations
and ensure the module imports the lazy_imports symbol (or uses its existing
import) rather than doing import pandas as pd inside extract_metric_from_csv.
---
Outside diff comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 362-371: The code builds cp_logs_cmd from components (cp_logs_cmd
variable) and calls subprocess.run(..., shell=True) which risks shell injection
if self.default_namespace, frontend_pod, genai_perf_results_path or
job.test_run.output_path contain metacharacters; change to pass a list argv to
subprocess.run (e.g., ["kubectl","cp",
f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}",
str(job.test_run.output_path / "genai-perf")]) and remove shell=True, keep
capture_output=True and text=True to preserve behavior and logging using
logging.debug of the joined command for readability.
---
Duplicate comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 322-332: The genai-perf extra_args block is incorrectly nested
under the guard for tdef.cmd_args.genai_perf.args so extra_args is skipped when
args is empty; update the code around genai_perf_cmd_parts,
tdef.cmd_args.genai_perf.args and extra_args so that you first append any
--key=value flags only if args is truthy (iterate
tdef.cmd_args.genai_perf.args.model_dump(...)), then de-indent the extra_args
handling so it runs regardless (check extra_args :=
tdef.cmd_args.genai_perf.extra_args, handle str vs list) and extend
genai_perf_cmd_parts appropriately.
- Around line 307-309: The current use of an f-string to build cp_wrapper_cmd
and subprocess.run(..., shell=True) in kubernetes_system.py (the cp_wrapper_cmd
variable and the subprocess.run call) introduces command-injection risk and
fails ruff; change to a list-style argv and remove shell=True — e.g., construct
args as ["kubectl","cp", wrapper_script_path,
f"{self.default_namespace}/{frontend_pod}:{pod_wrapper_path}"] (using the
existing variables wrapper_script_path, frontend_pod, pod_wrapper_path,
self.default_namespace) and call subprocess.run(args, capture_output=True,
text=True, check=True) so the command is executed without a shell and the ruff
formatting issues are resolved.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 264-268: The two methods success_marker() and failure_marker()
return fixed string literals; refactor by introducing class-level constants
(e.g., SUCCESS_MARKER and FAILURE_MARKER) on the AI Dynamo class and have
success_marker() and failure_marker() reference those constants (or convert the
methods to `@property` that returns the class attributes) so the marker values are
centralized and easily configurable/testable; update any usages to rely on the
constants if appropriate.
- Around line 309-317: The installables list currently has self.hf_model
commented out (a leftover debug artifact); update the installables construction
in the installables property/method to include self.hf_model (either by
uncommenting it or by conditionally appending it when not None) alongside
self.docker_image, self.script, self.genai_perf_script and *self.git_repos so
the HF model is packaged consistently.
- Around line 200-205: The repo definition uses an SSH URL which requires SSH
keys; update the GitRepo instantiation (symbol: repo, type: Optional[GitRepo])
to use the HTTPS clone URL (e.g. "https://github.com/LMCache/LMCache.git")
instead of "git@github.com:LMCache/LMCache.git", keep the existing commit and
mount_as values, and ensure any deployment/docs note that no SSH key
provisioning is required for the HTTPS URL.
- Around line 159-165: The fields decode_worker and prefill_worker in
ai_dynamo.py are always truthy because they use
Field(default_factory=WorkerConfig), so downstream guards like in
constraint_check and the Slurm/Kubernetes strategy code never hit the else
branch; change both fields to be Optional[WorkerConfig] with a default of None
(e.g., decode_worker: Optional[WorkerConfig] = Field(default=None, ... ) and
prefill_worker: Optional[WorkerConfig] = Field(default=None, ... )), keep any
serialization_alias/validation_alias settings (e.g., "decode-cmd") intact, and
run/update any type checks or mypy annotations in functions referencing
decode_worker/prefill_worker to handle the None case.
- Around line 349-380: The constraint_check method can crash when WorkerBaseArgs
fields are None; update constraint_check to treat tensor_parallel_size and
pipeline_parallel_size as 1 when they are None by checking the attributes on
prefill_worker.args and decode_worker.args (e.g., use value =
prefill_worker.args.tensor_parallel_size or 1), and similarly for
pipeline_parallel_size; ensure prefill_tp, decode_tp, prefill_pp, decode_pp are
integers before running comparisons and multiplications so constraint_check
(involving prefill_worker, decode_worker, tensor_parallel_size,
pipeline_parallel_size) never attempts None > None or None * None.
- Around line 332-346: The loop currently uses a falsy-check
workload_map.get(workload) and a bare assert report_name is not None; change the
existence check to an explicit membership test (if workload not in
workload_map:) and use workload_map[workload] only after confirming presence,
and replace the bare assert in the code that reads report_name with an explicit
conditional that handles the missing report_name (e.g., log an error via
logging.error with context including workload, set result = False and continue
or raise a ValueError if that is the desired failure mode) so both missing
workload and missing report_name are handled deterministically in the functions
using workload_map.get(workload) and report_name.
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Around line 48-49: The line doing benchmark_name, metric_name, metric_type =
metric.split(":") is unsafe because it will raise ValueError for strings with
not exactly two colons; update the logic around metric.split to use split(":",
2) to limit parts and then validate that the returned parts length is 3 (or
handle missing parts by filling defaults or raising a clear ValueError). Locate
the occurrence of metric.split in report_generation_strategy (the variables
benchmark_name, metric_name, metric_type) and replace with a guarded split +
check so malformed metric strings are handled gracefully with a descriptive
error or fallback values.
- Around line 32-40: The code checks only metric_type but uses df["Metric"] and
unguards float conversion; update the block to first verify "Metric" is in
df.columns before accessing df["Metric"], then locate the row via
df.loc[df["Metric"] == metric_name] and ensure it's non-empty; retrieve the cell
for the requested metric_type and if it's missing (row empty), NaN
(pandas.isna), or non-numeric, catch that by checking pandas.isna(value) or
attempting to convert inside a try/except (catch ValueError/TypeError) and
return METRIC_ERROR on failure; otherwise return the converted float. Ensure you
still check metric_type in df.columns and reference METRIC_ERROR, metric_type,
metric_name, df, and the selection expression df.loc[df["Metric"] ==
metric_name] when making changes.
| try: | ||
| result = subprocess.run(kubectl_exec_cmd, capture_output=True, text=True, timeout=60 * 10) | ||
| logging.debug(f"chmod exited {result.returncode}: {result.stdout} {result.stderr}") | ||
| except Exception as e: | ||
| logging.debug(f"Error making wrapper script executable in pod '{frontend_pod}': {e}") |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Silently swallowing chmod failure may mask setup errors.
If chmod fails, the wrapper script won't be executable, and the subsequent kubectl exec of the wrapper (line 349) will fail with an unrelated error. This is execution setup, not auxiliary log collection, so a bare except Exception with only debug-level logging is too lenient.
Consider either re-raising the exception or at least logging at warning level and checking result.returncode before proceeding.
Proposed fix
try:
result = subprocess.run(kubectl_exec_cmd, capture_output=True, text=True, timeout=60 * 10)
- logging.debug(f"chmod exited {result.returncode}: {result.stdout} {result.stderr}")
- except Exception as e:
- logging.debug(f"Error making wrapper script executable in pod '{frontend_pod}': {e}")
+ if result.returncode != 0:
+ logging.warning(
+ f"chmod failed in pod '{frontend_pod}' (rc={result.returncode}): {result.stderr}"
+ )
+ else:
+ logging.debug(f"chmod succeeded in pod '{frontend_pod}'")
+ except subprocess.TimeoutExpired:
+ logging.error(f"chmod timed out in pod '{frontend_pod}'")
+ raise
+ except Exception as e:
+ logging.warning(f"Error making wrapper script executable in pod '{frontend_pod}': {e}")
+ raiseBased on learnings: "log collection operations should use logging.debug for exceptions" — but this is execution setup, not log collection, so the learning does not apply here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/systems/kubernetes/kubernetes_system.py` around lines 316 - 320,
The try/except around subprocess.run(kubectl_exec_cmd, ...) swallows failures
for making the wrapper executable which can mask setup errors; update the block
in kubernetes_system.py (the subprocess.run call that builds kubectl_exec_cmd
and uses frontend_pod) to check result.returncode and, on non-zero, log at
warning/error and raise or return a failure, and change the broad except
Exception to either re-raise after logging or log at warning/error level with
the exception details so execution does not continue when chmod fails (ensure
the log includes kubectl_exec_cmd, frontend_pod and the
exception/result.returncode).
| workloads = self.cmd_args.workloads.split(",") | ||
| for workload in workloads: | ||
| if workload not in [ | ||
| self.cmd_args.genai_perf.script.src.name, | ||
| ]: | ||
| raise ValueError(f"Invalid workload script: {workload}") |
There was a problem hiding this comment.
Missing .strip() after split(",") can silently break workload validation and lookup.
If cmd_args.workloads contains spaces after commas (e.g. "genai_perf.sh, other.sh"), split(",") produces entries with leading/trailing whitespace (e.g. " other.sh"). These will never match the script filename, causing validate_test_definition to raise ValueError for valid workloads and was_run_successful (line 333) to silently mark the result as False.
🐛 Proposed fix (both call sites)
- workloads = self.cmd_args.workloads.split(",")
+ workloads = [w.strip() for w in self.cmd_args.workloads.split(",")]- for workload in self.cmd_args.workloads.split(","):
+ for workload in (w.strip() for w in self.cmd_args.workloads.split(",")):🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py` around lines 284 - 289, The
validation loop for workloads uses self.cmd_args.workloads.split(",") without
trimming whitespace, causing mismatches; update the parsing to strip each entry
(e.g., when building workloads and at the other call site referenced in the
review) so entries like "genai_perf.sh, other.sh" become
["genai_perf.sh","other.sh"]; modify the code that constructs workloads (and the
second call site mentioned) to apply .strip() to each split token before
comparing against self.cmd_args.genai_perf.script.src.name and before passing to
validate_test_definition/was_run_successful.
| def extract_metric_from_csv(self, csv_file: Path, metric_name: str, metric_type: str) -> float: | ||
| import pandas as pd |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Module-level TID253 violation is resolved; optionally adopt lazy.pd for consistency.
Moving the import inside the method resolves the CI failure. The project's lazy_imports.py provides lazy.pd as the canonical pattern for deferred pandas access. Since TID253 is satisfied, this is a style-only suggestion.
♻️ Proposed refactor (optional)
+from cloudai.util.lazy_imports import lazy
class AIDynamoReportGenerationStrategy(ReportGenerationStrategy):
def extract_metric_from_csv(self, csv_file: Path, metric_name: str, metric_type: str) -> float:
- import pandas as pd
-
- df = pd.read_csv(csv_file)
+ df = lazy.pd.read_csv(csv_file)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py` around lines
28 - 29, Remove the inline import of pandas inside extract_metric_from_csv and
instead use the project's lazy import helper by referencing lazy.pd (from
lazy_imports) so pandas is accessed lazily and consistently; update the function
to use lazy.pd for any DataFrame operations and ensure the module imports the
lazy_imports symbol (or uses its existing import) rather than doing import
pandas as pd inside extract_metric_from_csv.
Co-authored-by: Cursor <cursoragent@cursor.com>
| case $key in | ||
| --workloads) | ||
| dynamo_args["workloads"]="$2" ;; | ||
| --dynamo-*) | ||
| dynamo_args["${key#--dynamo-}"]="$2" ;; | ||
| --workloads) | ||
| dynamo_args["workloads"]="$2" ;; |
There was a problem hiding this comment.
duplicate --workloads case in switch statement
| case $key in | |
| --workloads) | |
| dynamo_args["workloads"]="$2" ;; | |
| --dynamo-*) | |
| dynamo_args["${key#--dynamo-}"]="$2" ;; | |
| --workloads) | |
| dynamo_args["workloads"]="$2" ;; | |
| --dynamo-*) | |
| dynamo_args["${key#--dynamo-}"]="$2" ;; | |
| --workloads) | |
| dynamo_args["workloads"]="$2" ;; | |
| --prefill-args-*) | |
| prefill_args["--${key#--prefill-args-}"]="$2" ;; |
|
|
||
| assert isinstance(prefill_n, int), "prefill_worker.num_nodes must be an integer" | ||
| assert isinstance(decode_n, int), "decode_worker.num_nodes must be an integer" | ||
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" |
There was a problem hiding this comment.
assertion error messages reference wrong field names - should be decode_worker.num_nodes not dynamo.num_decode_nodes
| assert isinstance(prefill_n, int), "prefill_worker.num_nodes must be an integer" | |
| assert isinstance(decode_n, int), "decode_worker.num_nodes must be an integer" | |
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" | |
| assert isinstance(prefill_n, int), "prefill_worker.num_nodes must be an integer" | |
| assert isinstance(decode_n, int), "decode_worker.num_nodes must be an integer" |
|
|
||
| _populate_nodelist() { | ||
| local num_nodes="$1" | ||
| local exclude_nodelist="$2" |
There was a problem hiding this comment.
trailing whitespace after variable assignment
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Additional Comments (1)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cloudai/systems/kubernetes/kubernetes_system.py (2)
359-368:⚠️ Potential issue | 🟠 MajorSame
shell=Truepattern for result copy command.Lines 359–368 build a command string via
" ".join(...)and pass it withshell=True. Apply the same list-based fix as suggested for Line 308.🔒 Proposed fix
- cp_logs_cmd = " ".join( - [ - "kubectl", - "cp", - f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", - str(job.test_run.output_path / "genai-perf"), - ] - ) + cp_logs_cmd = [ + "kubectl", + "cp", + f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", + str(job.test_run.output_path / "genai-perf"), + ] logging.debug(f"Copying genai-perf results with command: {cp_logs_cmd}") - p = subprocess.run(cp_logs_cmd, shell=True, capture_output=True, text=True) + p = subprocess.run(cp_logs_cmd, capture_output=True, text=True) logging.debug(f"Returned code {p.returncode}, stdout: {p.stdout}, stderr: {p.stderr}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/systems/kubernetes/kubernetes_system.py` around lines 359 - 368, The code builds cp_logs_cmd as a single string and calls subprocess.run(..., shell=True) which is unsafe; instead construct the command as a list of args (e.g. ["kubectl","cp", f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", str(job.test_run.output_path / "genai-perf")]) and call subprocess.run with that list and shell=False (remove shell=True and capture_output/text as before) so the subprocess invocation in kubernetes_system.py uses a list argument form rather than joining into cp_logs_cmd.
356-357:⚠️ Potential issue | 🟠 MajorGenai-perf execution failure is silently swallowed.
If the
kubectl execrunning the genai-perf wrapper (Line 349) throws an exception (e.g.,TimeoutExpired, connection error), it's caught by the bareexcept Exceptionand logged only atdebuglevel. This is actual benchmark execution, not auxiliary log collection — a failure here means the workload didn't run successfully. Downstream code (log copy, result parsing) will produce misleading or missing results.Log at
warning/errorand consider re-raising or returning a failure status.Proposed fix
except Exception as e: - logging.debug(f"Error executing genai-perf command in pod '{frontend_pod}': {e}") + logging.error(f"Error executing genai-perf command in pod '{frontend_pod}': {e}") + raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/systems/kubernetes/kubernetes_system.py` around lines 356 - 357, The except block catching exceptions from the genai-perf kubectl exec currently logs at debug and swallows failures; change that to log at warning or error (e.g., logging.error) including the exception details and either re-raise the exception or return a clear failure status so callers know the benchmark failed; update the try/except around the genai-perf execution (referencing frontend_pod and the genai-perf exec call) to log with higher severity and propagate the failure (raise or return False/None) so downstream log copying/result parsing does not proceed on a silent failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Line 52: The unused parameters in KubernetesSystem methods should be renamed
to underscore-prefixed names to silence linter warnings: change the __deepcopy__
signature parameter memo to _memo (i.e., def __deepcopy__(self, _memo: dict[int,
Any] | None = None) -> "KubernetesSystem") and rename the model_post_init
parameter __context to _context (or similar single underscore) while leaving
their types and behavior unchanged; update any internal references (if any) to
use the new names, or leave them unused, so the linter no longer flags unused
parameters in __deepcopy__ and model_post_init.
---
Outside diff comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 359-368: The code builds cp_logs_cmd as a single string and calls
subprocess.run(..., shell=True) which is unsafe; instead construct the command
as a list of args (e.g. ["kubectl","cp",
f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}",
str(job.test_run.output_path / "genai-perf")]) and call subprocess.run with that
list and shell=False (remove shell=True and capture_output/text as before) so
the subprocess invocation in kubernetes_system.py uses a list argument form
rather than joining into cp_logs_cmd.
- Around line 356-357: The except block catching exceptions from the genai-perf
kubectl exec currently logs at debug and swallows failures; change that to log
at warning or error (e.g., logging.error) including the exception details and
either re-raise the exception or return a clear failure status so callers know
the benchmark failed; update the try/except around the genai-perf execution
(referencing frontend_pod and the genai-perf exec call) to log with higher
severity and propagate the failure (raise or return False/None) so downstream
log copying/result parsing does not proceed on a silent failure.
---
Duplicate comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 314-318: The try/except around subprocess.run(kubectl_exec_cmd,
...) for making the wrapper script executable silently swallows failures; update
the block that runs kubectl_exec_cmd (used to chmod in the pod for frontend_pod)
to check result.returncode after subprocess.run and if non-zero log at
warning/error level including result.stdout and result.stderr, and in the except
handler log a warning that includes the exception and relevant context
(frontend_pod and the command) or re-raise to fail fast so downstream kubectl
exec calls aren't attempted with a non-executable script.
- Around line 320-329: The code builds genai-perf args from
tdef.cmd_args.genai_perf.args, but GenAIPerf stores its config as Pydantic extra
fields on the GenAIPerf object, so reading the nested .args yields an empty
dict; fix by dumping the GenAIPerf model itself
(tdef.cmd_args.genai_perf.model_dump(exclude={"name","cmd","script","args"},
exclude_none=True)) and iterate that dict to append --key=value to
genai_perf_cmd_parts, then handle tdef.cmd_args.genai_perf.extra_args the same
way as before (supporting both str and list) and keep using genai_perf_cmd_parts
as the assembled command; reference symbols: genai_perf_cmd_parts,
tdef.cmd_args.genai_perf, model_dump, extra_args.
- Around line 307-309: The subprocess.run call that builds cp_wrapper_cmd with
an f-string and shell=True risks command injection; change to a list-based
invocation: construct a list command using the same variables
(wrapper_script_path, self.default_namespace, frontend_pod, pod_wrapper_path) —
e.g. ["kubectl","cp", wrapper_script_path,
f"{self.default_namespace}/{frontend_pod}:{pod_wrapper_path}"] — and call
subprocess.run(cmd_list, capture_output=True, text=True, check=True) without
shell=True so arguments are passed safely.
- Fix sglang.toml: backend "vllm" -> "sglang", remove hardcoded hf_home_path, trim trailing whitespace in etcd-cmd - Fix sglang_slurm.toml: remove stale "kvbm" from scenario name, fix test id - Fix vllm_kvbm_slurm.toml: consistent TOML formatting for env vars - Fix ai_dynamo.sh: remove duplicate --workloads case arm - Fix kubernetes_system.py: replace shell=True with list-based subprocess call - Fix ai_dynamo.py: LMCache SSH URL -> HTTPS, replace assert with proper guard - Fix genai_perf.sh: remove unused variables (repo, endpoint, report_file) Co-authored-by: Cursor <cursoragent@cursor.com>
| # launch genai-perf | ||
| log "Launching genai-perf with args: $full_cmd" | ||
|
|
||
| eval "$full_cmd" |
There was a problem hiding this comment.
eval with user-controlled $cmdline_args creates command injection vulnerability if genai_perf_args contains shell metacharacters like ;, |, $(), etc.
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" | ||
| assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer" |
There was a problem hiding this comment.
assertion error messages reference wrong field names - should be prefill_worker.num_nodes and decode_worker.num_nodes, not dynamo.num_prefill_nodes and dynamo.num_decode_nodes
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" | |
| assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer" | |
| assert isinstance(prefill_n, int), "prefill_worker.num_nodes must be an integer" | |
| assert isinstance(decode_n, int), "decode_worker.num_nodes must be an integer" |
Remove PrefillWorkerArgs/DecodeWorkerArgs in favor of WorkerBaseArgs with dynamic attribute assignment via setattr(). Promote commonly overridden fields (ingress-cmd, node-setup-cmd, etcd-cmd, nats-cmd) into AIDynamoArgs defaults, simplifying TOML configs. Update genai_perf.sh to handle model/url/endpoint through the args array. Rename sglang backend to sglang_dsr1 in ai_dynamo.sh. Co-authored-by: Cursor <cursoragent@cursor.com>
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | ||
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | ||
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | ||
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 | ||
| prefill_nodes = prefill_worker.num_nodes if prefill_worker else 0 | ||
| decode_nodes = decode_worker.num_nodes if decode_worker else 1 |
There was a problem hiding this comment.
tensor_parallel_size and pipeline_parallel_size can be None per WorkerBaseArgs definition, but constraint_check defaults to 1 when worker is falsy, not when field is None. If prefill_worker exists but prefill_worker.args.tensor_parallel_size is None, this will use None instead of defaulting to 1.
Add # noqa: B010 to setattr calls (required for dynamic Pydantic extra fields and hyphenated attribute names). Add --prefill-args-model and --decode-args-model to the reference sbatch file to match the dynamically populated worker args. Co-authored-by: Cursor <cursoragent@cursor.com>
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | ||
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | ||
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | ||
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 |
There was a problem hiding this comment.
Doesn't handle None values for tensor_parallel_size and pipeline_parallel_size. Per WorkerBaseArgs (lines 79-98), these fields can be None. When prefill_worker exists but prefill_worker.args.tensor_parallel_size is None, this will evaluate to None instead of defaulting to 1.
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | |
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | |
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | |
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 | |
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker and prefill_worker.args.tensor_parallel_size is not None else 1 | |
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker and decode_worker.args.tensor_parallel_size is not None else 1 | |
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker and prefill_worker.args.pipeline_parallel_size is not None else 1 | |
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker and decode_worker.args.pipeline_parallel_size is not None else 1 |
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/systems/kubernetes/kubernetes_system.py (1)
364-374: 🧹 Nitpick | 🔵 Trivial
cp_logs_cmdstill usesshell=True— inconsistent with the fix applied tocp_wrapper_cmd.The earlier
cp_wrapper_cmdwas correctly migrated to a list-basedsubprocess.run. This remainingshell=Truecall is unnecessary and reintroduces the shell-injection surface.♻️ Proposed fix
- cp_logs_cmd = " ".join( - [ - "kubectl", - "cp", - f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", - str(job.test_run.output_path / "genai-perf"), - ] - ) - logging.debug(f"Copying genai-perf results with command: {cp_logs_cmd}") - p = subprocess.run(cp_logs_cmd, shell=True, capture_output=True, text=True) + cp_logs_cmd = [ + "kubectl", "cp", + f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", + str(job.test_run.output_path / "genai-perf"), + ] + logging.debug(f"Copying genai-perf results with command: {cp_logs_cmd}") + p = subprocess.run(cp_logs_cmd, capture_output=True, text=True)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/systems/kubernetes/kubernetes_system.py` around lines 364 - 374, The cp_logs_cmd is built as a single string and passed to subprocess.run with shell=True, reintroducing shell-injection risk; change cp_logs_cmd to be a list/sequence of arguments (e.g., ["kubectl","cp", f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}", str(job.test_run.output_path / "genai-perf")]) and call subprocess.run with shell=False (omit shell=True) and the same capture_output/text args; update the variable names cp_logs_cmd and the subprocess.run invocation in kubernetes_system.py accordingly so it mirrors the earlier cp_wrapper_cmd fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml`:
- Line 31: The comment fragment after the model setting is stale and includes a
dangling double-quote; edit the line with the model key (model =
"deepseek-ai/DeepSeek-R1-Distill-Llama-8B") to remove the leftover comment
fragment `#Qwen/Qwen3-0.6B"` so the comment no longer contains the stray `"` —
leave the model value unchanged and, if needed, replace the fragment with a
clean comment or remove it entirely.
In `@conf/experimental/ai_dynamo/test_scenario/vllm_kvbm_slurm.toml`:
- Line 35: The inline comment on the ingress-cmd entry contains stray
double-quote characters that look like copy-paste artifacts and may confuse
readers; edit the ingress-cmd line so the value remains "python -m
dynamo.frontend --router-mode kv" and remove the extra embedded quotes from the
trailing comment (keep only plain comment text after the #, e.g. --router-mode
kv --no-kv-events --kv-overlap-score-weight=0) so no dangling " characters
remain.
- Around line 54-55: The two settings CUFILE_LOG_LEVEL and CUFILE_LOGGING_LEVEL
look like version-variant names for the same CUFile log-level; update the config
to either consolidate to a single key if only one is required, or retain both
but add a one-line comment above them explaining that both are intentionally set
for cross-version compatibility (e.g., older vs newer CUFile releases) so future
maintainers don’t treat one as a duplicate and remove it; refer to the
CUFILE_LOG_LEVEL and CUFILE_LOGGING_LEVEL entries when making this change.
- Around line 27-29: Remove the user-specific paths left in commented lines by
replacing them with generic placeholders or removing the lines; specifically
update the commented keys storage_cache_dir and hf_home_path (replace
"/raid/users/kapila", "/mnt/vast/kapila", "/mnt/vast/disagg_inf/huggingface"
with e.g. "/path/to/storage_cache" or remove the commented entries) and remove
or genericize the patch reference "clear_kv_blocks_engine_route.patch" (e.g.
"/path/to/patch.patch"); ensure only non-identifying placeholder paths remain in
the vllm_kvbm_slurm.toml commented lines.
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 326-349: The wrapper_cmd invocation is wrong: change construction
so it passes "--cmd" (e.g., tdef.cmd_args.genai_perf.cmd or "genai-perf
profile") and the server params ("--model", "--url", "--endpoint") before the
"--" separator, use "--report_name" instead of "--report_file", and ensure
arguments after "--" are tokenized as separate "--key" "value" pairs (not
"--key=value"); update genai_perf_cmd_parts to produce
["genai-perf","profile","--artifact-dir", genai_perf_results_path, "--key",
"value", ...] and when extending with extra_args split into tokens (no '=' form)
so parse_genai_perf_args in the pod wrapper receives positional pairs; modify
wrapper_cmd to include "--result_dir", genai_perf_results_path, "--report_name",
report_file, "--gpus_per_node", str(self.gpus_per_node), then the required
"--cmd", "--model", "--url", "--endpoint" entries before the "--", followed by
the tokenized genai_perf_cmd_parts.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 180-194: In populate_prefill_decode_args (the model_validator
method), replace the flagged setattr calls with direct attribute assignment on
prefill_worker.args and decode_worker.args for the constant names (e.g., use
prefill_worker.args.model = self.model and decode_worker.args.model = self.model
when backend == "vllm"), and add a branch for backend == "sglang_dsr1" that
mirrors the "sglang" behavior (assign model-path and served-model-name directly
on prefill_worker.args and decode_worker.args) so TOML values like backend =
"sglang_dsr1" are accepted instead of raising ValueError.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`:
- Around line 407-423: The function array_to_args() is dead code (never called)
and should be removed or used; either delete array_to_args to clean up the file,
or replace the inline array-to-argument construction in launch_decode,
launch_prefill, and launch_workload with a call to array_to_args by passing the
local associative array (using name reference) and using its output when
building the command; if you keep it, ensure callers reference the same
placeholder-replacement behavior (replace_placeholders) and quoting rules so
behavior is preserved.
- Around line 65-68: The min() function is dead code (SC2329); remove the
function definition for min() to eliminate unused code, or if its behavior is
required, replace inline comparisons with calls to min() from the appropriate
caller(s) and ensure those callers reference min; locate the min() function in
the script and either delete its block or update code paths that need the
smaller-of-two logic to call min() consistently.
- Around line 962-966: The log call references lmcache_path before it’s set, so
move or change the log to use the variable after it’s declared; specifically,
declare local lmcache_path="${lmcache_config["repo"]}" before calling log, then
call log "Setting up LMCache; installing LMCache using: uv pip install
$lmcache_path" and finally run uv pip install -e $lmcache_path
(functions/variables: lmcache_path, lmcache_config, log, uv pip install).
- Around line 1040-1053: The launch invocation is dropping the port because
genai_perf.sh's process_args doesn't handle a standalone --port, so include the
port when building the URL and remove the separate --port argument: change the
--url argument to embed "${dynamo_args["port"]}" (e.g.
"http://${dynamo_args["frontend-node"]}:${dynamo_args["port"]}") and delete the
`--port "${dynamo_args["port"]}"` entry in the bash invocation in ai_dynamo.sh
(the call that runs "${INSTALL_DIR}/$script" from launch_workload), so
genai_perf.sh receives the full address instead of losing the port flag.
- Around line 932-940: The function setup_storage_cache_dir mutates
STORAGE_CACHE_DIR in-place causing path segments to accumulate across calls
(breaks when multiple connectors like lmcache and kvbm are setup); fix it by
reading the immutable base once (e.g., capture a STORAGE_CACHE_BASE or local
base variable from the current STORAGE_CACHE_DIR value) and then build
STORAGE_CACHE_DIR for the current connector as
"${base}/${TEST_USER}/${dynamo_args["frontend-node"]}/${connector}/cache" (do
not append to STORAGE_CACHE_DIR itself), then recreate and chmod that path and
ensure any exported variable (e.g., DYN_KVBM_DISK_CACHE_DIR) uses that
constructed path.
---
Outside diff comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 364-374: The cp_logs_cmd is built as a single string and passed to
subprocess.run with shell=True, reintroducing shell-injection risk; change
cp_logs_cmd to be a list/sequence of arguments (e.g., ["kubectl","cp",
f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}",
str(job.test_run.output_path / "genai-perf")]) and call subprocess.run with
shell=False (omit shell=True) and the same capture_output/text args; update the
variable names cp_logs_cmd and the subprocess.run invocation in
kubernetes_system.py accordingly so it mirrors the earlier cp_wrapper_cmd fix.
---
Duplicate comments:
In `@conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml`:
- Around line 20-21: The scenario TOML has inconsistent identifiers: id =
"sglang-DeepSeek-R1-Distill-Llama-8B", test_name = "sglang-Qwen3-0.6B", and
model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"; fix by making test_name
match the actual model/test template (e.g., change test_name to a
DeepSeek-specific template name that corresponds to id/model) or revert model to
the Qwen3 model to match the existing "sglang-Qwen3-0.6B" template—ensure the
three fields id, test_name, and model are aligned so test_name matches the name
of a real test TOML template.
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 319-323: The current subprocess.run call that executes
kubectl_exec_cmd (the chmod step using frontend_pod) only logs exceptions at
debug and never checks result.returncode; change this to check result.returncode
after subprocess.run and if non-zero log at warning (including
result.returncode, result.stdout, result.stderr and frontend_pod context) and
raise or return an error so callers don't proceed with a non-executable wrapper;
also change the except block to log the caught exception at warning with the pod
name and exception details instead of debug.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 308-313: The validation loop on self.cmd_args.workloads splits by
commas but doesn't trim whitespace, causing lookups against
self.cmd_args.genai_perf.script.src.name to fail for entries like " foo". Update
the parsing to strip whitespace from each item after splitting (e.g., map or
list comprehension that calls .strip() on each workload) and also handle/ignore
empty strings so the for-loop over workloads uses cleaned names when comparing
to self.cmd_args.genai_perf.script.src.name before raising ValueError.
- Around line 167-173: The problem is that prefill_worker/decode_worker are
always truthy due to Field(default_factory=WorkerConfig) so the fallback logic
in constraint_check never runs and None values for
tensor_parallel_size/pipeline_parallel_size cause TypeError; change the
WorkerConfig fields to be Optional[WorkerConfig] with Field(default=None) for
prefill_worker and decode_worker (and adjust any serialization/validation
aliases similarly), or alternatively update constraint_check to defensively use
defaults like (worker.tensor_parallel_size or 1) and
(worker.pipeline_parallel_size or 1) when dereferencing
tensor_parallel_size/pipeline_parallel_size; update type hints for WorkerConfig
usage and tests accordingly so constraint_check can handle missing workers and
None TP/PP without raising.
- Around line 376-406: constraint_check can raise TypeError because
tensor_parallel_size and pipeline_parallel_size may be None even when
prefill_worker/decode_worker exist; update the assignments in constraint_check
to safely coalesce None to sensible defaults (e.g., treat None as 1 for tp/pp)
using explicit fallback of the form prefill_worker.args.tensor_parallel_size or
1 and prefill_worker.args.pipeline_parallel_size or 1 (and same for decode_*),
and ensure num_nodes falls back to 0 if missing (e.g., decode_worker.num_nodes
or 0); modify the variables prefill_tp, decode_tp, prefill_pp, decode_pp,
prefill_nodes, decode_nodes inside constraint_check accordingly so comparisons
and multiplications never operate on None.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`:
- Around line 882-898: The function _query_frontend is defined but never called;
either remove it or wire it up where frontend queries are performed. Locate the
caller sites that should send chat completion requests (search for usages of
dynamo_args, RESULTS_DIR, or other _query_* functions) and either replace inline
curl logic with a call to _query_frontend or call _query_frontend with the
appropriate arguments (content and max_tokens) from the code path that prepares
user prompts; if the helper is unnecessary, delete _query_frontend to avoid dead
code and update any tests/docs that referenced it.
- Around line 276-286: The variables ENABLE_LMCACHE and ENABLE_KVBM are being
set in _apply_connector_settings but never referenced, triggering an
unused-variable warning; either remove these assignments or make them effective
by exporting or using them in subsequent logic. Update the
_apply_connector_settings function: if these flags are intended to be consumed
by other scripts/processes, export ENABLE_LMCACHE and ENABLE_KVBM (or write them
to the environment/config) when set; if they are only for local conditionals,
replace the assignments with the actual conditional behavior (e.g., use them in
downstream if statements or call the enabling functions). Ensure you modify the
references to ENABLE_LMCACHE and ENABLE_KVBM (the symbols in the function)
consistently so the variables are no longer unused.
- Around line 962-993: The setup currently runs `uv pip install -e
$lmcache_path` at runtime inside the `setup_lmcache` flow, which causes builds
to be non-reproducible and slow; change this by removing the runtime install and
ensuring LMCache is installed at image/build time or validated beforehand:
update `setup_lmcache` to only verify the package is importable (e.g. attempt a
Python import) and emit a clear error directing users to pre-install LMCache, or
gate the install behind an explicit opt-in flag (e.g.
SKIP_RUNTIME_INSTALL=false) so the runtime path does not perform `uv pip
install` by default; reference `setup_lmcache`, `lmcache_path`, and the `uv pip
install -e $lmcache_path` invocation when making the change.
- Around line 831-846: The launch_prefill code has unquoted array expansions
that can break arguments with spaces: in the args_arr population change
args_arr+=($key $(replace_placeholders "...")) to preserve tokens by quoting
each element (use args_arr+=("$key" "$(replace_placeholders
"${prefill_args[$key]}")")). Also quote the array when logging and invoking the
command: replace ${args_arr[*]} and ${args_arr[@]} with quoted forms
("${args_arr[*]}" for a single string in the log if desired, and
"${args_arr[@]}" when passing to the process). Update references in
launch_prefill (args_arr creation, the Prefill cmd log, and the command
invocation) accordingly.
- Around line 1079-1083: The two unguarded cd calls (cd
${dynamo_args["workspace-path"]} inside the _is_vllm/_is_sglang branch and cd
$RESULTS_DIR) can fail silently (SC2164); update both to verify the target
directory exists and abort on failure—e.g., check [ -d "<dir>" ] before changing
directory and/or use cd "<dir>" || { echo "ERROR: cannot cd to <dir>"; exit 1;
}—or use pushd/popd with similar error checks so subsequent steps don't run in
the wrong location. Ensure you reference the exact symbols:
${dynamo_args["workspace-path"]}, _is_vllm, _is_sglang, and RESULTS_DIR when
applying the checks.
- Around line 776-791: The loop building args_arr is unquoted and will split
multi-word values; change the construction to append each key and its replaced
value as separate, quoted array elements (use args_arr+=("$key"
"$(replace_placeholders "${decode_args[$key]}")")), and when expanding use the
proper array expansion "${args_arr[@]}" (not ${args_arr[*]}) in both the log and
the command invocation; also ensure you quote expansions from decode_config
(e.g. use "${decode_config["cmd"]}" and "${decode_config["extra-args"]}"
appropriately) so multi-word args are preserved when launching the decode worker
and redirecting to $log_file.
- Around line 456-466: The unquoted command substitution in kill $(jobs -p) can
cause word-splitting and globbing issues (SC2046); replace it by capturing jobs
-p into an array and only invoking kill when non-empty, e.g. read the PIDs into
a variable/array and call kill with a quoted array expansion so PIDs are passed
safely (avoid unquoted $(jobs -p)); update the block around the
FATAL_ERROR_MARKER handling where kill $(jobs -p) is used (in the same scope as
FATAL_ERROR_MARKER and perform_exit) to collect jobs PIDs and kill them with
proper quoting.
In `@src/cloudai/workloads/ai_dynamo/genai_perf.sh`:
- Around line 78-80: The case arm matching "--*)" in genai_perf.sh currently
does "shift 2", which consumes two args for any unknown flag and can
misinterpret boolean flags; update that branch to check the next positional
argument (e.g., "$2" or "${1#-}") and if it's empty or starts with "-" treat the
flag as boolean and only "shift 1", otherwise treat it as a key/value and "shift
2" so you don't accidentally consume the following real option.
- Around line 90-97: The triple-quoted string passed to log in genai_perf.sh is
invalid Bash; replace the log """...""" block with a valid Bash multiline
string—e.g., use a here-doc or multiple echo/printf calls. Locate the invocation
that starts with log """Parsed args: and rewrite it to either: 1) use a here-doc
passed to log (cat <<'EOF' ... EOF) or 2) call log with a single-quoted here-doc
to preserve literal content if needed, ensuring the genai_perf_args expansion
(the $(for key ... ) subshell) is evaluated appropriately before feeding the
here-doc, or 3) build the string with printf/echo lines and then call log, so
the script no longer uses triple quotes.
- Around line 100-117: In process_results, empty grep/cut results (for
output_tokens_per_second, request_throughput, request_count) and a zero
total_gpus can cause silent arithmetic errors; fix by validating/parsing each
extracted value before using it: check that profile_path exists, test that
grep/cut produced non-empty numeric strings for output_tokens_per_second,
request_throughput and request_count and fall back to 0 (or log an error) if not
numeric, and ensure total_gpus is >0 before dividing (use 1 as safe default or
skip per-GPU calc); update the logic around output_tokens_per_second_per_gpu
computation and the echo lines to use the sanitized/validated variables in
function process_results.
- Around line 138-144: The use of eval on the assembled string full_cmd (built
from cmd and cmdline_args) introduces command-injection risk; replace eval
"$full_cmd" with a safe exec of an argument array: build an array (e.g., args=(
"$cmd" "${cmdline_args[@]}" ) or parse cmdline_args into properly quoted
elements) and call the executable directly (e.g., exec "${args[@]}") so
arbitrary shell metacharacters in cmdline_args cannot be executed; update the
launch/logging around log "Launching genai-perf with args: $full_cmd" to log the
safely quoted arguments instead of relying on eval.
In `@tests/ref_data/ai-dynamo.sbatch`:
- Around line 17-26: The srun call contains a duplicated flag --export=ALL which
causes the same option to be passed twice; modify the srun argument list (the
block with srun and its flags) to remove the redundant --export=ALL occurrence
so the flag appears only once, keeping the other options (e.g., --mpi=pmix, -N2,
--container-image, --container-mounts, --nodes, --ntasks, --ntasks-per-node)
unchanged.
- Line 74: The CLI argument --genai_perf-extra-inputs is using a double-quoted
string that contains JSON with inner double quotes, which gets corrupted by the
shell; fix the invocation in the sbatch script by either wrapping the JSON value
in single quotes (e.g. --genai_perf-extra-inputs '{"temperature": 0.7,
"max_tokens": 128}') or by escaping the inner double quotes so the JSON is
passed intact to the program.
| workloads = "genai_perf.sh" | ||
|
|
||
| [Tests.cmd_args.dynamo] | ||
| model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" #Qwen/Qwen3-0.6B" |
There was a problem hiding this comment.
Remove the stale comment fragment with dangling ".
The comment #Qwen/Qwen3-0.6B" is a leftover from a prior edit — the trailing " is part of the comment string, making it a confusing artifact.
🧹 Proposed fix
- model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" `#Qwen/Qwen3-0.6B`"
+ model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" #Qwen/Qwen3-0.6B" | |
| model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml` at line 31, The
comment fragment after the model setting is stale and includes a dangling
double-quote; edit the line with the model key (model =
"deepseek-ai/DeepSeek-R1-Distill-Llama-8B") to remove the leftover comment
fragment `#Qwen/Qwen3-0.6B"` so the comment no longer contains the stray `"` —
leave the model value unchanged and, if needed, replace the fragment with a
clean comment or remove it entirely.
| #storage_cache_dir = "/raid/users/kapila" | ||
| #storage_cache_dir = "/mnt/vast/kapila" | ||
| #hf_home_path = "/mnt/vast/disagg_inf/huggingface" |
There was a problem hiding this comment.
Remove user-specific paths from commented-out lines.
Lines 27–28 and 36 contain paths that identify a specific user (/raid/users/kapila, /mnt/vast/kapila, /cloudai_install/clear_kv_blocks_engine_route.patch). These are development-time artifacts and should be removed or replaced with generic placeholder paths before merging into a public repository.
🛡️ Proposed fix
- `#storage_cache_dir` = "/raid/users/kapila"
- `#storage_cache_dir` = "/mnt/vast/kapila"
+ `#storage_cache_dir` = "/path/to/storage_cache"
`#hf_home_path` = "/mnt/vast/disagg_inf/huggingface"- `#node-setup-cmd` = "/usr/local/ucx/bin/ucx_info -d |grep Transport | sort -u; (cd /opt/dynamo/venv/lib/python3.12/site-packages/dynamo && patch -p4 < /cloudai_install/clear_kv_blocks_engine_route.patch)"
+ `#node-setup-cmd` = "/usr/local/ucx/bin/ucx_info -d | grep Transport | sort -u"Also applies to: 36-36
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@conf/experimental/ai_dynamo/test_scenario/vllm_kvbm_slurm.toml` around lines
27 - 29, Remove the user-specific paths left in commented lines by replacing
them with generic placeholders or removing the lines; specifically update the
commented keys storage_cache_dir and hf_home_path (replace "/raid/users/kapila",
"/mnt/vast/kapila", "/mnt/vast/disagg_inf/huggingface" with e.g.
"/path/to/storage_cache" or remove the commented entries) and remove or
genericize the patch reference "clear_kv_blocks_engine_route.patch" (e.g.
"/path/to/patch.patch"); ensure only non-identifying placeholder paths remain in
the vllm_kvbm_slurm.toml commented lines.
|
|
||
| [Tests.cmd_args.dynamo] | ||
| model = "Qwen/Qwen3-0.6B" | ||
| ingress-cmd = "python -m dynamo.frontend --router-mode kv" # --router-mode kv --no-kv-events --kv-overlap-score-weight=0" # --router-mode kv" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Stray quote characters in the inline comment are misleading copy-paste artifacts.
The trailing comment on ingress-cmd contains embedded " characters (...--kv-overlap-score-weight=0" # --router-mode kv") that are remnants of editing and could confuse readers into thinking the string value is longer than it is.
♻️ Proposed cleanup
- ingress-cmd = "python -m dynamo.frontend --router-mode kv" # --router-mode kv --no-kv-events --kv-overlap-score-weight=0" # --router-mode kv"
+ ingress-cmd = "python -m dynamo.frontend --router-mode kv" # alternatives: --no-kv-events --kv-overlap-score-weight=0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ingress-cmd = "python -m dynamo.frontend --router-mode kv" # --router-mode kv --no-kv-events --kv-overlap-score-weight=0" # --router-mode kv" | |
| ingress-cmd = "python -m dynamo.frontend --router-mode kv" # alternatives: --no-kv-events --kv-overlap-score-weight=0 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@conf/experimental/ai_dynamo/test_scenario/vllm_kvbm_slurm.toml` at line 35,
The inline comment on the ingress-cmd entry contains stray double-quote
characters that look like copy-paste artifacts and may confuse readers; edit the
ingress-cmd line so the value remains "python -m dynamo.frontend --router-mode
kv" and remove the extra embedded quotes from the trailing comment (keep only
plain comment text after the #, e.g. --router-mode kv --no-kv-events
--kv-overlap-score-weight=0) so no dangling " characters remain.
| CUFILE_LOG_LEVEL = "INFO" | ||
| CUFILE_LOGGING_LEVEL = "INFO" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
CUFILE_LOG_LEVEL and CUFILE_LOGGING_LEVEL are both set — clarify intent.
These appear to be version-variant names for the same CUFile log-level setting. If both are required for cross-version compatibility, a brief comment explaining this would prevent future readers from treating one as a duplicate and removing it.
♻️ Suggested clarification
+ # Set both variants for CUFile log level — name changed between CUFile versions
CUFILE_LOG_LEVEL = "INFO"
CUFILE_LOGGING_LEVEL = "INFO"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| CUFILE_LOG_LEVEL = "INFO" | |
| CUFILE_LOGGING_LEVEL = "INFO" | |
| # Set both variants for CUFile log level — name changed between CUFile versions | |
| CUFILE_LOG_LEVEL = "INFO" | |
| CUFILE_LOGGING_LEVEL = "INFO" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@conf/experimental/ai_dynamo/test_scenario/vllm_kvbm_slurm.toml` around lines
54 - 55, The two settings CUFILE_LOG_LEVEL and CUFILE_LOGGING_LEVEL look like
version-variant names for the same CUFile log-level; update the config to either
consolidate to a single key if only one is required, or retain both but add a
one-line comment above them explaining that both are intentionally set for
cross-version compatibility (e.g., older vs newer CUFile releases) so future
maintainers don’t treat one as a duplicate and remove it; refer to the
CUFILE_LOG_LEVEL and CUFILE_LOGGING_LEVEL entries when making this change.
| genai_perf_cmd_parts = ["genai-perf", "profile", f"--artifact-dir={genai_perf_results_path}"] | ||
| if tdef.cmd_args.genai_perf.args: | ||
| for k, v in tdef.cmd_args.genai_perf.args.model_dump(exclude_none=True).items(): | ||
| genai_perf_cmd_parts.append(f"--{k}={v}") | ||
| if extra_args := tdef.cmd_args.genai_perf.extra_args: | ||
| if isinstance(extra_args, str): | ||
| genai_perf_cmd_parts.extend(extra_args.split()) | ||
| else: | ||
| genai_perf_cmd_parts.extend(extra_args) | ||
|
|
||
| # Build wrapper command with proper parameters | ||
| report_file = "genai_perf_report.csv" | ||
| wrapper_cmd = [ | ||
| "/bin/bash", | ||
| pod_wrapper_path, | ||
| "--result_dir", | ||
| genai_perf_results_path, | ||
| "--report_file", | ||
| report_file, | ||
| "--gpus_per_node", | ||
| str(self.gpus_per_node), | ||
| "--", | ||
| *genai_perf_cmd_parts, | ||
| ] |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find the genai_perf.sh file
echo "=== Locating genai_perf.sh ==="
find . -name "genai_perf.sh" -type f
# Then search for the patterns mentioned
echo ""
echo "=== Examining genai_perf.sh parser ==="
rg -n 'parse_genai_perf_args|cmd=|full_cmd|eval' --type-add 'bash:*.sh' -t bash 'src/cloudai/workloads/ai_dynamo/genai_perf.sh' -A 5 -B 2
echo ""
echo "=== Examining wrapper_cmd in kubernetes_system.py ==="
rg -n 'wrapper_cmd|report_file|report_name|genai_perf_cmd_parts' src/cloudai/systems/kubernetes/kubernetes_system.py -B 2 -A 2Repository: NVIDIA/cloudai
Length of output: 3295
🏁 Script executed:
#!/bin/bash
# Let's check the exact structure of genai_perf.sh to understand the parser
echo "=== Full genai_perf.sh content (focusing on parse_genai_perf_args and cmd handling) ==="
wc -l src/cloudai/workloads/ai_dynamo/genai_perf.sh
# Get the entire script to understand the flow
head -100 src/cloudai/workloads/ai_dynamo/genai_perf.shRepository: NVIDIA/cloudai
Length of output: 2106
🏁 Script executed:
#!/bin/bash
# Let's see the full kubernetes_system.py to understand the wrapper_cmd construction context
echo "=== kubernetes_system.py lines 320-360 ==="
sed -n '320,360p' src/cloudai/systems/kubernetes/kubernetes_system.pyRepository: NVIDIA/cloudai
Length of output: 2100
🏁 Script executed:
#!/bin/bash
# Check where tdef.cmd_args.genai_perf.args comes from
echo "=== Searching for GenaiPerfArgs class definition ==="
rg -n 'class GenaiPerfArgs|model.*:|url.*:|endpoint.*:' -A 3 src/ --type py | head -50
echo ""
echo "=== Searching for how genai_perf args are populated in SLURM path for comparison ==="
rg -n 'genai_perf_cmd_parts\|launch_workload' src/cloudai/systems/slurm/ --type py -B 2 -A 5Repository: NVIDIA/cloudai
Length of output: 3570
🏁 Script executed:
#!/bin/bash
# Get more context on GenaiPerfArgs structure
echo "=== Searching for GenaiPerfArgs definition ==="
find src -name "*.py" -type f | xargs grep -l "GenaiPerfArgs" | head -5
# Look at one of those files
echo ""
echo "=== Checking cmd_args structure ==="
rg -n 'class.*Genai|model|url|endpoint' src/cloudai/command_line_args.py -A 5 | head -80Repository: NVIDIA/cloudai
Length of output: 231
🏁 Script executed:
#!/bin/bash
# Find files with GenaiPerfArgs
echo "=== Finding GenaiPerfArgs definition ==="
rg -n 'class GenaiPerfArgs|GenaiPerfArgs' src/ --type py -A 10 | head -60
echo ""
echo "=== Finding how genai_perf is used in SLURM ==="
find src -path "*slurm*" -name "*.py" -type f | xargs rg -l 'genai_perf' | head -3Repository: NVIDIA/cloudai
Length of output: 209
🏁 Script executed:
#!/bin/bash
# Search for GenaiPerfArgs in all Python files
echo "=== All references to GenaiPerfArgs ==="
rg 'GenaiPerfArgs' src/ --type py
echo ""
echo "=== Search for genai_perf class definition ==="
rg 'class.*[Gg]enai.*[Pp]erf' src/ --type py -A 15Repository: NVIDIA/cloudai
Length of output: 145
🏁 Script executed:
#!/bin/bash
# Let me find the AI Dynamo workload definition
echo "=== Finding AI Dynamo related files ==="
find src -name "*ai_dynamo*" -o -name "*genai*" | grep -v __pycache__
echo ""
echo "=== Checking the AI Dynamo workload file ==="
head -150 src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 5186
🏁 Script executed:
#!/bin/bash
# Continue reading ai_dynamo.py for GenaiPerfArgs
echo "=== Searching for GenaiPerfArgs in ai_dynamo.py ==="
rg -n 'class.*Perf|class.*Genai' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -A 20
echo ""
echo "=== Search for 'genai_perf' in ai_dynamo.py ==="
rg -n 'genai_perf' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -B 2 -A 5Repository: NVIDIA/cloudai
Length of output: 2771
🏁 Script executed:
#!/bin/bash
# Let's look at the entire ai_dynamo.py file
echo "=== Full ai_dynamo.py ==="
wc -l src/cloudai/workloads/ai_dynamo/ai_dynamo.py
tail -300 src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 11621
🏁 Script executed:
#!/bin/bash
# Search for where cmd_args are defined
echo "=== Search for cmd_args in ai_dynamo.py ==="
rg -n 'cmd_args|CmdArgs' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -B 2 -A 3
echo ""
echo "=== Search for how SLURM passes genai-perf cmd ==="
rg -n 'genai-perf|genai_perf' src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py -B 2 -A 8Repository: NVIDIA/cloudai
Length of output: 3960
🏁 Script executed:
#!/bin/bash
# Check if GenAIPerf has any args field defined
echo "=== Verify GenAIPerf args field ==="
sed -n '239,250p' src/cloudai/workloads/ai_dynamo/ai_dynamo.py
echo ""
echo "=== Check Workload class definition for args ==="
sed -n '44,75p' src/cloudai/workloads/ai_dynamo/ai_dynamo.pyRepository: NVIDIA/cloudai
Length of output: 1500
🏁 Script executed:
#!/bin/bash
# Let's verify how genai_perf.args are used in kubernetes_system.py
echo "=== Full context of the genai_perf invocation in kubernetes_system.py ==="
sed -n '300,360p' src/cloudai/systems/kubernetes/kubernetes_system.pyRepository: NVIDIA/cloudai
Length of output: 3004
🏁 Script executed:
#!/bin/bash
# Verify what tdef.cmd_args.genai_perf.args would contain
echo "=== Checking if GenAIPerf defines args ==="
rg -n 'class GenAIPerf' src/cloudai/workloads/ai_dynamo/ai_dynamo.py -A 10
echo ""
echo "=== Check if genai_perf.args is populated anywhere ==="
rg -n 'genai_perf\.args' src/cloudai/workloads/ai_dynamo/ --type pyRepository: NVIDIA/cloudai
Length of output: 519
Kubernetes genai_perf.sh invocation is completely broken — --cmd, --model, --url are never set, --key=value format is incompatible with the script's parser, and --report_file is the wrong parameter name.
Four interacting defects cause the Kubernetes genai-perf execution to fail:
1. --cmd is never passed. wrapper_cmd passes --result_dir, --report_file, --gpus_per_node, then --. genai_perf.sh's process_args() expects --cmd before the -- separator (line 65–67). Since it's missing, cmd="" remains, and main() builds full_cmd="$cmd $cmdline_args" → the command starts with a space and eval fails with "command not found".
2. --key=value format is incompatible with parse_genai_perf_args. Line 329 builds arguments as f"--{k}={v}", but the script's parser (line 23–24) does genai_perf_args["${1}"]="$2", expecting positional --key value pairs. Every "--artifact-dir=/tmp/..." element is treated as a key and the next list element is consumed as its value, garbling the entire argument mapping.
3. --model, --url, --endpoint are not passed before --. These must be handled by process_args() before the -- separator (lines 35–49) to populate the wrapper's internal args. They are completely omitted, so genai-perf runs without server coordinates.
4. --report_file vs --report_name. Line 343 passes "--report_file" but genai_perf.sh handles "--report_name" (line 52–54). The parameter is silently dropped and defaults to "genai_perf_report.csv".
Example of required fix
The wrapper invocation must:
- Pass
--cmd "genai-perf profile"before-- - Pass
--model,--url(with port),--endpointbefore-- - Pass
--report_name(not--report_file) - Format args after
--as--key valuepairs (not--key=value)
wrapper_cmd = [
"/bin/bash", pod_wrapper_path,
"--result_dir", genai_perf_results_path,
"--report_name", report_file,
"--gpus_per_node", str(self.gpus_per_node),
"--cmd", tdef.cmd_args.genai_perf.cmd,
"--model", tdef.cmd_args.dynamo.model,
"--url", f"http://{frontend_pod}:{tdef.cmd_args.dynamo.port}",
"--endpoint", ...,
"--",
"genai-perf", "profile",
"--artifact-dir", genai_perf_results_path,
# ... other args as --key value, not --key=value
]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/systems/kubernetes/kubernetes_system.py` around lines 326 - 349,
The wrapper_cmd invocation is wrong: change construction so it passes "--cmd"
(e.g., tdef.cmd_args.genai_perf.cmd or "genai-perf profile") and the server
params ("--model", "--url", "--endpoint") before the "--" separator, use
"--report_name" instead of "--report_file", and ensure arguments after "--" are
tokenized as separate "--key" "value" pairs (not "--key=value"); update
genai_perf_cmd_parts to produce ["genai-perf","profile","--artifact-dir",
genai_perf_results_path, "--key", "value", ...] and when extending with
extra_args split into tokens (no '=' form) so parse_genai_perf_args in the pod
wrapper receives positional pairs; modify wrapper_cmd to include "--result_dir",
genai_perf_results_path, "--report_name", report_file, "--gpus_per_node",
str(self.gpus_per_node), then the required "--cmd", "--model", "--url",
"--endpoint" entries before the "--", followed by the tokenized
genai_perf_cmd_parts.
| function min() | ||
| { | ||
| echo "$(( $1 < $2 ? $1 : $2 ))" | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
min() is dead code (SC2329).
This function is never called anywhere in the script. Either integrate it or remove it.
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 65-68: This function is never invoked. Check usage (or ignored if invoked indirectly).
(SC2329)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh` around lines 65 - 68, The min()
function is dead code (SC2329); remove the function definition for min() to
eliminate unused code, or if its behavior is required, replace inline
comparisons with calls to min() from the appropriate caller(s) and ensure those
callers reference min; locate the min() function in the script and either delete
its block or update code paths that need the smaller-of-two logic to call min()
consistently.
| function array_to_args() | ||
| { | ||
| local -n arr=$1 | ||
| local result="" | ||
| for key in "${!arr[@]}"; do | ||
| if [[ "$key" == "--extra-args" ]] || \ | ||
| [[ "$key" == "--num-nodes" ]] || \ | ||
| [[ "$key" == "--nodes" ]]; then | ||
| continue | ||
| shopt -s nocasematch | ||
| val=$(replace_placeholders "${arr[$key]}") | ||
| # Quote values that contain spaces | ||
| if [[ "$val" == *" "* ]]; then | ||
| val="${val//\"/\\\"}" # Escape existing quotes | ||
| result+="${key} \"${val}\" " | ||
| else | ||
| result+="${key} ${arr[$key]} " | ||
| result+="${key} ${val} " | ||
| fi | ||
| done | ||
| echo "$result" | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
array_to_args() is dead code (SC2329).
This function was introduced in this PR but is never called; launch_decode, launch_prefill, and launch_workload all build their own inline arrays. Remove or integrate it.
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 407-423: This function is never invoked. Check usage (or ignored if invoked indirectly).
(SC2329)
[warning] 409-409: Variable was used as an array but is now assigned a string.
(SC2178)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh` around lines 407 - 423, The
function array_to_args() is dead code (never called) and should be removed or
used; either delete array_to_args to clean up the file, or replace the inline
array-to-argument construction in launch_decode, launch_prefill, and
launch_workload with a call to array_to_args by passing the local associative
array (using name reference) and using its output when building the command; if
you keep it, ensure callers reference the same placeholder-replacement behavior
(replace_placeholders) and quoting rules so behavior is preserved.
| function setup_storage_cache_dir() | ||
| { | ||
| local connector="$1" | ||
| # Use a global variable that can be exported | ||
| STORAGE_CACHE_DIR="$STORAGE_CACHE_DIR/${TEST_USER}/${dynamo_args["frontend-node"]}/${connector}/cache" | ||
| rm -rf "${STORAGE_CACHE_DIR}" | ||
| mkdir -p "${STORAGE_CACHE_DIR}" | ||
| chmod 755 "${STORAGE_CACHE_DIR}" | ||
| } |
There was a problem hiding this comment.
setup_storage_cache_dir accumulates path segments on each call — breaks when both lmcache and kvbm connectors are enabled.
STORAGE_CACHE_DIR is modified in-place: each call appends /${TEST_USER}/${frontend}/${connector}/cache to the current value. When setup_lmcache and setup_kvbm are both invoked (frontend-node path in main), the second call compounds the path:
- After
setup_lmcache:/cloudai_install/storage_cache/<user>/<frontend>/lmcache/cache - After
setup_kvbm(modifier applied to already-modified path):/cloudai_install/storage_cache/<user>/<frontend>/lmcache/cache/<user>/<frontend>/kvbm/cache
DYN_KVBM_DISK_CACHE_DIR then exports the wrong path.
🐛 Proposed fix
Capture the original base once (after CLI parsing overwrites it) and use it as the immutable base in every call:
function parse_args()
{
_parse_cli_pairs "$@"
+ # Freeze the configured base before any connector setup mutates STORAGE_CACHE_DIR.
+ STORAGE_CACHE_DIR_BASE="${STORAGE_CACHE_DIR}"
_set_nodelists
...
}
function setup_storage_cache_dir()
{
local connector="$1"
- # Use a global variable that can be exported
- STORAGE_CACHE_DIR="$STORAGE_CACHE_DIR/${TEST_USER}/${dynamo_args["frontend-node"]}/${connector}/cache"
+ STORAGE_CACHE_DIR="${STORAGE_CACHE_DIR_BASE}/${TEST_USER}/${dynamo_args["frontend-node"]}/${connector}/cache"
rm -rf "${STORAGE_CACHE_DIR}"
mkdir -p "${STORAGE_CACHE_DIR}"
chmod 755 "${STORAGE_CACHE_DIR}"
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh` around lines 932 - 940, The
function setup_storage_cache_dir mutates STORAGE_CACHE_DIR in-place causing path
segments to accumulate across calls (breaks when multiple connectors like
lmcache and kvbm are setup); fix it by reading the immutable base once (e.g.,
capture a STORAGE_CACHE_BASE or local base variable from the current
STORAGE_CACHE_DIR value) and then build STORAGE_CACHE_DIR for the current
connector as
"${base}/${TEST_USER}/${dynamo_args["frontend-node"]}/${connector}/cache" (do
not append to STORAGE_CACHE_DIR itself), then recreate and chmod that path and
ensure any exported variable (e.g., DYN_KVBM_DISK_CACHE_DIR) uses that
constructed path.
| _require_cmd uv | ||
| log "Setting up LMCache; installing LMCache using: uv pip install $lmcache_path" | ||
| local lmcache_path="${lmcache_config["repo"]}" | ||
| uv pip install -e $lmcache_path | ||
|
|
There was a problem hiding this comment.
$lmcache_path referenced in log message before it is declared.
Line 963 logs $lmcache_path, but the variable is only declared on line 964. The log will print an empty string (or inherit an unexpected outer-scope value).
🐛 Proposed fix
- _require_cmd uv
- log "Setting up LMCache; installing LMCache using: uv pip install $lmcache_path"
- local lmcache_path="${lmcache_config["repo"]}"
- uv pip install -e $lmcache_path
+ _require_cmd uv
+ local lmcache_path="${lmcache_config["repo"]}"
+ log "Setting up LMCache; installing LMCache using: uv pip install $lmcache_path"
+ uv pip install -e "$lmcache_path"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| _require_cmd uv | |
| log "Setting up LMCache; installing LMCache using: uv pip install $lmcache_path" | |
| local lmcache_path="${lmcache_config["repo"]}" | |
| uv pip install -e $lmcache_path | |
| _require_cmd uv | |
| local lmcache_path="${lmcache_config["repo"]}" | |
| log "Setting up LMCache; installing LMCache using: uv pip install $lmcache_path" | |
| uv pip install -e "$lmcache_path" |
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 965-965: Double quote to prevent globbing and word splitting.
(SC2086)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh` around lines 962 - 966, The log
call references lmcache_path before it’s set, so move or change the log to use
the variable after it’s declared; specifically, declare local
lmcache_path="${lmcache_config["repo"]}" before calling log, then call log
"Setting up LMCache; installing LMCache using: uv pip install $lmcache_path" and
finally run uv pip install -e $lmcache_path (functions/variables: lmcache_path,
lmcache_config, log, uv pip install).
| bash "${INSTALL_DIR}/$script" \ | ||
| --install_dir "$INSTALL_DIR" \ | ||
| --result_dir "$RESULTS_DIR" \ | ||
| --model "${dynamo_args["model"]}" \ | ||
| --url "http://${dynamo_args["frontend-node"]}" \ | ||
| --port "${dynamo_args["port"]}" \ | ||
| --endpoint "${dynamo_args["endpoint"]}" \ | ||
| --gpus_per_node "$(_gpus_per_node)" \ | ||
| --decode-connector "${decode_args["--connector"]}" \ | ||
| --prefill-connector "${prefill_args["--connector"]}" \ | ||
| --kvbm_metrics_port "${DYN_KVBM_METRICS_PORT:-6880}" \ | ||
| --decode-nodes "${decode_config["node-list"]}" \ | ||
| "${config_arr[@]}" \ | ||
| -- "${args_arr[@]}" > "${RESULTS_DIR}/$workload_name.log" 2>&1 |
There was a problem hiding this comment.
Port is silently dropped from the genai-perf URL.
launch_workload passes --url "http://${dynamo_args["frontend-node"]}" (no port) and --port "${dynamo_args["port"]}" separately. genai_perf.sh's process_args has no --port case; the flag falls into the --*) shift 2 catch-all and is silently consumed. The constructed --url therefore defaults to port 80, which diverges from dynamo's actual port (default $((8080 + SLURM_JOBID % 100)) or whatever --dynamo-port specifies).
🐛 Proposed fix
Embed the port in the URL so genai_perf.sh receives the complete address, and drop the now-redundant --port argument:
bash "${INSTALL_DIR}/$script" \
--install_dir "$INSTALL_DIR" \
--result_dir "$RESULTS_DIR" \
--model "${dynamo_args["model"]}" \
- --url "http://${dynamo_args["frontend-node"]}" \
- --port "${dynamo_args["port"]}" \
+ --url "http://${dynamo_args["frontend-node"]}:${dynamo_args["port"]}" \
--endpoint "${dynamo_args["endpoint"]}" \📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| bash "${INSTALL_DIR}/$script" \ | |
| --install_dir "$INSTALL_DIR" \ | |
| --result_dir "$RESULTS_DIR" \ | |
| --model "${dynamo_args["model"]}" \ | |
| --url "http://${dynamo_args["frontend-node"]}" \ | |
| --port "${dynamo_args["port"]}" \ | |
| --endpoint "${dynamo_args["endpoint"]}" \ | |
| --gpus_per_node "$(_gpus_per_node)" \ | |
| --decode-connector "${decode_args["--connector"]}" \ | |
| --prefill-connector "${prefill_args["--connector"]}" \ | |
| --kvbm_metrics_port "${DYN_KVBM_METRICS_PORT:-6880}" \ | |
| --decode-nodes "${decode_config["node-list"]}" \ | |
| "${config_arr[@]}" \ | |
| -- "${args_arr[@]}" > "${RESULTS_DIR}/$workload_name.log" 2>&1 | |
| bash "${INSTALL_DIR}/$script" \ | |
| --install_dir "$INSTALL_DIR" \ | |
| --result_dir "$RESULTS_DIR" \ | |
| --model "${dynamo_args["model"]}" \ | |
| --url "http://${dynamo_args["frontend-node"]}:${dynamo_args["port"]}" \ | |
| --endpoint "${dynamo_args["endpoint"]}" \ | |
| --gpus_per_node "$(_gpus_per_node)" \ | |
| --decode-connector "${decode_args["--connector"]}" \ | |
| --prefill-connector "${prefill_args["--connector"]}" \ | |
| --kvbm_metrics_port "${DYN_KVBM_METRICS_PORT:-6880}" \ | |
| --decode-nodes "${decode_config["node-list"]}" \ | |
| "${config_arr[@]}" \ | |
| -- "${args_arr[@]}" > "${RESULTS_DIR}/$workload_name.log" 2>&1 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh` around lines 1040 - 1053, The
launch invocation is dropping the port because genai_perf.sh's process_args
doesn't handle a standalone --port, so include the port when building the URL
and remove the separate --port argument: change the --url argument to embed
"${dynamo_args["port"]}" (e.g.
"http://${dynamo_args["frontend-node"]}:${dynamo_args["port"]}") and delete the
`--port "${dynamo_args["port"]}"` entry in the bash invocation in ai_dynamo.sh
(the call that runs "${INSTALL_DIR}/$script" from launch_workload), so
genai_perf.sh receives the full address instead of losing the port flag.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 183-194: Replace the setattr calls that use the valid Python
identifier "model" with direct assignments (set self.prefill_worker.args.model =
self.model and self.decode_worker.args.model = self.model) and remove the
trailing "# noqa: B010" from those lines; keep using setattr for the hyphenated
keys ("model-path" and "served-model-name") on self.prefill_worker.args and
self.decode_worker.args (but remove the unnecessary "# noqa: B010" comments
there as well) so the hyphenated attributes remain accessed via
setattr(self.prefill_worker.args, "model-path", self.model) and
setattr(self.decode_worker.args, "model-path", self.model) and similarly for
"served-model-name", leaving the backend check, ValueError and return self
unchanged.
---
Duplicate comments:
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 308-313: The validation loop splits self.cmd_args.workloads on
commas but doesn't trim whitespace, so entries like " other.sh" won't match
self.cmd_args.genai_perf.script.src.name; update the logic that creates/iterates
workloads (the variable workloads from self.cmd_args.workloads.split(",")) to
strip each item (e.g., map/strip or list comprehension) before comparing to
self.cmd_args.genai_perf.script.src.name and then raise ValueError only for
truly unmatched, ensuring comparisons use the trimmed workload string.
- Around line 167-173: The prefill_worker field is always truthy because it uses
default_factory=WorkerConfig, which prevents aggregated-mode detection; change
prefill_worker to Optional[WorkerConfig] = None (remove default_factory and set
default None, using typing.Optional) so downstream checks like if
prefill_worker: work as intended, and similarly consider making decode_worker
Optional[WorkerConfig] if needed; update any Field(...) metadata
(serialization_alias/validation_alias) accordingly and then adjust code in
constraint_check, command-generation logic, and tests to handle prefill_worker
being None (add explicit None checks or safe attribute access) so
aggregated/disaggregated mode detection behaves correctly.
- Around line 376-394: constraint_check can raise TypeError because WorkerConfig
fields tensor_parallel_size and pipeline_parallel_size may be None; update
constraint_check (method name constraint_check in class handling TestRun tr) to
coerce those possibly-None fields to safe integer defaults before comparisons
and arithmetic (e.g., tp = (prefill_worker.args.tensor_parallel_size or 1), pp =
(prefill_worker.args.pipeline_parallel_size or 1) and similarly for
decode_worker), keep existing fallback logic for num_nodes, and then perform the
existing checks against self.constraints (prefill_tp_le_decode_tp and
tp_times_pp_le_gpus_per_node) using these non-None ints so comparisons and
multiplications (prefill_tp * prefill_pp, decode_tp * decode_pp) never raise.
- Around line 191-192: The function currently raises ValueError for any backend
other than "vllm" or "sglang", but "sglang_dsr1" is a valid backend per
_is_sglang_dsr1() and _set_backend_defaults(); update the backend dispatch in
ai_dynamo.py to handle "sglang_dsr1" (either as an alias of "sglang" or with its
specific setup) instead of falling into the else branch so TOML configs with
backend = "sglang_dsr1" parse correctly—modify the branch that now raises
ValueError(f"Invalid backend: {self.backend}") to include a case for
"sglang_dsr1" and reuse the same initialization/validation logic used for
"sglang" or call the dedicated sglang_dsr1 setup helper.
In `@tests/ref_data/ai-dynamo.sbatch`:
- Line 76: The --genai_perf-extra-inputs argument in the sbatch line currently
embeds JSON with unescaped double quotes causing bash to break the string; fix
it by either wrapping the entire JSON payload in single quotes (e.g.
'--genai_perf-extra-inputs '\''{"temperature":0.7,"max_tokens":128}'\'') or by
escaping the inner double quotes (e.g. --genai_perf-extra-inputs
"{\"temperature\": 0.7, \"max_tokens\": 128}") so the JSON remains valid when
parsed; update the invocation that contains --genai_perf-extra-inputs to use one
of these safe quoting approaches.
- Around line 17-26: The srun invocation contains a duplicated --export=ALL
because gen_srun_prefix() already emits it and slurm_command_gen_strategy.py
appends it again; update slurm_command_gen_strategy.py to stop adding
--export=ALL (or perform a simple dedupe) when building the srun args: locate
the code path that constructs the srun arguments in slurm_command_gen_strategy
(referencing the function/method that calls gen_srun_prefix() and the code that
appends extra flags) and either remove the explicit addition of "--export=ALL"
or filter the assembled argument list to remove duplicates before joining into
the final command string.
| if self.backend.lower() == "vllm": | ||
| setattr(self.prefill_worker.args, "model", self.model) # noqa: B010 | ||
| setattr(self.decode_worker.args, "model", self.model) # noqa: B010 | ||
| elif self.backend.lower() == "sglang": | ||
| setattr(self.prefill_worker.args, "model-path", self.model) # noqa: B010 | ||
| setattr(self.decode_worker.args, "model-path", self.model) # noqa: B010 | ||
| setattr(self.prefill_worker.args, "served-model-name", self.model) # noqa: B010 | ||
| setattr(self.decode_worker.args, "served-model-name", self.model) # noqa: B010 | ||
| else: | ||
| raise ValueError(f"Invalid backend: {self.backend}") | ||
|
|
||
| return self |
There was a problem hiding this comment.
Pipeline failure: remove unused # noqa: B010 on lines 187–190; apply direct assignment on lines 184–185.
Lines 187–190 use setattr with hyphenated attribute names ("model-path", "served-model-name") that cannot be expressed as direct assignments—ruff knows B010 doesn't apply here, so the # noqa: B010 directives are flagged as RUF100 by CI.
Lines 184–185 use setattr with "model", which is a valid Python identifier and can be a direct assignment. Replace with direct assignment and drop the # noqa: B010.
Proposed fix
if self.backend.lower() == "vllm":
- setattr(self.prefill_worker.args, "model", self.model) # noqa: B010
- setattr(self.decode_worker.args, "model", self.model) # noqa: B010
+ self.prefill_worker.args.model = self.model
+ self.decode_worker.args.model = self.model
elif self.backend.lower() == "sglang":
- setattr(self.prefill_worker.args, "model-path", self.model) # noqa: B010
- setattr(self.decode_worker.args, "model-path", self.model) # noqa: B010
- setattr(self.prefill_worker.args, "served-model-name", self.model) # noqa: B010
- setattr(self.decode_worker.args, "served-model-name", self.model) # noqa: B010
+ setattr(self.prefill_worker.args, "model-path", self.model)
+ setattr(self.decode_worker.args, "model-path", self.model)
+ setattr(self.prefill_worker.args, "served-model-name", self.model)
+ setattr(self.decode_worker.args, "served-model-name", self.model)🧰 Tools
🪛 GitHub Actions: CI
[error] 187-190: RUF100 Unused noqa directive (B010) detected on multiple lines. Remove unused noqa directives.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py` around lines 183 - 194, Replace
the setattr calls that use the valid Python identifier "model" with direct
assignments (set self.prefill_worker.args.model = self.model and
self.decode_worker.args.model = self.model) and remove the trailing "# noqa:
B010" from those lines; keep using setattr for the hyphenated keys ("model-path"
and "served-model-name") on self.prefill_worker.args and self.decode_worker.args
(but remove the unnecessary "# noqa: B010" comments there as well) so the
hyphenated attributes remain accessed via setattr(self.prefill_worker.args,
"model-path", self.model) and setattr(self.decode_worker.args, "model-path",
self.model) and similarly for "served-model-name", leaving the backend check,
ValueError and return self unchanged.
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | ||
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | ||
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | ||
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 |
There was a problem hiding this comment.
None values not handled: when prefill_worker exists but prefill_worker.args.tensor_parallel_size is None, prefill_tp becomes None instead of defaulting to 1. Same for pipeline_parallel_size. This will cause TypeError on lines 393 (multiplication) and 386 (comparison).
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | |
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | |
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | |
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 | |
| prefill_tp = (prefill_worker.args.tensor_parallel_size or 1) if prefill_worker else 1 | |
| decode_tp = (decode_worker.args.tensor_parallel_size or 1) if decode_worker else 1 | |
| prefill_pp = (prefill_worker.args.pipeline_parallel_size or 1) if prefill_worker else 1 | |
| decode_pp = (decode_worker.args.pipeline_parallel_size or 1) if decode_worker else 1 |
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" | ||
| assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer" |
There was a problem hiding this comment.
Error messages reference wrong field names - should be decode_worker.num_nodes and prefill_worker.num_nodes, not dynamo.num_decode_nodes and dynamo.num_prefill_nodes
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" | |
| assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer" | |
| assert isinstance(prefill_n, int), "prefill_worker.num_nodes must be an integer" | |
| assert isinstance(decode_n, int), "decode_worker.num_nodes must be an integer" |
| --workloads) | ||
| dynamo_args["workloads"]="$2" ;; |
There was a problem hiding this comment.
Duplicate case for --workloads already handled on line 152 above
The hf_home_path may only exist on Slurm compute nodes (e.g., /mnt/vast/...) and not on the gateway/login node where cloudai runs. Downgrade the validation from a hard error to a warning so installation can proceed. Co-authored-by: Cursor <cursoragent@cursor.com>
When hf_home_path points to a filesystem only mounted on compute nodes (e.g., /mnt/vast/...), skip the local model download and installation check. The model is assumed to be pre-staged on the compute nodes where the container mount will resolve. Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace setattr with direct assignment for vllm model args (B010) and remove unused # noqa: B010 from sglang args (RUF100) - Add hf_home_path mount to ai-dynamo sbatch reference data - Fix test_container_mounts to expect the hf_home_path mount - Remove stale huggingface_home_host_path from acceptance test Co-authored-by: Cursor <cursoragent@cursor.com>
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | ||
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | ||
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | ||
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 |
There was a problem hiding this comment.
When prefill_worker exists but tensor_parallel_size or pipeline_parallel_size are None (which is allowed per WorkerBaseArgs lines 96-105), these will be assigned None instead of defaulting to 1. This causes TypeError on line 393 and 400 when doing arithmetic. Use or 1 to handle None:
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | |
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | |
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | |
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 | |
| prefill_tp = (prefill_worker.args.tensor_parallel_size if prefill_worker else None) or 1 | |
| decode_tp = (decode_worker.args.tensor_parallel_size if decode_worker else None) or 1 | |
| prefill_pp = (prefill_worker.args.pipeline_parallel_size if prefill_worker else None) or 1 | |
| decode_pp = (decode_worker.args.pipeline_parallel_size if decode_worker else None) or 1 |
Co-authored-by: Cursor <cursoragent@cursor.com>
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | ||
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | ||
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | ||
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 | ||
| prefill_nodes = prefill_worker.num_nodes if prefill_worker else 0 | ||
| decode_nodes = decode_worker.num_nodes if decode_worker else 1 |
There was a problem hiding this comment.
tensor_parallel_size and pipeline_parallel_size can be None per WorkerBaseArgs definition (lines 96-105). When these are None, the conditional if prefill_worker evaluates to True but prefill_worker.args.tensor_parallel_size is None, causing TypeError on lines 393 and 400 during arithmetic operations.
| prefill_tp = prefill_worker.args.tensor_parallel_size if prefill_worker else 1 | |
| decode_tp = decode_worker.args.tensor_parallel_size if decode_worker else 1 | |
| prefill_pp = prefill_worker.args.pipeline_parallel_size if prefill_worker else 1 | |
| decode_pp = decode_worker.args.pipeline_parallel_size if decode_worker else 1 | |
| prefill_nodes = prefill_worker.num_nodes if prefill_worker else 0 | |
| decode_nodes = decode_worker.num_nodes if decode_worker else 1 | |
| prefill_tp = (prefill_worker.args.tensor_parallel_size or 1) if prefill_worker else 1 | |
| decode_tp = (decode_worker.args.tensor_parallel_size or 1) if decode_worker else 1 | |
| prefill_pp = (prefill_worker.args.pipeline_parallel_size or 1) if prefill_worker else 1 | |
| decode_pp = (decode_worker.args.pipeline_parallel_size or 1) if decode_worker else 1 | |
| prefill_nodes = prefill_worker.num_nodes if prefill_worker else 0 | |
| decode_nodes = decode_worker.num_nodes if decode_worker else 1 |
- Update base_installer.py copyright to 2024-2026 - Fix taplo formatting in sglang_slurm.toml and vllm_kvbm_slurm.toml Co-authored-by: Cursor <cursoragent@cursor.com>
| # launch genai-perf | ||
| log "Launching genai-perf with args: $full_cmd" | ||
|
|
||
| eval "$full_cmd" |
There was a problem hiding this comment.
using eval with $cmdline_args that includes $extra_args creates command injection risk if genai_perf args contain shell metacharacters
| eval "$full_cmd" | |
| # Use array to safely pass arguments without eval | |
| declare -a cmd_array=("$cmd") | |
| for key in "${!genai_perf_args[@]}"; do | |
| cmd_array+=("$key" "${genai_perf_args[$key]}") | |
| done | |
| if [[ -n "$extra_args" ]]; then | |
| read -ra extra_array <<< "$extra_args" | |
| cmd_array+=("${extra_array[@]}") | |
| fi | |
| "${cmd_array[@]}" |
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" | ||
| assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer" |
There was a problem hiding this comment.
error messages reference wrong field names - these fields don't exist in the data model
| assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer" | |
| assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer" | |
| assert isinstance(prefill_n, int), "prefill_worker.num_nodes must be an integer" | |
| assert isinstance(decode_n, int), "decode_worker.num_nodes must be an integer" |
| ) | ||
|
|
||
| # Add backend-specific args | ||
| if td.cmd_args.dynamo.backend == "sglang": |
There was a problem hiding this comment.
REname to sglang_dsr1
| f"--error={out_dir / 'node-%n-stderr.txt'}", | ||
| "bash", | ||
| f"{td.script.installed_path.absolute()!s}", | ||
| f"{self.container_install_path}/{td.script.src.name}", |
There was a problem hiding this comment.
Revert back to absolute path for non-container compatibility.
| self.decode_worker.args.model_path = self.model | ||
| self.prefill_worker.args.served_model_name = self.model | ||
| self.decode_worker.args.served_model_name = self.model | ||
| else: |
There was a problem hiding this comment.
Check for sglang_dsr1.
| ) | ||
| prefill_worker: PrefillWorkerArgs | None = None | ||
| prefill_worker: WorkerConfig = Field(default_factory=WorkerConfig) | ||
| prefill_cmd: str = Field( |
There was a problem hiding this comment.
Move it inside worker config.
| extra_config_enable_nixl_storage: bool = True | ||
| extra_config_nixl_backend: str = "GDS_MT" | ||
| extra_config_nixl_file_pool_size: int = 64 | ||
| extra_config_nixl_path: str = "%CACHEDIR%" |
There was a problem hiding this comment.
Remove this arg. Handled in bash script.
| url="https://github.com/ai-dynamo/dynamo.git", commit="f7e468c7e8ff0d1426db987564e60572167e8464" | ||
| url="https://github.com/ai-dynamo/dynamo.git", | ||
| commit="f7e468c7e8ff0d1426db987564e60572167e8464", | ||
| mount_as="/git/dynamo", |
There was a problem hiding this comment.
Use absolute path as earlier.
| _hf_model: HFModel | None = None | ||
| constraints: Constraints = Constraints() | ||
|
|
||
| def success_marker(self) -> str: |
| model_config = ConfigDict(extra="allow", populate_by_name=True) | ||
|
|
||
| model: str = "Qwen/Qwen3-0.6B" | ||
| backend: str = "vllm" |
There was a problem hiding this comment.
Change it to Literal
| run_script: str = "" | ||
| lmcache: LMCache = Field(default_factory=LMCache) | ||
| genai_perf: GenAIPerf = Field(default_factory=GenAIPerf) | ||
| workloads: str = "genai_perf.sh" |
There was a problem hiding this comment.
Move it to AiDynamoTestDefinition and use list of literals.
| """Validate test definition.""" | ||
| # Populate git_repos list with all git repositories used by this test definition. | ||
| self.git_repos = [self.dynamo_repo] | ||
| if self.cmd_args.lmcache.repo: |
There was a problem hiding this comment.
Move it to installables.
Summary
Updated AI Dynamo workload to support more workloads. The other workloads will be part of a follow-on PR.
Test plan
test_container_mountsfailure)test_sbatch_generation[ai-dynamo]passes with updated referenceruff checkreports no new lint errors (all flagged issues are pre-existing)Made with Cursor