Skip to content

Comments

Single sbatch + NIXL + ETCD issues#812

Merged
podkidyshev merged 6 commits intomainfrom
ipod/CLOUDAI-18/nixl-etcd-fail
Feb 20, 2026
Merged

Single sbatch + NIXL + ETCD issues#812
podkidyshev merged 6 commits intomainfrom
ipod/CLOUDAI-18/nixl-etcd-fail

Conversation

@podkidyshev
Copy link
Contributor

@podkidyshev podkidyshev commented Feb 19, 2026

Summary

Fixes CLOUDAI-18. Now the ETCD process is gracefully terminated.

  • It was:

    1. SIGKILL sent to the srun process and killed instantly (local srun process, not the actual one on the compute node).
    2. It takes a few moments until slurm detects it and starts to kill the actuall ETCD process on the compute node.
    3. However our wait mechanism didn't work because the local srun process was already dead (pid doesn't exist)
  • Now:

    1. SIGTERM is sent. So that the local srun process sends the signal to the slurm scheduler
    2. Slurm scheduler starts killing the etcd on the compute node (local srun is still alive - pid exists - our wait works)
    3. After slurm finishes killing ETCD, it kills the local srun and our wait mechanism releases

Test Plan

Reproduced the issue according to the description in the base redmine ticket. After that applied the fix and executed a few times - no failures spotted.

Additional Notes

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Walkthrough

Removed the PID kill-and-wait helper from the NIXL command generator and replaced all call sites with a single graceful SIGTERM (kill -TERM $etcd_pid) in command-generation code and reference SLURM scripts; minor header year updates only.

Changes

Cohort / File(s) Summary
Core removal
src/cloudai/workloads/common/nixl.py
Removed gen_kill_and_wait_cmd(self, pid_var: str, timeout: int = 60) -> list[str] from NIXLCmdGenBase.
Command generation updates
src/cloudai/workloads/nixl_bench/slurm_command_gen_strategy.py, src/cloudai/workloads/nixl_kvbench/slurm_command_gen_strategy.py, src/cloudai/workloads/nixl_perftest/slurm_command_gen_strategy.py
Replaced dynamic gen_kill_and_wait_cmd("etcd_pid") calls with the literal "kill -TERM $etcd_pid\n" in generated srun/slurm command sequences; header year tweaks.
Reference SLURM scripts
tests/ref_data/nixl-kvbench.sbatch, tests/ref_data/nixl-perftest.sbatch, tests/ref_data/nixl_bench.sbatch
Replaced forceful kill -9 plus timeout/wait loop with a single kill -TERM $etcd_pid (removed the wait/timeout and related diagnostic/error paths).
Tests
tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py
Removed the test function that asserted gen_kill_and_wait_cmd behavior (test deleted).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I tapped the PID with gentle TERM, not spite,
A quiet hush replaces furious fight.
No countdowns, no alarms to sting—
I sent a calm, small ending wing. 🌿

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title references ETCD and NIXL issues related to the changeset, but is somewhat vague and doesn't clearly convey the main change (graceful termination instead of forced kill).
Description check ✅ Passed The description is directly related to the changeset, explaining the rationale for changing from SIGKILL to SIGTERM and how it fixes the wait mechanism issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ipod/CLOUDAI-18/nixl-etcd-fail

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@podkidyshev podkidyshev marked this pull request as ready for review February 19, 2026 14:05
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 19, 2026

Greptile Summary

Changes SIGKILL to SIGTERM for ETCD process termination to enable graceful shutdown. Previously, sending SIGKILL immediately killed the local srun process, causing the wait mechanism to fail because the PID no longer existed while the actual ETCD process on the compute node was still running. With SIGTERM, the local srun process forwards the signal to Slurm, remains alive during ETCD shutdown, and allows the wait mechanism to function correctly.

Key changes:

  • Modified gen_kill_and_wait_cmd() in src/cloudai/workloads/common/nixl.py to use kill -TERM instead of kill -9
  • Updated all test reference sbatch files to reflect the new signal type
  • Updated unit test expectations to match the new command generation
  • Added missing newlines at end of two test reference files
  • Updated copyright year to 2025-2026

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change is well-scoped, fixes a specific race condition issue, and all test references have been consistently updated. The fix correctly addresses the root cause by allowing the local srun process to remain alive during ETCD termination, ensuring the wait mechanism works as intended. All previous review feedback has been addressed.
  • No files require special attention

Important Files Changed

Filename Overview
src/cloudai/workloads/common/nixl.py Changed kill -9 to kill -TERM in gen_kill_and_wait_cmd() to allow graceful ETCD termination via SIGTERM, updated copyright year to 2025-2026
tests/ref_data/nixl-kvbench.sbatch Test reference data updated to reflect kill -TERM change, added missing newline at end of file
tests/ref_data/nixl-perftest.sbatch Test reference data updated to reflect kill -TERM change
tests/ref_data/nixl_bench.sbatch Test reference data updated to reflect kill -TERM change, added missing newline at end of file
tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py Unit test updated to expect kill -TERM instead of kill -9 in generated command

Sequence Diagram

sequenceDiagram
    participant Script as sbatch script
    participant LocalSrun as Local srun process
    participant Slurm as Slurm Scheduler
    participant ETCD as ETCD on compute node
    
    Note over Script,ETCD: Old Behavior (SIGKILL)
    Script->>LocalSrun: srun etcd &
    LocalSrun->>Slurm: Submit ETCD job
    Slurm->>ETCD: Start ETCD process
    Script->>LocalSrun: kill -9 $etcd_pid
    LocalSrun--xLocalSrun: Killed instantly
    Script->>LocalSrun: wait (kill -0 check)
    Note over Script,LocalSrun: PID doesn't exist - wait fails
    Slurm->>ETCD: Eventually kills ETCD
    ETCD--xETCD: Terminated (not graceful)
    
    Note over Script,ETCD: New Behavior (SIGTERM)
    Script->>LocalSrun: srun etcd &
    LocalSrun->>Slurm: Submit ETCD job
    Slurm->>ETCD: Start ETCD process
    Script->>LocalSrun: kill -TERM $etcd_pid
    LocalSrun->>Slurm: Forward SIGTERM signal
    Note over LocalSrun: Still alive, PID exists
    Slurm->>ETCD: Gracefully terminate ETCD
    ETCD->>ETCD: Cleanup and shutdown
    ETCD-->>Slurm: Process exited
    Slurm->>LocalSrun: Kill local srun
    LocalSrun-->>Script: Process terminated
    Script->>Script: wait completes successfully
Loading

Last reviewed commit: 256ff50

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/workloads/nixl_perftest/slurm_command_gen_strategy.py (1)

1-2: ⚠️ Potential issue | 🔴 Critical

CI-blocking: copyright year not updated.

The other modified strategy files (nixl_bench, nixl_kvbench) and common/nixl.py were all bumped to 2025-2026, but this file was missed, causing the pipeline to fail.

🐛 Proposed fix
-# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/nixl_perftest/slurm_command_gen_strategy.py` around
lines 1 - 2, Update the header comment years in this file to match the other
modified strategy files by changing the SPDX/copyright lines from 2025 to
2025-2026 (i.e., bump the copyright range in the top-of-file comments so they
match the format used in nixl_bench, nixl_kvbench, and common/nixl).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/cloudai/workloads/nixl_perftest/slurm_command_gen_strategy.py`:
- Around line 1-2: Update the header comment years in this file to match the
other modified strategy files by changing the SPDX/copyright lines from 2025 to
2025-2026 (i.e., bump the copyright range in the top-of-file comments so they
match the format used in nixl_bench, nixl_kvbench, and common/nixl).

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@podkidyshev podkidyshev force-pushed the ipod/CLOUDAI-18/nixl-etcd-fail branch from 9987bb4 to 842449c Compare February 19, 2026 17:32
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@podkidyshev podkidyshev marked this pull request as draft February 20, 2026 15:54
@podkidyshev podkidyshev changed the title Fixes CLOUDAI-18: Single sbatch + NIXL + ETCD issues Single sbatch + NIXL + ETCD issues Feb 20, 2026
@podkidyshev podkidyshev marked this pull request as ready for review February 20, 2026 16:18
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@podkidyshev podkidyshev merged commit d00e042 into main Feb 20, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/CLOUDAI-18/nixl-etcd-fail branch February 20, 2026 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants