Add distributed launcher support for linex and metrix #87
Open
- Add `DistributedContext` dataclass and env detection for torchrun, mpirun, srun, horovodrun
- Linex: rank-scoped output dirs, `RankProfile` objects, MCP per-rank hotspots
- Metrix: rank metadata in `ProfileResult`/`KernelResults`/`ProfilingResults`, rank-suffixed output files
- CLI: `argparse.REMAINDER` for `-- launcher ...` syntax
- Both: `normalize_command_argv` with shlex, accept `str | Sequence[str]`
- Tests for distributed helpers, shlex parsing, rank field propagation

Note: command construction is still rocprofv3-wraps-launcher (wrong order). Next step: fix to launcher-wraps-rocprofv3 for correct distributed profiling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
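The environment detection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the field names come from the PR description, while `detect_distributed_context`, the `_LAUNCHER_ENV` table, and the `is_distributed` property are hypothetical names introduced here. It assumes the environment variables commonly exported by torchrun, Open MPI, Slurm, and Horovod.

```python
import os
import socket
from dataclasses import dataclass
from typing import Mapping, Optional

# (launcher, global-rank var, local-rank var, world-size var).
# Assumed mapping; checked in order, so torchrun wins if several are set.
_LAUNCHER_ENV = (
    ("torchrun", "RANK", "LOCAL_RANK", "WORLD_SIZE"),
    ("mpirun", "OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_LOCAL_RANK", "OMPI_COMM_WORLD_SIZE"),
    ("srun", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"),
    ("horovodrun", "HOROVOD_RANK", "HOROVOD_LOCAL_RANK", "HOROVOD_SIZE"),
)


@dataclass(frozen=True)
class DistributedContext:
    """Rank metadata for the current process (defaults describe a single process)."""

    global_rank: int = 0
    local_rank: int = 0
    world_size: int = 1
    hostname: str = ""
    launcher: Optional[str] = None

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1


def detect_distributed_context(env: Optional[Mapping[str, str]] = None) -> DistributedContext:
    """Hypothetical helper: build a DistributedContext from launcher-set variables."""
    env = os.environ if env is None else env
    hostname = socket.gethostname()
    for launcher, rank_var, local_var, size_var in _LAUNCHER_ENV:
        if rank_var in env and size_var in env:
            return DistributedContext(
                global_rank=int(env[rank_var]),
                local_rank=int(env.get(local_var, 0)),
                world_size=int(env[size_var]),
                hostname=hostname,
                launcher=launcher,
            )
    # No launcher variables found: treat as a plain single-process run.
    return DistributedContext(hostname=hostname)
```

Passing the environment as a mapping, rather than reading `os.environ` directly, keeps the helper easy to unit-test without patching global state.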
Previously, distributed commands like `torchrun --nproc_per_node=8 train.py` produced `rocprofv3 ... -- torchrun --nproc_per_node=8 train.py`, which is wrong: rocprofv3 would profile the launcher process, not the per-rank GPU work.

Now we split the command into launcher args and app args, producing `torchrun --nproc_per_node=8 rocprofv3 ... -- train.py`. The launcher spawns N processes, each running rocprofv3 around the app.

Changes:
- Add `split_launcher_command()` to both `distributed.py` modules
- Handles torchrun, `python -m torch.distributed.*`, mpirun/mpiexec, srun, horovodrun
- Update `linex/api.py` and `metrix/rocprof_wrapper.py` to use launcher wrapping
- Add tests verifying correct command ordering for all launcher types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of trying to parse launcher flags from a combined command string (fragile, requires hardcoded flag sets per launcher), let the user provide the launcher and app commands separately:

    # Python API
    profiler.profile(command="train.py", launcher="torchrun --nproc_per_node=8")

    # Metrix CLI
    metrix profile --launcher "torchrun --nproc_per_node=8" -- train.py

This is unambiguous, works with any launcher (including custom ones), and requires no flag-parsing maintenance.

- Remove `split_launcher_command()` and all `_split_*` helpers
- Add `launcher` parameter to `Linex.profile()`, `Metrix.profile()`, `ROCProfV3Wrapper.profile()`, `CounterBackend.profile()`, all backend implementations, CLI (`--launcher` flag), and MCP tools
- Update tests and READMEs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
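To make the intended ordering concrete, here is a small sketch of how a separate launcher and app command could be combined. `normalize_command_argv` is named in the PR, but its body here is an assumption; `build_profile_argv` and the `--kernel-trace` profiler argument are illustrative choices, not taken from the diff.

```python
import shlex
from typing import List, Optional, Sequence, Union

Command = Union[str, Sequence[str]]


def normalize_command_argv(command: Command) -> List[str]:
    """Accept a shell-style string or an argv sequence and return an argv list."""
    if isinstance(command, str):
        return shlex.split(command)  # honours quoting, unlike str.split()
    return [str(part) for part in command]


def build_profile_argv(
    command: Command,
    profiler_argv: Sequence[str],
    launcher: Optional[Command] = None,
) -> List[str]:
    """Hypothetical helper: launcher first, then the profiler, then `--` and the app."""
    app_argv = normalize_command_argv(command)
    if not app_argv:
        raise ValueError("command must not be empty")
    launcher_argv = normalize_command_argv(launcher) if launcher else []
    return [*launcher_argv, *profiler_argv, "--", *app_argv]


# Each of the 8 MPI ranks runs its own profiler around the app.
argv = build_profile_argv(
    "./app --steps 10",
    ["rocprofv3", "--kernel-trace"],
    launcher="mpirun -np 8",
)
print(" ".join(argv))
```

Because the launcher is passed through untouched, no per-launcher flag table is needed; the cost is that the caller must know which part of the command line is the launcher.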
Pull request overview
Adds first-class “distributed launcher” support to Linex and Metrix so profiling commands can be invoked as `launcher rocprofv3 ... -- app`, and introduces rank metadata propagation/suffixing utilities to avoid output clobbering.
Changes:
- Add distributed context detection + argv normalization helpers (shlex-based) for both Metrix and Linex.
- Extend Metrix/Linex APIs, CLI, and MCP tools with an explicit `launcher` parameter and propagate rank metadata into result objects/output.
- Add unit tests covering env detection, argv normalization, rank suffixing, and launcher command ordering.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| metrix/tests/unit/test_rocprof_wrapper.py | Adds tests for shlex parsing, rank metadata propagation, and launcher command ordering. |
| metrix/tests/unit/test_distributed.py | New unit tests for distributed helpers (env detection, argv normalization, rank suffixing). |
| metrix/src/metrix/utils/distributed.py | New distributed helper module (context detection, shlex argv normalization, rank suffixing). |
| metrix/src/metrix/profiler/rocprof_wrapper.py | Accepts launcher/env, uses shlex parsing, and annotates ProfileResult with distributed metadata. |
| metrix/src/metrix/mcp/server.py | Adds launcher param to MCP tool and includes rank metadata in the response. |
| metrix/src/metrix/cli/profile_cmd.py | Adds --launcher plumbing, remainder target parsing normalization, and rank-aware output formatting/suffixing. |
| metrix/src/metrix/cli/main.py | Adds --launcher flag and switches target parsing to argparse.REMAINDER. |
| metrix/src/metrix/backends/gfx942.py | Updates backend signatures to accept launcher (but currently not forwarded). |
| metrix/src/metrix/backends/gfx90a.py | Updates backend signatures to accept launcher (but currently not forwarded). |
| metrix/src/metrix/backends/gfx1201.py | Updates backend signatures to accept launcher (but currently not forwarded). |
| metrix/src/metrix/backends/base.py | Adds rank fields to ProfileResult, adds launcher to API, and adds rank-prefixed aggregation keys. |
| metrix/src/metrix/api.py | Adds launcher support and rank metadata to ProfilingResults/KernelResults. |
| metrix/README.md | Documents distributed launcher usage and rank-suffixed outputs. |
| linex/tests/test_distributed_api.py | New tests for distributed helpers, rank-scoped output, deterministic ui dir choice, and launcher ordering. |
| linex/src/linex/mcp/server.py | Adds launcher plumbing + per-rank outputs to MCP responses (currently contains a syntax error). |
| linex/src/linex/distributed.py | New distributed helper module for Linex (context detection + argv normalization). |
| linex/src/linex/api.py | Adds distributed context tracking, rank-scoped output dirs, launcher support, and RankProfile aggregation. |
| linex/src/linex/`__init__.py` | Exports RankProfile in the public API. |
| linex/README.md | Documents distributed launcher usage and new distributed properties. |
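The rank suffixing and rank-scoped directories listed in the table can be illustrated with a short sketch. The function name `apply_rank_suffix` and the `results.rank0003.json` and `rank0000/` patterns appear in this PR; the signature, the behaviour for extension-less paths, and `rank_output_dir` are assumptions made for illustration.

```python
from pathlib import Path
from typing import Union


def apply_rank_suffix(path: Union[str, Path], global_rank: int, world_size: int) -> Path:
    """Insert a zero-padded rank tag before the extension (assumed behaviour)."""
    p = Path(path)
    if world_size <= 1:
        return p  # single process: leave the path unchanged
    tag = f"rank{global_rank:04d}"
    if p.suffix:
        return p.with_name(f"{p.stem}.{tag}{p.suffix}")
    # Assumption: with no extension, the tag is simply appended.
    return p.with_name(f"{p.name}.{tag}")


def rank_output_dir(base: Union[str, Path], global_rank: int) -> Path:
    """Hypothetical helper: per-rank subdirectory such as base/rank0001."""
    return Path(base) / f"rank{global_rank:04d}"


print(apply_rank_suffix("results.json", 3, 8))
print(rank_output_dir("linex_out", 1))
```

Zero-padding keeps the per-rank files and directories in rank order when listed lexicographically.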
Comments suppressed due to low confidence (3)
metrix/src/metrix/backends/gfx942.py:107
`_run_rocprof` now takes `launcher`, but the implementation ignores it when calling `ROCProfV3Wrapper.profile(...)`. This makes `launcher` a no-op for this backend. Pass `launcher=launcher` through to the wrapper call (and ensure callers forward it).

    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
    ) -> List[ProfileResult]:
        """Run rocprofv3 and return results (single pass only - base class handles multi-pass)"""
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        return wrapper.profile(command, counters, kernel_filter=kernel_filter, cwd=cwd)
metrix/src/metrix/backends/gfx1201.py:74
`_run_rocprof` accepts `launcher` but does not forward it to `ROCProfV3Wrapper.profile(...)`, so `--launcher` has no effect for gfx1201. Pass `launcher=launcher` to the wrapper call (and ensure the base class forwards it when invoking `_run_rocprof`).

    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
        kernel_iteration_range: Optional[str] = None,
    ) -> List[ProfileResult]:
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        extra_counters_path = Path(__file__).parent / "counter_defs.yaml"
        return wrapper.profile(
            command=command,
            counters=counters,
            kernel_filter=kernel_filter,
            cwd=cwd,
            kernel_iteration_range=kernel_iteration_range,
            extra_counters_path=extra_counters_path if extra_counters_path.exists() else None,
            arch=self.device_specs.arch,
        )
metrix/src/metrix/backends/gfx90a.py:107
`_run_rocprof` now accepts a `launcher` parameter, but it isn’t passed down to `ROCProfV3Wrapper.profile(...)`, so the launcher never affects the actual subprocess command. Pass `launcher=launcher` through to the wrapper call (and ensure the base class forwards it when invoking `_run_rocprof`).

    def _run_rocprof(
        self,
        command: str | Sequence[str],
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[str | Sequence[str]] = None,
        timeout_seconds: Optional[int] = 0,
    ) -> List[ProfileResult]:
        """Run rocprofv3 and return results (single pass only - base class handles multi-pass)"""
        wrapper = ROCProfV3Wrapper(timeout_seconds=timeout_seconds)
        return wrapper.profile(command, counters, kernel_filter=kernel_filter, cwd=cwd)
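The fix requested in the three comments above amounts to forwarding one keyword argument. The sketch below uses stand-in classes (`RecordingWrapper` and `SketchBackend` are hypothetical and are not the real `ROCProfV3Wrapper` or backend classes) to show the forwarding and how a unit test could catch a dropped argument.

```python
from typing import Any, Dict, List, Optional, Sequence, Union

Command = Union[str, Sequence[str]]


class RecordingWrapper:
    """Stand-in for the profiler wrapper: records what it was asked to run."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def profile(
        self,
        command: Command,
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[Command] = None,
    ) -> List[Any]:
        self.calls.append({"command": command, "launcher": launcher})
        return []  # a real wrapper would return parsed profile results


class SketchBackend:
    """Stand-in backend whose _run_rocprof forwards launcher to the wrapper."""

    def __init__(self, wrapper: RecordingWrapper) -> None:
        self.wrapper = wrapper

    def _run_rocprof(
        self,
        command: Command,
        counters: List[str],
        kernel_filter: Optional[str] = None,
        cwd: Optional[str] = None,
        launcher: Optional[Command] = None,
    ) -> List[Any]:
        return self.wrapper.profile(
            command,
            counters,
            kernel_filter=kernel_filter,
            cwd=cwd,
            launcher=launcher,  # the argument the reviewed code dropped
        )


wrapper = RecordingWrapper()
SketchBackend(wrapper)._run_rocprof("./app", ["SQ_WAVES"], launcher="mpirun -np 2")
print(wrapper.calls[0]["launcher"])
```

A recording stand-in like this lets a test assert on the forwarded arguments without needing a GPU or the profiler binary.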
… Python 3.8 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… batch path, clarify launcher semantics

- Move `profile_command` docstring above code (was displaced)
- Forward `launcher` param in `CounterBackend` recursive batch call
- Add Note sections to `Linex.profile()` and `Metrix.profile()` explaining that `launcher` is for mpirun-style use, and that for torchrun the correct pattern is running metrix/linex under torchrun (not the reverse)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- `launcher` parameter that ensures the correct command order (`launcher rocprofv3 ... -- app` instead of `rocprofv3 ... -- launcher app`)
- Distributed context detection (`global_rank`, `local_rank`, `world_size`, `hostname`, `launcher`) from environment variables set by torchrun, mpirun, srun, and horovodrun
- Linex: rank-scoped output dirs (`rank0000/`, `rank0001/`, ...), per-rank `RankProfile` objects, MCP per-rank hotspots
- Metrix: rank metadata in `ProfileResult`/`KernelResults`/`ProfilingResults`, rank-suffixed output files (`results.rank0003.json`), `--launcher` CLI flag, rank columns in CSV/JSON/text output

API:
Test plan
- `DistributedContext` env detection (torchrun, mpirun, srun)
- `normalize_command_argv` (string and sequence input)
- `apply_rank_suffix` (with/without extension, single process)

🤖 Generated with Claude Code