Create tests/distributed/test_mnnvl_alltoall.py #35241
puririshi98 wants to merge 25 commits into vllm-project:main from
Conversation
all 5 tests pass on 8xh100 w/ latest nvidia stack

Signed-off-by: Rishi Puri <riship@nvidia.com>
Code Review
The pull request introduces a new test file for MNNVL AllToAll operations, ensuring the correct functionality and initialization of FlashInfer components within a distributed environment. The tests cover manager initialization, workspace reinitialization, and the ensure_initialized method, as well as a custom communicator wrapper. The setup correctly handles multi-GPU environments and checks for necessary system capabilities like SYS_PTRACE.
One area for improvement is the broad exception handling in has_sys_ptrace_capability, which could mask underlying issues.
```python
        except Exception:
            pass
```
Catching a generic Exception can hide specific issues that might arise while reading or parsing /proc/self/status. It's generally better to catch more specific exceptions (e.g., IOError, ValueError) to avoid masking other potential bugs. While this function is only a capability check, more precise exception handling would improve maintainability and debugging.
```diff
-        except Exception:
-            pass
+        except (IOError, ValueError) as e:
+            # Log the error for debugging purposes, but continue with alternative checks
+            print(f"Warning: Error reading /proc/self/status: {e}")
```
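Putting the suggestion together, a narrow-catch version of the helper might look like the sketch below. The function name and the parsing of the `CapEff` line are assumptions about the PR's implementation, not code from it; on Linux, CAP_SYS_PTRACE is bit 19 of the `CapEff` capability bitmask in /proc/self/status.

```python
# Sketch of a capability check with narrow exception handling.
# Names here are illustrative, not taken from the PR.
CAP_SYS_PTRACE = 19  # bit index of CAP_SYS_PTRACE in the CapEff bitmask


def has_sys_ptrace_capability() -> bool:
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("CapEff:"):
                    # CapEff is a hex bitmask of effective capabilities.
                    cap_eff = int(line.split()[1], 16)
                    return bool(cap_eff & (1 << CAP_SYS_PTRACE))
    except (IOError, ValueError) as e:
        # Narrow catch: only file-access and parse errors are expected here;
        # anything else should surface as a real bug.
        print(f"Warning: Error reading /proc/self/status: {e}")
    return False
```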
Hi @puririshi98, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Rishi Puri <riship@nvidia.com>
recent changes: on dgxh100
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
```python
        is_sequence_parallel=True,
    )

    # Validate dispatch produces expected shapes
```
Can we check the values match as well?
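Agreed; a tolerance-based comparison would catch value corruption that shape checks miss. Inside the test, `torch.testing.assert_close` would be the natural tool. As a dependency-free sketch of the same pattern (helper name and tolerances are illustrative):

```python
import math


def assert_allclose(actual, expected, rtol=1e-5, atol=1e-8):
    # Element-wise closeness check, mirroring what
    # torch.testing.assert_close does for flat sequences.
    assert len(actual) == len(expected), "length mismatch"
    for i, (a, e) in enumerate(zip(actual, expected)):
        assert math.isclose(a, e, rel_tol=rtol, abs_tol=atol), (
            f"mismatch at index {i}: {a} != {e}"
        )


# Values matching within tolerance pass; a corrupted element would raise.
assert_allclose([1.0, 2.0, 3.0], [1.0, 2.0000001, 3.0])
```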
/gemini review
Code Review
This pull request introduces a new, comprehensive test suite for MNNVL AllToAll operations, which is a valuable addition for ensuring the correctness of distributed communication. The tests are well-structured, covering initialization, re-initialization, and data communication in a multi-GPU setup. My review identified a few areas for improvement. Most critically, one of the key validation tests for FlashInferAllToAllManager does not actually invoke the methods it's intended to test, which could lead to a false sense of security. I've also pointed out the use of hardcoded network ports, which could cause test flakiness in a CI environment, and a misleading docstring in another test. Addressing these points will significantly improve the robustness and reliability of this new test suite.
```python
    # Test dispatch_router_logits
    print(
        f"[Rank {rank}] Testing FlashInfer dispatch_router_logits vs reference"
    )
    ref_hidden, ref_router = reference_manager.dispatch_router_logits(
        hidden_states.clone(), router_logits.clone(), is_sequence_parallel=True
    )
```
The flashinfer_data_communication_worker initializes flashinfer_manager but then only uses reference_manager (AgRsAll2AllManager) for the dispatch_router_logits call (and subsequent dispatch and combine calls). This means the test is not actually validating the FlashInferAllToAllManager implementation against the reference, which seems to be the intent of this test. The flashinfer_manager's methods should be called and their outputs compared against the reference manager's outputs to ensure correctness.
```python
    # Use spawn method for CUDA compatibility
    mp.set_start_method("spawn", force=True)

    port = "12355"
```
Hardcoding ports can lead to flaky tests, as the port might be in use by another process on the CI machine. This can cause intermittent test failures. It's better to dynamically find a free port for each test run. This issue is present in all test functions in this file that set up a multiprocessing environment (e.g., test_flashinfer_alltoall_workspace_reinitialization, test_flashinfer_alltoall_ensure_initialized, etc.).
You can add a helper function like this at the top of the file:
```python
import socket
from contextlib import closing


def _find_free_port():
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])
```

And then use it in this test and others that use hardcoded ports, for example by changing this line to `port = _find_free_port()`.
```python
    This test validates that the FlashInferAllToAllManager correctly
    communicates data by comparing against reference backends.
```
The docstring for this worker function states that it validates FlashInferAllToAllManager. However, the implementation only uses AgRsAll2AllManager. This is misleading and seems to be a copy-paste error. The docstring should be corrected to reflect what is actually being tested.
```diff
-    This test validates that the FlashInferAllToAllManager correctly
-    communicates data by comparing against reference backends.
+    This test validates that the AgRsAll2AllManager correctly
+    communicates data.
```
@leo-cf-tian can you take a look? This should depend on your PR #36022.
Signed-off-by: Rishi Puri <riship@nvidia.com>
Seems like this refers to the existing two-sided implementation (mnnvl all2allv as seen in #21003). @puririshi98 Can you confirm?
Yes, this test refers to the MNNVL all2allv implementation from PR #21003. The test file validates FlashInferAllToAllManager which:
All 5 tests pass on 8xH100 with the latest NVIDIA stack.

This is part of the NVIDIA effort to add CI to upstream GitHub.