Skip to content

Create tests/distributed/test_mnnvl_alltoall.py#35241

Open
puririshi98 wants to merge 25 commits intovllm-project:mainfrom
puririshi98:patch-1
Open

Create tests/distributed/test_mnnvl_alltoall.py#35241
puririshi98 wants to merge 25 commits intovllm-project:mainfrom
puririshi98:patch-1

Conversation

@puririshi98
Copy link

@puririshi98 puririshi98 commented Feb 24, 2026

all 5 tests pass on 8xh100 w/ latest nvidia stack

This is part of the NVIDIA effort to add CI to upstream github

all 5 tests pass on 8xh100 w/ latest nvidia stack 

Signed-off-by: Rishi Puri <riship@nvidia.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces a new test file for MNNVL AllToAll operations, ensuring the correct functionality and initialization of FlashInfer components within a distributed environment. The tests cover manager initialization, workspace reinitialization, and the ensure_initialized method, as well as a custom communicator wrapper. The setup correctly handles multi-GPU environments and checks for necessary system capabilities like SYS_PTRACE.

One area for improvement is the broad exception handling in has_sys_ptrace_capability, which could mask underlying issues.

Comment on lines +60 to +61
except Exception:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Catching a generic Exception can hide specific issues that might arise during the file reading or parsing of /proc/self/status. It's generally better to catch more specific exceptions (e.g., IOError, ValueError) to avoid masking other potential bugs. While this function is for capability checking, a more precise exception handling would improve maintainability and debugging.

Suggested change
except Exception:
pass
except (IOError, ValueError) as e:
# Log the error for debugging purposes, but continue with alternative checks
print(f"Warning: Error reading /proc/self/status: {e}")

@mergify
Copy link

mergify bot commented Feb 24, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Copy link

mergify bot commented Feb 25, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Copy link

mergify bot commented Feb 26, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Rishi Puri <riship@nvidia.com>
@mergify
Copy link

mergify bot commented Feb 27, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Copy link

mergify bot commented Feb 27, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Copy link

mergify bot commented Feb 27, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Rishi Puri <riship@nvidia.com>
@mergify mergify bot added the ci/build label Mar 6, 2026
puririshi98 and others added 2 commits March 6, 2026 14:47
@puririshi98
Copy link
Author

puririshi98 commented Mar 9, 2026

recent changes:

tests/distributed/test_mnnvl_alltoall.py::test_flashinfer_all2all_import PASSED                                                                                                                                                              [ 12%]
tests/distributed/test_mnnvl_alltoall.py::test_flashinfer_alltoall_manager_initialization[2] PASSED                                                                                                                                          [ 25%]
tests/distributed/test_mnnvl_alltoall.py::test_flashinfer_alltoall_workspace_reinitialization[2] PASSED                                                                                                                                      [ 37%]
tests/distributed/test_mnnvl_alltoall.py::test_flashinfer_alltoall_ensure_initialized[2] PASSED                                                                                                                                              [ 50%]
tests/distributed/test_mnnvl_alltoall.py::test_alltoall_data_communication[2] PASSED                                                                                                                                                         [ 62%]
tests/distributed/test_mnnvl_alltoall.py::test_flashinfer_alltoall_data_communication[2] PASSED                                                                                                                                              [ 75%]
tests/distributed/test_mnnvl_alltoall.py::test_alltoall_deterministic_data_validation[2] PASSED                                                                                                                                              [ 87%]
tests/distributed/test_mnnvl_alltoall.py::test_custom_communicator PASSED     

on dgxh100

puririshi98 and others added 2 commits March 9, 2026 16:24
Signed-off-by: Rishi Puri <riship@nvidia.com>
@mergify
Copy link

mergify bot commented Mar 9, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Rishi Puri <riship@nvidia.com>
is_sequence_parallel=True,
)

# Validate dispatch produces expected shapes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check the values match as well?

@jasonlizhengjian
Copy link
Contributor

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new, comprehensive test suite for MNNVL AllToAll operations, which is a valuable addition for ensuring the correctness of distributed communication. The tests are well-structured, covering initialization, re-initialization, and data communication in a multi-GPU setup. My review identified a few areas for improvement. Most critically, one of the key validation tests for FlashInferAllToAllManager does not actually invoke the methods it's intended to test, which could lead to a false sense of security. I've also pointed out the use of hardcoded network ports, which could cause test flakiness in a CI environment, and a misleading docstring in another test. Addressing these points will significantly improve the robustness and reliability of this new test suite.

Comment on lines +636 to +642
# Test dispatch_router_logits
print(
f"[Rank {rank}] Testing FlashInfer dispatch_router_logits vs reference"
)
ref_hidden, ref_router = reference_manager.dispatch_router_logits(
hidden_states.clone(), router_logits.clone(), is_sequence_parallel=True
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The flashinfer_data_communication_worker initializes flashinfer_manager but then only uses reference_manager (AgRsAll2AllManager) for the dispatch_router_logits call (and subsequent dispatch and combine calls). This means the test is not actually validating the FlashInferAllToAllManager implementation against the reference, which seems to be the intent of this test. The flashinfer_manager's methods should be called and their outputs compared against the reference manager's outputs to ensure correctness.

# Use spawn method for CUDA compatibility
mp.set_start_method("spawn", force=True)

port = "12355"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding ports can lead to flaky tests, as the port might be in use by another process on the CI machine. This can cause intermittent test failures. It's better to dynamically find a free port for each test run. This issue is present in all test functions in this file that set up a multiprocessing environment (e.g., test_flashinfer_alltoall_workspace_reinitialization, test_flashinfer_alltoall_ensure_initialized, etc.).

You can add a helper function like this at the top of the file:

import socket
from contextlib import closing

def _find_free_port():
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

And then use it in this test and others that use hardcoded ports, for example by changing this line to port = _find_free_port().

Comment on lines +375 to +376
This test validates that the FlashInferAllToAllManager correctly
communicates data by comparing against reference backends.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The docstring for this worker function states that it validates FlashInferAllToAllManager. However, the implementation only uses AgRsAll2AllManager. This is misleading and seems to be a copy-paste error. The docstring should be corrected to reflect what is actually being tested.

Suggested change
This test validates that the FlashInferAllToAllManager correctly
communicates data by comparing against reference backends.
This test validates that the AgRsAll2AllManager correctly
communicates data.

@jasonlizhengjian
Copy link
Contributor

@leo-cf-tian can you take a look? Since this should be dependent on your PR #36022

Signed-off-by: Rishi Puri <riship@nvidia.com>
@mergify
Copy link

mergify bot commented Mar 10, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@leo-cf-tian
Copy link

can you take a look? Since this should be dependent on your PR #36022

Seems like this refers to the existing two-sided implementation (mnnvl all2allv as seen in #21003). @puririshi98 Can you confirm?

@mergify
Copy link

mergify bot commented Mar 10, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify
Copy link

mergify bot commented Mar 10, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@puririshi98
Copy link
Author

Can you confirm?

Yes, this test refers to the MNNVL all2allv implementation from PR #21003.

The test file validates FlashInferAllToAllManager which:

  • Uses flashinfer.comm.trtllm_alltoall.MnnvlMoe APIs
  • Is configured via VLLM_ALL2ALL_BACKEND=flashinfer_all2allv

@mergify
Copy link

mergify bot commented Mar 11, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants