Create tests/distributed/test_mnnvl_alltoall.py #35241
puririshi98 wants to merge 25 commits into vllm-project:main from
Conversation
all 5 tests pass on 8xh100 w/ latest nvidia stack

Signed-off-by: Rishi Puri <riship@nvidia.com>
Code Review
The pull request introduces a new test file for MNNVL AllToAll operations, ensuring the correct functionality and initialization of FlashInfer components within a distributed environment. The tests cover manager initialization, workspace reinitialization, and the ensure_initialized method, as well as a custom communicator wrapper. The setup correctly handles multi-GPU environments and checks for necessary system capabilities like SYS_PTRACE.
One area for improvement is the broad exception handling in has_sys_ptrace_capability, which could mask underlying issues.
```python
        except Exception:
            pass
```
Catching a generic Exception can hide specific issues that might arise while reading or parsing /proc/self/status. It's generally better to catch more specific exceptions (e.g., IOError, ValueError) to avoid masking other potential bugs. While this function is only a capability check, more precise exception handling would improve maintainability and debugging.
```diff
-        except Exception:
-            pass
+        except (IOError, ValueError) as e:
+            # Log the error for debugging purposes, but continue with alternative checks
+            print(f"Warning: Error reading /proc/self/status: {e}")
```
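Putting the suggestion together, a narrow-catch version of the helper might look like the sketch below. The function name and the parsing of the `CapEff` line are assumptions about the PR's implementation, not code from it; on Linux, CAP_SYS_PTRACE is bit 19 of the `CapEff` capability bitmask in /proc/self/status.

```python
# Sketch of a capability check with narrow exception handling.
# Names here are illustrative, not taken from the PR.
CAP_SYS_PTRACE = 19  # bit index of CAP_SYS_PTRACE in the CapEff bitmask


def has_sys_ptrace_capability() -> bool:
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("CapEff:"):
                    # CapEff is a hex bitmask of effective capabilities.
                    cap_eff = int(line.split()[1], 16)
                    return bool(cap_eff & (1 << CAP_SYS_PTRACE))
    except (IOError, ValueError) as e:
        # Narrow catch: only file-access and parse errors are expected here;
        # anything else should surface as a real bug.
        print(f"Warning: Error reading /proc/self/status: {e}")
    return False
```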
Hi @puririshi98, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Rishi Puri <riship@nvidia.com>
recent changes: on dgxh100
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
```python
        is_sequence_parallel=True,
    )

    # Validate dispatch produces expected shapes
```
Can we check the values match as well?
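Agreed; a tolerance-based comparison would catch value corruption that shape checks miss. Inside the test, `torch.testing.assert_close` would be the natural tool. As a dependency-free sketch of the same pattern (helper name and tolerances are illustrative):

```python
import math


def assert_allclose(actual, expected, rtol=1e-5, atol=1e-8):
    # Element-wise closeness check, mirroring what
    # torch.testing.assert_close does for flat sequences.
    assert len(actual) == len(expected), "length mismatch"
    for i, (a, e) in enumerate(zip(actual, expected)):
        assert math.isclose(a, e, rel_tol=rtol, abs_tol=atol), (
            f"mismatch at index {i}: {a} != {e}"
        )


# Values matching within tolerance pass; a corrupted element would raise.
assert_allclose([1.0, 2.0, 3.0], [1.0, 2.0000001, 3.0])
```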
/gemini review
Code Review
This pull request introduces a new, comprehensive test suite for MNNVL AllToAll operations, which is a valuable addition for ensuring the correctness of distributed communication. The tests are well-structured, covering initialization, re-initialization, and data communication in a multi-GPU setup. My review identified a few areas for improvement. Most critically, one of the key validation tests for FlashInferAllToAllManager does not actually invoke the methods it's intended to test, which could lead to a false sense of security. I've also pointed out the use of hardcoded network ports, which could cause test flakiness in a CI environment, and a misleading docstring in another test. Addressing these points will significantly improve the robustness and reliability of this new test suite.
```python
    # Test dispatch_router_logits
    print(
        f"[Rank {rank}] Testing FlashInfer dispatch_router_logits vs reference"
    )
    ref_hidden, ref_router = reference_manager.dispatch_router_logits(
        hidden_states.clone(), router_logits.clone(), is_sequence_parallel=True
    )
```
The flashinfer_data_communication_worker initializes flashinfer_manager but then only uses reference_manager (AgRsAll2AllManager) for the dispatch_router_logits call (and subsequent dispatch and combine calls). This means the test is not actually validating the FlashInferAllToAllManager implementation against the reference, which seems to be the intent of this test. The flashinfer_manager's methods should be called and their outputs compared against the reference manager's outputs to ensure correctness.
```python
    # Use spawn method for CUDA compatibility
    mp.set_start_method("spawn", force=True)

    port = "12355"
```
Hardcoding ports can lead to flaky tests, as the port might be in use by another process on the CI machine. This can cause intermittent test failures. It's better to dynamically find a free port for each test run. This issue is present in all test functions in this file that set up a multiprocessing environment (e.g., test_flashinfer_alltoall_workspace_reinitialization, test_flashinfer_alltoall_ensure_initialized, etc.).
You can add a helper function like this at the top of the file:
```python
import socket
from contextlib import closing


def _find_free_port():
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])
```

And then use it in this test and others that use hardcoded ports, for example by changing this line to `port = _find_free_port()`.
```python
    This test validates that the FlashInferAllToAllManager correctly
    communicates data by comparing against reference backends.
```
The docstring for this worker function states that it validates FlashInferAllToAllManager. However, the implementation only uses AgRsAll2AllManager. This is misleading and seems to be a copy-paste error. The docstring should be corrected to reflect what is actually being tested.
```diff
-    This test validates that the FlashInferAllToAllManager correctly
-    communicates data by comparing against reference backends.
+    This test validates that the AgRsAll2AllManager correctly
+    communicates data.
```
@leo-cf-tian can you take a look? This should depend on your PR #36022.
Signed-off-by: Rishi Puri <riship@nvidia.com>
Seems like this refers to the existing two-sided implementation (mnnvl all2allv as seen in #21003). @puririshi98 Can you confirm?
Yes, this test refers to the MNNVL all2allv implementation from PR #21003. The test file validates FlashInferAllToAllManager which:
All 5 tests pass on 8xH100 with the latest NVIDIA stack.

This is part of the NVIDIA effort to add CI to upstream GitHub.