
Robust signaling for coordinator inference #3563

Queued
tdene wants to merge 48 commits into NVIDIA:main from tdene:tde/robust_coordinator_signaling

Conversation

@tdene
Contributor

@tdene tdene commented Feb 24, 2026

What does this PR do ?

State Diagram

          +------>  RUNNING  <------+
          |         +--+--+         |
          |            |            |
       UNPAUSE       PAUSE          |
      (broadcast)  (idempotent)     |
          |            |            |
          |            v            |
          +-------  PAUSING         |
          |         +--+--+         |
          |            |            |
          |      EP all-reduce      |
          |       then world        |
          |        barrier          |
          |            |            |
          |            v            |
          +-------  PAUSED ---------+
          |         +--+--+--+
          |            |     |
          |        SUSPEND  STOP --------+
          |            |                 |
          |            v                 |
          |      SUSPENDING              |
          |            |                 |
          |      world barrier           |
          |            |                 |
          |            v                 v
          |      SUSPENDED ----STOP--> STOPPING
          |            |                 |
          |         RESUME          world barrier
          |            |                 |
          |            v                 v
          |       RESUMING             STOPPED
          |            |            teardown, exit
          |      world barrier
          |            |
          +-------  PAUSED
               (then UNPAUSE)
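The diagram above can be approximated as a small transition table. This is an illustrative sketch, not the PR's actual code; `EngineState` and `TRANSITIONS` are hypothetical names:

```python
# Hypothetical sketch of the state machine in the diagram above.
from enum import Enum, auto

class EngineState(Enum):
    RUNNING = auto()
    PAUSING = auto()
    PAUSED = auto()
    SUSPENDING = auto()
    SUSPENDED = auto()
    RESUMING = auto()
    STOPPING = auto()
    STOPPED = auto()

# Each state maps to the set of states it may legally move to.
TRANSITIONS = {
    EngineState.RUNNING: {EngineState.PAUSING},
    EngineState.PAUSING: {EngineState.PAUSED},
    EngineState.PAUSED: {EngineState.RUNNING, EngineState.SUSPENDING,
                         EngineState.STOPPING},
    EngineState.SUSPENDING: {EngineState.SUSPENDED},
    EngineState.SUSPENDED: {EngineState.RESUMING, EngineState.STOPPING},
    EngineState.RESUMING: {EngineState.PAUSED},
    EngineState.STOPPING: {EngineState.STOPPED},
    EngineState.STOPPED: set(),
}

def transition(current: EngineState, target: EngineState) -> EngineState:
    """Validate a transition against the table and return the new state."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

In this sketch, the EP all-reduce and world barriers in the diagram are what *drive* the PAUSING→PAUSED, SUSPENDING→SUSPENDED, RESUMING→PAUSED, and STOPPING→STOPPED edges.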

PAUSE Protocol (details)

Client           Coordinator         mp_src (engine)      mp_workers
  |--PAUSE---------->|                    |                   |
  |                  | state = PAUSED     |                   |
  |                  |---PAUSE----------->|                   |
  |                  |---PAUSE----------->| (all dp_ranks)    |
  |                  |                    |--PUB PAUSE------->|
  |                  |                    |                   |
  |                  |                    | state = PAUSING   |
  |                  |                    | EP all-reduce:    |
  |                  |                    |   report 0        |
  |                  |                    |   dummy forward   |
  |                  |                    |   until consensus |
  |                  |                    |                   |
  |                  |                    | world barrier     |
  |                  |                    | (all ranks sync)  |
  |                  |                    |                   |
  |                  |                    | state = PAUSED    |
  |                  |                    | paused.set()      |

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review may be declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@tdene tdene requested review from a team as code owners February 24, 2026 16:59
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Feb 24, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team February 24, 2026 17:00
@tdene tdene force-pushed the tde/robust_coordinator_signaling branch 2 times, most recently from 8d030c7 to d07419d on February 24, 2026 18:30
@tdene tdene force-pushed the tde/robust_coordinator_signaling branch from 24dfc98 to 84000ba on February 25, 2026 12:54
@janEbert
Contributor

Hey, this still saw a lot of active work after opening it. Is it still in a WIP state? If yes, could you mark it as a draft, please? :)

Comment on lines +180 to +185
identities = self.identities_of_data_parallel_ranks
if not identities:
raise RuntimeError("No engines connected")
idx = self._round_robin_idx % len(identities)
self._round_robin_idx = idx + 1
return identities[idx]
Contributor Author

Doing this because the list of engines connected to the coordinator is not static, so we can't build the iterator ahead of time.
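The same pattern can be shown standalone: keep only an integer index and take it modulo the *current* list length at selection time, so the engine list may grow or shrink between calls. `RoundRobin` and `next_identity` are hypothetical names for illustration:

```python
# Minimal sketch of round-robin selection over a list whose size may change
# between calls; mirrors the snippet above, but self-contained.
class RoundRobin:
    def __init__(self):
        self._idx = 0
        self.identities = []  # updated as engines connect/disconnect

    def next_identity(self):
        if not self.identities:
            raise RuntimeError("No engines connected")
        # Modulo against the current length, so a shrunken list stays safe.
        idx = self._idx % len(self.identities)
        self._idx = idx + 1
        return self.identities[idx]
```

A prebuilt iterator (e.g. `itertools.cycle`) would keep referencing the list as it was at construction time, which is exactly the problem the comment describes.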

SUBMIT_REQUEST = auto()
ENGINE_REPLY = auto()
PAUSE = auto()
PAUSE_ACK = auto()
Contributor Author

Remove the ACK logic altogether as it is very flaky.

Comment on lines 75 to -77
self.running = asyncio.Event()
self.paused = asyncio.Event()
self.stopped = asyncio.Event()
Contributor Author

None of these are actually useful. They were added when the ACK logic would propagate readiness of engine up into the client. But that's very flaky.

Contributor

ah nice, the inference client is simple again. Love this

reply = msgpack.unpackb(self.socket.recv(), raw=False)[0]
assert Headers(reply) == Headers.CONNECT_ACK

async def start(self, loop: Optional[asyncio.AbstractEventLoop] = None):
Contributor Author

Making this async serves no purpose. _connect_with_inference_coordinator already blocks until the appropriate time.

@tdene
Contributor Author

tdene commented Mar 3, 2026

@tdene how do we test that this:

  • Does not cause any slowdown on top of what we have.
  • Does not lead to the accuracy regressions

In terms of accuracy regression, I don't think we currently have the ability to test this for any PR. That said, this does not affect rollout generation, so I do not see a way in which it could cause accuracy regression.

In terms of slowdown, there are functional tests in CI to check against slowdown. This PR also activates a unit test that verifies the coordinator does not slow down (and it does not; that test had been written but left inactive for a few months).

@yobibyte
Contributor

yobibyte commented Mar 3, 2026

@tdene how do we test that this:

  • Does not cause any slowdown on top of what we have.
  • Does not lead to the accuracy regressions

In terms of accuracy regression, I don't think we currently have the ability to test this for any PR. That said, this does not affect rollout generation, so I do not see a way in which it could cause accuracy regression.

In terms of slowdown, there are functional tests in CI to check against slowdown. This PR also activates a unit test that verifies the coordinator does not slow down (and it does not; that test had been written but left inactive for a few months).

Could you, please, run just the functional tests with this before/after and compare the time it takes to finish? You can probably use my test harness for that.

@tdene tdene force-pushed the tde/robust_coordinator_signaling branch from b9c4063 to c3a315e Compare March 4, 2026 00:42
@tdene
Contributor Author

tdene commented Mar 4, 2026

@tdene how do we test that this:

  • Does not cause any slowdown on top of what we have.
  • Does not lead to the accuracy regressions

In terms of accuracy regression, I don't think we currently have the ability to test this for any PR. That said, this does not affect rollout generation, so I do not see a way in which it could cause accuracy regression.
In terms of slowdown, there are functional tests in CI to check against slowdown. This PR also activates a unit test that verifies the coordinator does not slow down (and it does not; that test had been written but left inactive for a few months).

Could you, please, run just the functional tests with this before/after and compare the time it takes to finish? You can probably use my test harness for that.

Done! Both runs took about 45 minutes, with this branch finishing slightly earlier at 02:15:16.136000 and main finishing at 02:15:29.290000.

Contributor

@yobibyte yobibyte left a comment

Hey! Left some comments. I did not check the FSM logic properly, inference folks would be a better fit for this.

self.received_stop: bool = False
self.suspend_signal = False
self.is_suspended = False
for attr in self._STATE_EVENTS.values():
Contributor

I personally dislike setattr() as they make code less readable and less observable even in the debugger. IMO, it'll be much nicer to have a self._state_events as:

self._state_events = {k: asyncio.Event() for k in _STATE_EVENTS}

With this, you do not need the string representation, and you can easily print the state by just printing the state_events dict.
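The reviewer's suggestion, written out as a runnable sketch (the `Coordinator` class and the event names are illustrative, not the PR's actual code):

```python
# Runnable sketch of keeping state events in a dict instead of setattr(),
# so the whole state is printable and debugger-friendly.
import asyncio

_STATE_EVENTS = ("running", "paused", "stopped")

class Coordinator:
    def __init__(self):
        # One asyncio.Event per named state, keyed by name.
        self._state_events = {name: asyncio.Event() for name in _STATE_EVENTS}

    def snapshot(self):
        # One-line view of the whole state, e.g. for logging.
        return {name: ev.is_set() for name, ev in self._state_events.items()}
```

With this layout, `print(coordinator.snapshot())` shows every state flag at once, which is the observability benefit the reviewer is after.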

Contributor Author

Addressed!

# coordinator.
if self.is_suspended:
# Skip if already suspended or in the process of suspending.
if self.state in (EngineState.SUSPENDED, EngineState.SUSPENDING):
Contributor

I would either add a method that checks this, or define SUS_STATES = (EngineState.SUSPENDED, EngineState.SUSPENDING) and check self.state in SUS_STATES.

This will be easier to maintain in case we want to add another state here. Otherwise, we will have to modify each if self.state in ... which you duplicate.

Contributor Author

I disagree with this one for two reasons:

  • This call is in the suspend method, and this method can be assumed to take responsibility.
  • The list of states being checked here should never change. It will always be SUSPENDED and SUSPENDING.

Contributor

I got triggered because you have self.state in (SUSPENDING, SUSPEND) 3 or 4 times in this PR.

return "Megatron Dynamic Inference Server is running."

loop = asyncio.get_event_loop()
executor = ThreadPoolExecutor(max_workers=8192)
Contributor

There should be a constant for this.

Contributor Author

Addressed by reverting all changes to this file; #3648 will soon rewrite it anyway.

@tdene tdene enabled auto-merge March 4, 2026 19:50
@tdene tdene added this pull request to the merge queue Mar 4, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22688597937

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22692993863

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22695853007

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22701050201


Labels

Expert Review Apply this label to indicate that your PR is ready for expert review.

8 participants