Conversation

@JenniferWang

Summary:

## tl;dr

Add tracer in v1 to log perf metrics to wandb

## V0 vs V1 Metrics Parity Comparison

| Category | v0 Metric | v1 Metric | Parity |
|----------|-----------|-----------|--------|
| **Generate - Request Count** | `generator/generate/count_requests` (SUM) | `generator/generate/count_requests` (SUM) | ✅ Same |
| **Generate - Completion Count** | `generator/generate/count_sequences_completed` (SUM) | `generator/generate/count_sequences_completed` (SUM) | ✅ Same |
| **Generate - E2E Timing** | `generator_perf/generate/*` (Tracer, GPU) | `generator_perf/generate/*` (Tracer, GPU) | ✅ Same |
| **Update - Pending Requests** | `generator_perf/update_weights/sum_pending_gen_requests` (SUM) | N/A - AsyncLLM handles internally | ⚠️ Skip (by design) |
| **Update - Wait for Generation** | `generator_perf/update_weights/avg_waiting_for_generation_duration_s` (MEAN) | `generator_perf/update_weights/pause_generation_duration_s` (MEAN) | ✅ Equivalent - renamed for clarity |
| **Update - Fetch Weights** | `generator_perf/update_weights/wait_fetch_weights` (MEAN) | `generator_perf/update_weights/worker_load_weights_duration_s` (MEAN) | ✅ Equivalent - renamed for clarity |
| **Worker - Update Timing** | `generator_perf/update_weights/generator_worker_update/*` (trace, GPU) | `generator_perf/update_weights/generator_worker_update/*` (trace, GPU) | ✅ Same |

## Test Plan

Main GRPO app: `python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml`

```
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run drawn-waterfall-686
wandb: ⭐️ View project at https://meta.wandb.io/jiyue/grpo-training
wandb: 🚀 View run at https://meta.wandb.io/jiyue/grpo-training/runs/6pltx38p
wandb: Detected [openai] in use.
....
rvability.metric_actors.GlobalLoggingActor global_logger>] === [global_reduce] - METRICS STEP 1 ===
  ...
  generator/generate/count_requests: 13.0
  generator/generate/count_sequences_completed: 96.0
  generator_perf/generate/total_duration_avg_s: 3.6518315022786463
  generator_perf/generate/total_duration_max_s: 9.2080615234375
  generator_perf/update_weights/pause_generation_duration_s: 2.8634108749683946
  generator_perf/update_weights/resume_generation_duration_s: 1.918897032737732e-05
  generator_perf/update_weights/worker_load_weights_duration_s: 3.506648204056546
  ...
```

Make sure integration tests that do not initialize the tracer still work:
`pytest tests/integration_tests/test_generator_lifecycle.py -v -s`

## Next Steps

[ ] Implement the prefetch logic & shared memory
[-] Add metrics similar to generator v0
[ ] Perf/Throughput testing compared to generator v0

Differential Revision: D91038187


meta-codesync bot commented Jan 21, 2026

@JenniferWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91038187.

meta-cla bot added the CLA Signed label on Jan 21, 2026
@JenniferWang linked an issue on Jan 21, 2026 that may be closed by this pull request
Summary:

## Summary

This diff introduces vLLM v1 integration for forge & Monarch, targeting vLLM versions > 0.13.0.

Functionality-wise, this diff implements:
  - Single-node TP (unoptimized, TCP-based proc communication)
  - Multi-node TP (same TCP mechanism)

Pending work (next diffs in the stack), focusing first on single-node TP:
  - Unix socket-based communication (instead of TCP)
  - Weight sync integration
  - Logging integration

After that, we can introduce Pipeline Parallelism:
  - Extend executor to capture stage graph (DAG-like execution pattern)

## Decision 1: Integration Layer -- `AsyncLLM`

We integrate at the AsyncLLM layer (https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), which sits higher in the stack than our v0 approach of disassembling EngineCore and integrating at the Worker level. We pick this layer for two main reasons:
1. Reduced maintenance cost: vLLM v1 refactored internals significantly (new EngineCore, Scheduler, KVCacheManager). Integrating at AsyncLLM isolates us from these changes -- we only need to implement the Executor interface, not patch internal scheduling or memory management.
2. Better fit for agentic RL: the offline `LLM` class batches requests synchronously via `llm.generate([prompts])`, whereas AsyncLLM exposes an async generator interface (`async for output in llm.generate(prompt)`) that supports streaming, priority scheduling, and the concurrent request handling required for online RL rollouts (see the sketch below).
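
For illustration, a minimal sketch of the async interface we rely on. The model name, engine args, and `request_id` plumbing are placeholders, and exact import paths/signatures vary across vLLM versions:

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM


async def rollout(engine: AsyncLLM, prompt: str) -> str:
    text = ""
    # generate() is an async generator: outputs stream back as they are produced,
    # so many rollouts can run concurrently instead of being batched up front.
    async for output in engine.generate(
        prompt, SamplingParams(max_tokens=128), request_id=str(uuid.uuid4())
    ):
        text = output.outputs[0].text
    return text


async def main() -> None:
    engine = AsyncLLM.from_engine_args(AsyncEngineArgs(model="Qwen/Qwen3-1.7B"))
    # Concurrent rollouts -- the pattern online RL needs, which the offline
    # llm.generate([prompts]) batch API does not give us.
    print(await asyncio.gather(*(rollout(engine, p) for p in ["a", "b", "c"])))
```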

## Decision 2: Extension Points -- Executor + WorkerWrapperBase
| **Class**              | **Base Class**                                         | **Location**            | **Purpose**                                                                          |
|------------------------|--------------------------------------------------------|-------------------------|--------------------------------------------------------------------------------------|
| MonarchExecutor        | vllm.v1.executor.abstract.Executor                     | monarch_executor.py     | Creates ProcMesh from HostMesh, spawns workers, manages collective_rpc() dispatch.  |
| WorkerWrapper          | vllm.v1.worker.worker_base.WorkerWrapperBase + Actor   | monarch_executor.py     | Dual-inheritance wrapper exposing vLLM worker methods as Monarch endpoints.         |
| ForgeMonarchExecutor (next diff)   | MonarchExecutor                                        | forge_executor.py       | Extends executor with TorchStore Controller handling for weight updates.            |
| ForgeWorkerWrapper (next diff)     | WorkerWrapper                                          | forge_executor.py       | Extends worker with TorchStore weight loading capabilities.                         |
| Generator              | ForgeActor                                             | generator.py            | Forge-specific orchestration: provisions hosts, allocates GPUs, manages AsyncLLM.   |

**`MonarchExecutor` and `WorkerWrapper` are designed to be upstreamed to vLLM alongside the existing `RayDistributedExecutor`, enabling Monarch as a first-class distributed backend.**
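
A rough sketch of the dual-inheritance pattern, assuming Monarch's `Actor`/`@endpoint` surface and vLLM's `WorkerWrapperBase.execute_method`; the endpoint name here is hypothetical:

```python
from monarch.actor import Actor, endpoint
from vllm.v1.worker.worker_base import WorkerWrapperBase


class WorkerWrapper(WorkerWrapperBase, Actor):
    """Exposes vLLM worker methods as Monarch endpoints on the GPU ProcMesh."""

    @endpoint
    async def execute(self, method: str, *args, **kwargs):
        # MonarchExecutor.collective_rpc() fans out to this endpoint on every
        # worker; execute_method dispatches to the wrapped vLLM Worker instance.
        return self.execute_method(method, *args, **kwargs)
```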


## Decision 3: Executor Owns the Worker Lifecycle

The architecture aligns closely with vLLM's Ray executor pattern, where:
- **Caller (Generator) owns HostMesh**: Resource allocation (hosts, GPU IDs)
- **Executor owns ProcMesh + Workers**: Execution lifecycle

```
    ┌───────────────────────────────────────────────────────────────────────┐
    │                              Host Mesh                                │
    │                                                                       │
    │  ┌─────────────────────────────────────────────────────────────────┐  │
    │  │ Caller process                                                  │  │
    │  │                                                                 │  │
    │  │  ┌─────────────────────┐       ┌─────────────────────────────┐  │  │
    │  │  │ AsyncLLM            │       │ WorkerRegistry (actor)      │  │  │
    │  │  └─────────────────────┘       └─────────────────────────────┘  │  │
    │  │            │                                                    │  │
    │  │            │ serialize host_mesh & registry to env vars         │  │
    │  │            ▼                                                    │  │
    │  │  ┌───────────────────────────────────────────────────────────┐  │  │
    │  │  │ EngineCore subprocess                                     │  │  │
    │  │  │                                                           │  │  │
    │  │  │ MonarchExecutor                                           │  │  │
    │  │  │   ├── deserialize host_mesh                               │  │  │
    │  │  │   ├── create proc_mesh from host_mesh (owns lifecycle) ───│──│──│──┐
    │  │  │   ├── spawn worker actors on proc_mesh                    │  │  │  │
    │  │  │   └── register workers in WorkerRegistry                  │  │  │  │
    │  │  └───────────────────────────────────────────────────────────┘  │  │  │
    │  └─────────────────────────────────────────────────────────────────┘  │  │
    │                                                                       │  │
    │  ┌─────────────────────────────────────────────────────────────────┐  │  │
    │  │ GPU ProcMesh (owned by MonarchExecutor)                         │  │  │
    │  │                                                                 │  │  │
    │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │  │  │
    │  │  │ Worker 0 │  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  ... ◀──│──│──┘
    │  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘         │  │
    │  │                   ◀──── NCCL (tensor parallel) ────▶            │  │
    │  └─────────────────────────────────────────────────────────────────┘  │
    └───────────────────────────────────────────────────────────────────────┘
```

**Design**: The caller owns host_mesh (resource allocation); the executor owns proc_mesh + workers (execution). This mirrors vLLM's Ray executor pattern. Since we want to collocate the Generator Actor with the worker host mesh, it is easier to keep host mesh ownership with the caller.

**WorkerRegistry** bridges the process boundary -- MonarchExecutor (in subprocess) registers workers there, Generator queries it after AsyncLLM initialization.
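
As a sketch of how that bridging can work (the env var name and helper functions below are purely illustrative, not the actual implementation): the caller pickles the handle into an environment variable before AsyncLLM forks the EngineCore subprocess, and MonarchExecutor unpickles it on the other side.

```python
import base64
import os
import pickle

_REGISTRY_ENV = "FORGE_WORKER_REGISTRY"  # hypothetical variable name


def export_handle(handle) -> None:
    """Caller side: stash a picklable actor/mesh handle before EngineCore is spawned."""
    os.environ[_REGISTRY_ENV] = base64.b64encode(pickle.dumps(handle)).decode("ascii")


def import_handle():
    """Executor side (inside the EngineCore subprocess): recover the handle."""
    return pickle.loads(base64.b64decode(os.environ[_REGISTRY_ENV]))
```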

**Executor Cleanup Responsibility**:
Since MonarchExecutor creates proc_mesh from host_mesh, it owns the cleanup (sketched below):
1. `MonarchExecutor.shutdown()` destroys process groups on the workers (prevents NCCL errors)
2. It then stops the proc_mesh
3. `Generator.shutdown()` only needs to stop generator_proc
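
A sketch of that ordering (`destroy_process_group` as a worker RPC and `proc_mesh.stop()` are assumptions about the shape of the code, not the exact API):

```python
from vllm.v1.executor.abstract import Executor


class MonarchExecutor(Executor):
    def shutdown(self) -> None:
        # 1. Tear down NCCL process groups on every worker first, so no rank is
        #    left blocking on a collective against a peer that already exited.
        self.collective_rpc("destroy_process_group")  # hypothetical worker method
        # 2. Only then stop the ProcMesh this executor created from the caller's
        #    HostMesh -- the executor owns it, so the executor cleans it up.
        self.proc_mesh.stop()
        # Generator.shutdown() never touches proc_mesh; it only stops generator_proc.
```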


## Limitations

- **TP**: Supported (single-node and multi-node)
- **PP**: NOT supported (would require DAG-like execution pattern)
- Shared memory cache (`mm_processor_cache_type='shm'`) not supported
- Symmetric memory all-reduce disabled (`VLLM_ALLREDUCE_USE_SYMM_MEM=0`)

## Test Plan
[-] Resource / Lifecycle: `pytest tests/integration_tests/test_generator_lifecycle.py -v -s`
[-] Single-node TP local benchmark throughput test, verifying vLLM instantiation on the local host: `python -m benchmarks.generator.throughput --config apps/grpo/qwen3_1_7b.yaml benchmark.num_requests=10 benchmark.dataset=fixed benchmark.fixed_prompt="Tell me a joke" benchmark.num_samples=5`
[-] Single-node TP MAST benchmark throughput test, verifying vLLM instantiation on a remote host: https://www.internalfb.com/msl/studio/runs/mast/qwen3_1_7b_mast-eh7o6d%3APRODUCTION%3A0/logs?attempt=0&taskGroups=client%3A0&statusFilter=PENDING%2CRUNNING%2CCOMPLETE%2CFAILED%2CABANDONED%2CSTOPPING&logarithm=%7B%22after%22%3A10%2C%22before%22%3A20%7D
[-] Multi-node (TP) MAST benchmark throughput test: https://www.internalfb.com/msl/studio/runs/mast/qwen3_1_7b_multinode_test-gr8aes%3APRODUCTION%3A0/logs?attempt=0&taskGroups=client%3A0&statusFilter=PENDING%2CRUNNING%2CCOMPLETE%2CFAILED%2CABANDONED%2CSTOPPING&logarithm=%7B%22after%22%3A10%2C%22before%22%3A20%7D

Reviewed By: allenwang28

Differential Revision: D90280578
Summary:

## tl;dr
Adds ForgeMonarchExecutor and ForgeWorkerWrapper to enable weight synchronization
via TorchStore for RL training loops (e.g., GRPO). Specifically, the diff serializes the TorchStore controller Actor to MonarchExecutor so the controller can be shared.

## Test Plan
[-] Weight update correctness test: `TORCHSTORE_RDMA_ENABLED=0 PYTHONPATH=. pytest -s tests/integration_tests/test_policy_update.py::TestWeightSync::test_sanity_check --config tests/integration_tests/fixtures/qwen3_1_7b_tp.yaml`
[-] Local host: `python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml`
[-] Remote host: https://www.internalfb.com/msl/studio/runs/mast/qwen3_1_7b_mast-cve6ce%3APRODUCTION%3A0/logs?attempt=0&taskGroups=trainer%3A0%2Cref_model_0%3A0%2Cgenerator_0%3A0%2Cclient%3A0&statusFilter=PENDING%2CRUNNING%2CCOMPLETE%2CFAILED%2CABANDONED%2CSTOPPING&logarithm=%7B%22after%22%3A10%2C%22before%22%3A20%7D

## Next Steps
[ ] Implement the prefetch logic & shared memory
[ ] Add metrics similar to generator v0
[ ] Perf/Throughput testing compared to generator v0

Differential Revision: D90775552
Summary:

## tl;dr
Add tracer in v1 to log perf metrics to wandb 

## V0 vs V1 Metrics Parity Comparison

| Category | v0 Metric | v1 Metric | Parity |
|----------|-----------|-----------|--------|
| **Generate - Request Count** | `generator/generate/count_requests` (SUM) | `generator/generate/count_requests` (SUM) | ✅ Same |
| **Generate - Completion Count** | `generator/generate/count_sequences_completed` (SUM) | `generator/generate/count_sequences_completed` (SUM) | ✅ Same |
| **Generate - E2E Timing** | `generator_perf/generate/*` (Tracer, GPU) | `generator_perf/generate/*` (Tracer, GPU) | ✅ Same |
| **Update - Pending Requests** | `generator_perf/update_weights/sum_pending_gen_requests` (SUM) | N/A - AsyncLLM handles internally | ⚠️ Skip (by design) |
| **Update - Wait for Generation** | `generator_perf/update_weights/avg_waiting_for_generation_duration_s` (MEAN) | `generator_perf/update_weights/pause_generation_duration_s` (MEAN) | ✅ Equivalent - renamed for clarity |
| **Update - Fetch Weights** | `generator_perf/update_weights/wait_fetch_weights` (MEAN) | `generator_perf/update_weights/worker_load_weights_duration_s` (MEAN) | ✅ Equivalent - renamed for clarity |
| **Worker - Update Timing** | `generator_perf/update_weights/generator_worker_update/*` (trace, GPU) | `generator_perf/update_weights/generator_worker_update/*` (trace, GPU) | ✅ Same |
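
To make the renamed timing metrics concrete, here is a minimal sketch of the measurement pattern; the `timed` helper and the metrics dict are illustrative, not forge's actual tracer API:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name: str, sink: dict[str, float]):
    start = time.perf_counter()
    try:
        yield
    finally:
        # MEAN-reduced duration, flushed to wandb at each global_reduce step.
        sink[name] = time.perf_counter() - start


metrics: dict[str, float] = {}
with timed("generator_perf/update_weights/pause_generation_duration_s", metrics):
    ...  # pause generation (drain in-flight AsyncLLM requests) before loading weights
```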

## Test Plan

Main GRPO app: `python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml`

```
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run drawn-waterfall-686
wandb: ⭐️ View project at https://meta.wandb.io/jiyue/grpo-training
wandb: 🚀 View run at https://meta.wandb.io/jiyue/grpo-training/runs/6pltx38p
wandb: Detected [openai] in use.
....
rvability.metric_actors.GlobalLoggingActor global_logger>] === [global_reduce] - METRICS STEP 1 ===
  ...
  generator/generate/count_requests: 13.0
  generator/generate/count_sequences_completed: 96.0
  generator_perf/generate/total_duration_avg_s: 3.6518315022786463
  generator_perf/generate/total_duration_max_s: 9.2080615234375
  generator_perf/update_weights/pause_generation_duration_s: 2.8634108749683946
  generator_perf/update_weights/resume_generation_duration_s: 1.918897032737732e-05
  generator_perf/update_weights/worker_load_weights_duration_s: 3.506648204056546
  ...
```


Make sure integration tests that do not initialize the tracer still work:
`pytest tests/integration_tests/test_generator_lifecycle.py -v -s`


## Next Steps
[ ] Implement the prefetch logic & shared memory
[-] Add metrics similar to generator v0
[ ] Perf/Throughput testing compared to generator v0

Differential Revision: D91038187

@allenwang28 left a comment


Review automatically exported from Phabricator review in Meta.

facebook-github-bot pushed a commit that referenced this pull request Jan 23, 2026
facebook-github-bot pushed a commit that referenced this pull request Jan 23, 2026
Successfully merging this pull request may close these issues.

[vLLM v0.13] Re-architect forge's integration with vLLM (generator.py)
