@JenniferWang
Contributor

Summary:

tl;dr

Adds ForgeMonarchExecutor and ForgeWorkerWrapper to enable weight synchronization
via TorchStore for RL training loops (e.g., GRPO). Specifically, the diff serializes the TorchStore controller Actor and passes it to MonarchExecutor so that the controller can be shared.
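
The diff itself is not shown on this page, so the following is only a rough sketch of the general idea of sharing one controller by serializing a reference to it. Every name here (`ControllerHandle`, `_CONTROLLERS`, `trainer_publish`, `worker_load`) is hypothetical and does not correspond to the TorchStore or Monarch APIs. The registry is an in-process dictionary standing in for a real remote controller.

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in for the set of live controllers; a real system would
# reach the controller over the network rather than through a module global.
_CONTROLLERS: dict[str, dict[str, list[float]]] = {}


@dataclass(frozen=True)
class ControllerHandle:
    """Hypothetical serializable reference to a shared weight-store controller."""

    name: str

    def serialize(self) -> str:
        # Only the reference is serialized, never the stored weights.
        return json.dumps({"name": self.name})

    @classmethod
    def deserialize(cls, payload: str) -> "ControllerHandle":
        return cls(name=json.loads(payload)["name"])

    def put(self, key: str, value: list[float]) -> None:
        _CONTROLLERS.setdefault(self.name, {})[key] = value

    def get(self, key: str) -> list[float]:
        return _CONTROLLERS[self.name][key]


def trainer_publish(handle: ControllerHandle, step: int, weights: list[float]) -> str:
    """Trainer side: store new weights, then hand out the serialized handle."""
    handle.put(f"policy/v{step}", weights)
    return handle.serialize()


def worker_load(payload: str, step: int) -> list[float]:
    """Worker side: rebuild the handle and read from the same controller."""
    handle = ControllerHandle.deserialize(payload)
    return handle.get(f"policy/v{step}")


payload = trainer_publish(ControllerHandle("store-0"), step=1, weights=[0.1, 0.2])
weights = worker_load(payload, step=1)
```

The point of the sketch is that the worker never creates its own controller: it reconstructs a handle to the one the trainer already wrote to, so both sides see the same versioned keys.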

Test Plan

[-] Weight update correctness test: TORCHSTORE_RDMA_ENABLED=0 PYTHONPATH=. pytest -s tests/integration_tests/test_policy_update.py::TestWeightSync::test_sanity_check --config tests/integration_tests/fixtures/qwen3_1_7b_tp.yaml
[-] Local host: python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
[-] Remote host: https://www.internalfb.com/msl/studio/runs/mast/qwen3_1_7b_mast-cve6ce%3APRODUCTION%3A0/logs?attempt=0&taskGroups=trainer%3A0%2Cref_model_0%3A0%2Cgenerator_0%3A0%2Cclient%3A0&statusFilter=PENDING%2CRUNNING%2CCOMPLETE%2CFAILED%2CABANDONED%2CSTOPPING&logarithm=%7B%22after%22%3A10%2C%22before%22%3A20%7D

Next Steps

[ ] Implement the prefetch logic & shared memory
[ ] Add metrics similar to generator v0
[ ] Perf/throughput testing compared to generator v0

Differential Revision: D90775552

@meta-cla meta-cla bot added the CLA Signed label Jan 17, 2026
@meta-codesync

meta-codesync bot commented Jan 17, 2026

@JenniferWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90775552.

@JenniferWang JenniferWang linked an issue Jan 17, 2026 that may be closed by this pull request
2 tasks
facebook-github-bot pushed a commit that referenced this pull request Jan 21, 2026
facebook-github-bot pushed a commit that referenced this pull request Jan 21, 2026
facebook-github-bot pushed a commit that referenced this pull request Jan 21, 2026
facebook-github-bot pushed a commit that referenced this pull request Jan 23, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.40%. Comparing base (080770c) to head (e475eb3).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #710      +/-   ##
==========================================
- Coverage   78.33%   71.40%   -6.93%     
==========================================
  Files          36       41       +5     
  Lines        4209     4288      +79     
==========================================
- Hits         3297     3062     -235     
- Misses        912     1226     +314     
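
The figures in the table can be cross-checked with a few lines of arithmetic. This sketch assumes coverage is hits divided by lines, truncated (not rounded) to two decimals; the truncation is an inference from the reported 71.40%, not something the report states. The helper name `coverage_bp` is illustrative.

```python
def coverage_bp(hits: int, lines: int) -> int:
    """Coverage in basis points (hundredths of a percent), truncated."""
    return hits * 10_000 // lines


# Values copied from the Codecov table above.
base = {"lines": 4209, "hits": 3297, "misses": 912}
head = {"lines": 4288, "hits": 3062, "misses": 1226}

# Hits and misses should account for every coverable line.
base_consistent = base["hits"] + base["misses"] == base["lines"]
head_consistent = head["hits"] + head["misses"] == head["lines"]

base_bp = coverage_bp(base["hits"], base["lines"])  # 7833 -> 78.33%
head_bp = coverage_bp(head["hits"], head["lines"])  # 7140 -> 71.40%
delta_bp = head_bp - base_bp                        # -693 -> -6.93 points

print(f"base {base_bp / 100:.2f}%  head {head_bp / 100:.2f}%  delta {delta_bp / 100:+.2f}")
```

Working in integer basis points avoids floating-point surprises when comparing against the two-decimal values in the report.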

facebook-github-bot pushed a commit that referenced this pull request Jan 23, 2026

Labels

CLA Signed, fb-exported, meta-exported

Development

Successfully merging this pull request may close these issues.

[vLLM v0.13] Re-architect forge's integration with vLLM (generator.py)

3 participants