Add TorchStore weight sync to Generator v1 #710

JenniferWang · 2026-01-17T01:17:50Z

Summary:

tl;dr

Adds ForgeMonarchExecutor and ForgeWorkerWrapper to enable weight synchronization
via TorchStore for RL training loops (e.g., GRPO). Specifically, the diff serializes the TorchStore controller Actor to MonarchExecutor so that the controller can be shared.
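
The controller-sharing idea can be illustrated with a minimal, single-process sketch. It does not use the TorchStore or Monarch APIs: `ControllerHandle`, `Worker`, `_STORES`, and the `v{version}/{name}` key layout are all hypothetical names chosen for illustration, and JSON stands in for whatever serialization the diff actually uses. The only point is that a small serialized reference to one shared controller is handed to the worker, which then pulls updated weights through it.

```python
import json

# Stand-in for the storage backend. In the PR this role is played by TorchStore;
# a module-level dict keeps the sketch runnable in a single process.
_STORES: dict[str, dict[str, list[float]]] = {}


class ControllerHandle:
    """Hypothetical lightweight reference to a shared store controller."""

    def __init__(self, store_id: str) -> None:
        self.store_id = store_id

    def to_bytes(self) -> bytes:
        # Only the identifier is serialized, never the stored weights.
        return json.dumps({"store_id": self.store_id}).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload: bytes) -> "ControllerHandle":
        return cls(json.loads(payload.decode("utf-8"))["store_id"])

    def put(self, key: str, value: list[float]) -> None:
        _STORES.setdefault(self.store_id, {})[key] = list(value)

    def get(self, key: str) -> list[float]:
        return list(_STORES[self.store_id][key])


class Worker:
    """Hypothetical generator worker that rebuilds the handle it was sent."""

    def __init__(self, handle_bytes: bytes) -> None:
        self.handle = ControllerHandle.from_bytes(handle_bytes)
        self.weights: dict[str, list[float]] = {}

    def update_weights(self, version: int, names: list[str]) -> None:
        # Pull each parameter published under the given policy version.
        for name in names:
            self.weights[name] = self.handle.get(f"v{version}/{name}")


def demo() -> dict[str, list[float]]:
    trainer_handle = ControllerHandle("policy")
    # In a real deployment these bytes would cross a process boundary.
    worker = Worker(trainer_handle.to_bytes())
    trainer_handle.put("v1/layer.weight", [0.1, 0.2])  # trainer publishes weights
    worker.update_weights(1, ["layer.weight"])  # worker syncs to version 1
    return worker.weights


if __name__ == "__main__":
    print(demo())
```

Because the trainer and the worker resolve the same `store_id`, both sides talk to one store instead of each creating its own, which is the property the shared controller is meant to provide.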

Test Plan

[-] Weight update correctness test: `TORCHSTORE_RDMA_ENABLED=0 PYTHONPATH=. pytest -s tests/integration_tests/test_policy_update.py::TestWeightSync::test_sanity_check --config tests/integration_tests/fixtures/qwen3_1_7b_tp.yaml`
[-] Local host: `python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml`
[-] Remote host: https://www.internalfb.com/msl/studio/runs/mast/qwen3_1_7b_mast-cve6ce%3APRODUCTION%3A0/logs?attempt=0&taskGroups=trainer%3A0%2Cref_model_0%3A0%2Cgenerator_0%3A0%2Cclient%3A0&statusFilter=PENDING%2CRUNNING%2CCOMPLETE%2CFAILED%2CABANDONED%2CSTOPPING&logarithm=%7B%22after%22%3A10%2C%22before%22%3A20%7D

Next Steps

[ ] Implement the prefetch logic and shared memory
[ ] Add metrics similar to generator v0
[ ] Perf/throughput testing compared to generator v0

Differential Revision: D90775552

meta-codesync · 2026-01-17T01:17:57Z

@JenniferWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90775552.


codecov-commenter · 2026-01-23T18:26:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.40%. Comparing base (080770c) to head (e475eb3).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #710      +/-   ##
==========================================
- Coverage   78.33%   71.40%   -6.93%     
==========================================
  Files          36       41       +5     
  Lines        4209     4288      +79     
==========================================
- Hits         3297     3062     -235     
- Misses        912     1226     +314


meta-cla bot added the CLA Signed label Jan 17, 2026

meta-codesync bot added the fb-exported and meta-exported labels Jan 17, 2026

JenniferWang linked an issue Jan 17, 2026 that may be closed by this pull request

[vLLM v0.13] Re-architect forge's integration with vLLM (generator.py) #669

Closed

2 tasks

facebook-github-bot force-pushed the export-D90775552 branch from 9c6905e to f4c88c2 on January 21, 2026 13:57

facebook-github-bot force-pushed the export-D90775552 branch from f4c88c2 to e475eb3 on January 23, 2026 18:14
