[recipe, VLA] feat: support isaac server mode for multitask libero config #4578

chenchaoxu7575 · 2025-12-17T15:26:39Z

What does this PR do?

Isaac Server Mode for Multi-Task LIBERO

This PR introduces a decoupled Isaac Lab simulation architecture that separates model inference (Gen) from physics simulation (Env), enabling efficient multi-task reinforcement learning with pipeline parallelism. The simulation is managed via Ray actors, supporting multi-node deployment.

Key Features:

IsaacServer - Ray actor that wraps Isaac Lab environment, runs on sim GPUs with multi-task support
IsaacServerManager - Manages multiple IsaacServers across stages and GPUs, handles task-to-server routing
EnvWorkerServer - Lightweight coordinator that routes actions to correct servers via traj_key mapping
TaskBalancedSampler - Ensures balanced task distribution in batches, with per-stage interleaving (tightly coupled with EnvWorkerServer stage assignment)
Pipeline-Parallel Rollout - Each stage has isolated servers, enabling GPU time-multiplexing between simulation and generation
Multi-Node Sim Support - Sim nodes can be distributed across multiple machines via Ray cluster

Test

Validated on LIBERO-10 benchmark with 10 manipulation tasks across 3 scenes (living room, kitchen, study).

Test Configuration:

2 SIM Nodes with 10 GPUs total for Isaac Simulation
1 TRAIN Node with 8 GPUs for model inference
2 Pipeline Stages (matching server groups)
10 tasks × 16 envs/task = 160 envs per stage
256×256 camera resolution

Results:

Successfully completed multi-task rollouts across all 10 LIBERO tasks
Pipeline overlap achieved between Gen and Sim
No environment state corruption between stages (verified via video recordings)
Ray timeline:

API and Usage Example

1. Start Ray Cluster (Multi-Node Setup):

# On head node (TRAIN NODE)
ray start --head --port=6379

# On SIM nodes (join cluster)
ray start --address=<head_node_ip>:6379 --resources='{"sim": 1}'

2. Run Training with Ray-managed Isaac Servers:

cd verl/recipe/vla

# Key configuration parameters
export NUM_TASKS=10               # LIBERO-10 tasks
export GROUP_SIZE=16              # Envs per task per stage
export STAGE_NUM=2                # Pipeline stages
export NUM_ISAAC_SERVERS=10       # Total sim GPUs
export SIM_NODES=2                # Number of sim nodes

# Launch training (servers are created automatically by Ray)
./run_simpleVLA_isaac_disagg_server.sh

3. Configuration in YAML:

env:
  train:
    isaac_server_mode: True       # Enable Ray actor mode
    num_isaac_servers: 10         # Servers per stage
    num_tasks: 10
    group_size: 16                # Envs per task
    total_trajs: 128              # Total trajectories for training
  rollout:
    pipeline_stage_num: 2
  disagg_sim:
    enable: True
    nnodes: 2                     # Number of sim nodes

4. Using TaskBalancedSampler in code:

from recipe.vla.workers.env import create_task_balanced_sampler

sampler = create_task_balanced_sampler(
    dataset=train_dataset,
    batch_size=32,
    max_per_task=16,      # <= GROUP_SIZE
    stage_num=2,          # Match pipeline stages
    seed=42,
)
# Note: Sampler's interleaving is tightly coupled with 
# EnvWorkerServer.reset_envs_to_state_ids() stage assignment

Design & Code Changes

Architecture Diagram:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Ray Cluster                                        │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                    TRAIN NODE (Gen - Model Inference)                  │  │
│  │  ┌──────────────┐         ┌─────────────────────────────┐              │  │
│  │  │  VeRL        │         │      EnvWorkerServer        │              │  │
│  │  │  Framework   │────────▶│   (Lightweight Coordinator) │              │  │
│  │  │  + EnvLoop   │         │   traj_key → env mapping    │              │  │
│  │  └──────────────┘         └──────────────┬──────────────┘              │  │
│  └───────────────────────────────────────────┼────────────────────────────┘  │
│                                              │                               │
│                    ┌─────────────────────────┼─────────────────────────┐     │
│                    │  IsaacServerManager     │                         │     │
│                    │  - Task → Server routing                          │     │
│                    │  - Batched step/reset                             │     │
│                    └─────────────────────────┼─────────────────────────┘     │
│                                              │ Ray Actor Calls               │
│  ════════════════════════════════════════════╪═══════════════════════════    │
│  ┌───────────────────────────────────────────┼───────────────────────────┐   │
│  │                 SIM NODES (Multi-Node Support)                        │   │
│  │                                           │                           │   │
│  │   Stage 0 Servers          Stage 1 Servers                            │   │
│  │   ┌─────────────────┐      ┌─────────────────┐                        │   │
│  │   │ IsaacServer 0   │      │ IsaacServer 0   │  (GPU 0, time-shared)  │   │
│  │   │ Tasks: 0-1      │      │ Tasks: 0-1      │                        │   │
│  │   ├─────────────────┤      ├─────────────────┤                        │   │
│  │   │ IsaacServer 1   │      │ IsaacServer 1   │  (GPU 1, time-shared)  │   │
│  │   │ Tasks: 2-3      │      │ Tasks: 2-3      │                        │   │
│  │   ├─────────────────┤      ├─────────────────┤                        │   │
│  │   │ ...             │      │ ...             │                        │   │
│  │   └─────────────────┘      └─────────────────┘                        │   │
│  └───────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Design & Code Changes

File Changes:

File	Description
`recipe/vla/isaac_server/isaac_server.py`	IsaacServer - Ray actor wrapping Isaac Lab environment
`recipe/vla/isaac_server/isaac_server_manager.py`	IsaacServerManager - manages servers across stages and GPUs
`recipe/vla/workers/env/env_worker_server.py`	EnvWorkerServer - lightweight coordinator with traj_key routing
`recipe/vla/workers/env/utils.py`	TaskBalancedSampler for per-stage task balancing
`recipe/vla/env_loop.py`	EnvLoop with stage-aware traj_key passing
`recipe/vla/config/rob_ppo_trainer.yaml`	Configuration for Isaac server mode
`recipe/vla/run_simpleVLA_isaac_disagg_server.sh`	Launch script for Ray-based Isaac server mode

Key Design Decisions:

Ray Actor Architecture: Isaac servers are Ray actors, enabling unified resource management across train and sim nodes. No manual server startup needed.
Multi-Node Sim Support: Sim nodes join Ray cluster with custom resource label (sim), allowing IsaacServers to be scheduled to appropriate nodes.
Traj-Env 1:1 Mapping: Each trajectory maps to exactly one sim env via traj_key, enabling flexible env deployment without group constraints.
Stage Isolation: Each pipeline stage has its own set of servers, physically isolated. Stages time-share GPUs (e.g., 2 stages → 0.5 GPU/server).
Coupled Stage Assignment: traj_idx % stage_num logic in reset_envs_to_state_ids() MUST match TaskBalancedSampler's interleaving.

Data Flow Example

batch_size=64, stage_num=2, num_tasks=10, group_size=16, servers=10

TaskBalancedSampler produces: [t0, t1, t2, t3, ..., t63] (interleaved by stage)
                               ↓    ↓   ↓   ↓
Stage assignment:              S0   S1  S0  S1  ...

Stage 0 gets: trajs [0,2,4,...,62] = 32 trajs → 32 sim envs
Stage 1 gets: trajs [1,3,5,...,63] = 32 trajs → 32 sim envs

Each traj gets a traj_key:
  traj_key="a1b2c3d4" → {env_index=5, task_id=3, stage_id=0, server_rank=0}

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: Isaac Lab requires GPU simulation environment which is not available in standard CI.
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

gemini-code-assist

Code Review

This pull request introduces a significant new feature, the Isaac Server Mode, which decouples simulation from the training loop. The architecture is well-designed, using ZMQ for communication and a multi-group server setup to support pipeline parallelism. The code is generally well-structured, with clear separation of concerns between the client, server, and env loop logic. However, I've found a few critical issues related to handling edge cases, specifically when a batch contains zero environments or trajectories. These cases can lead to unhandled exceptions and crash the application. Addressing these will make the implementation much more robust.

recipe/vla/env_loop_server.py

recipe/vla/workers/env/env_worker_server.py

recipe/vla/isaac_server/server.py

CLAassistant · 2025-12-17T16:41:31Z

All committers have signed the CLA.

chenchaoxu7575 · 2025-12-23T10:07:41Z

@chenhaiq @HanlinDu
Please review this PR.

- Add IsaacServer and IsaacServerManager for Ray-based Isaac Lab simulation - Add EnvWorkerServer as lightweight adapter for Isaac server mode - Add TaskBalancedSampler to ensure per-task env capacity is not exceeded - Update env_loop to support pipeline stages with traj-env mapping - Add run script for isaac server mode (run_simpleVLA_isaac_disagg_server.sh)

Upstream changed compute_log_prob return type from tuple to dict in verl-project#4678. Update RobDataParallelPPOActor.compute_log_prob to match the new interface.

- Remove redundant camera_height/camera_width from config (use init_params) - Update shell script to use init_params.camera_heights/widths - Rename USE_RAY_ACTORS to ISAAC_SERVER_MODE for consistency - Update env_worker_server to read camera config from init_params - Add isaac_server_mode check in env_loop for 1:1 traj-env mapping - Restore code order in rob_ray_trainer

chenchaoxu7575 · 2026-01-08T13:07:29Z

Isaac Servers are managed by Ray now.
For those interested in TCP/ZMQ Isaac Server implementation(before migrating to Ray Actor), please refer to the backup branch up to commit d455059: https://github.com/chenchaoxu7575/verl/commits/vla-intg-isaac-server-multitask-backup/

chenhaiq

Can you add some unit tests? for example: 1. integration for EnvLoop+Isaac

chenhaiq · 2025-12-23T11:16:59Z

recipe/vla/env_loop_server.py

+        reset_results = ray.get(reset_future[0])
+
+        # Debug: print reset_results structure
+        print(f"[DEBUG reset_results] type: {type(reset_results)}", flush=True)


please use logger.debug

chenhaiq · 2026-01-19T07:11:02Z

verl/experimental/vla/isaac_server/isaac_server.py

+        logger.info(f"[Stage {self.stage_id} Actor {self.actor_rank}] Initializing Isaac environment: {self.env_id}")
+
+        # Detect GPU
+        num_gpus = torch.cuda.device_count()


AssertionError: file /home/runner/work/verl/verl/verl/experimental/vla/isaac_server/isaac_server.py contains .cuda/"cuda"/"nccl" usage, please use api in verl/utils/device.py directly.

chenhaiq · 2026-01-19T07:17:28Z

verl/experimental/vla/isaac_server/isaac_server.py

+
+    # Use print to ensure visibility in Ray logs
+    cleared_msg = " (cleared)" if clear_cache else ""
+    print(f"[Stage {stage_id} Rank {server_rank}] Cache directories configured{cleared_msg}:", flush=True)


please use logger instead of print

chenhaiq · 2026-01-19T07:43:01Z

verl/experimental/vla/config/rob_ppo_trainer.yaml

      video_base_dir: /tmp/videos
    num_envs: 16
    seed: 42
    task_suite_name: libero_10


please add into verl/experimental/vla/readme.md about how to config isaas server mode

chenhaiq · 2026-01-19T07:46:43Z

verl/experimental/vla/isaac_server/isaac_server.py

+                if is_last_chunk:
+                    self.env.unwrapped.cfg.sim.render_interval = original_render_interval
+                else:
+                    self.env.unwrapped.cfg.sim.render_interval = 999999


please add comment about why it is 999999

chenhaiq · 2026-01-19T07:51:20Z

verl/experimental/vla/workers/env/utils.py

+
+        # When stage_num > 1, each stage gets batch_size/stage_num samples
+        # and each stage has its own max_per_task constraint
+        self.samples_per_stage = batch_size // stage_num


please assert batch_size % stage_num == 0

chenhaiq · 2026-01-19T10:06:25Z

verl/experimental/vla/run_simpleVLA_isaac_disagg_server.sh

+
+SAVE_VIDEO=False
+
+export PYTHONRECURSIONLIMIT=10000


why we need PYTHONRECURSIONLIMIT?

HanlinDu · 2026-01-19T10:44:44Z

verl/experimental/vla/env_loop.py

+            trajectory_chunks = data_proto.chunk(self.total_trajs)
+        else:
+            # Local mode: each trajectory has num_envs_per_worker envs
+            num_trajectories = self.total_trajs // self.num_envs_per_worker


Maybe we should add an assertion here to check for divisibility. Currently, the TensorDict.chunk() method used by DataProto does not perform such a check and simply uses integer division.

HanlinDu · 2026-01-19T11:16:44Z

verl/experimental/vla/isaac_server/isaac_server.py

+        logger.info(f"[Stage {self.stage_id} Actor {self.actor_rank}] Visible GPUs: {num_gpus}, using {self.device}")
+
+        # Import Isaac Lab components - follow IsaacEnv pattern exactly
+        import gymnasium as gym


Please place these imports together at the beginning of the file, unless they are conditional imports.

- Add module-level comments explaining Isaac Sim import order requirements - Import torch inside methods after AppLauncher initialization (cached, no overhead) - Replace print() with logger calls in isaac_server and env_worker_server - Add divisibility checks for isaac_server_mode in env_loop and utils - Fix camera parameter retrieval from init_params in env_worker_server - Update PYTHONRECURSIONLIMIT comment with TODO for verification - Simplify render_interval logic using decimation parameter - Correct variable naming from 'chunk' to 'action' in _handle_chunk_step

gemini-code-assist bot reviewed Dec 17, 2025

View reviewed changes

recipe/vla/env_loop_server.py Outdated Show resolved Hide resolved

recipe/vla/workers/env/env_worker_server.py Outdated Show resolved Hide resolved

recipe/vla/isaac_server/server.py Outdated Show resolved Hide resolved

chenchaoxu7575 force-pushed the vla-intg-isaac-server-multitask branch from ea1e8d3 to 4e86cb7 Compare December 22, 2025 07:16

chenchaoxu7575 changed the title ~~[WIP][recipe, VLA] feat: support isaac server mode for multitask libero config~~ [recipe, VLA] feat: support isaac server mode for multitask libero config Dec 23, 2025

chenchaoxu7575 marked this pull request as ready for review December 23, 2025 10:06

chenchaoxu7575 force-pushed the vla-intg-isaac-server-multitask branch from 6205f58 to aa06b72 Compare January 8, 2026 10:33

chenchaoxu7575 added 2 commits January 8, 2026 20:57

fix(vla): adapt compute_log_prob to return dict format

17e1dd3

Upstream changed compute_log_prob return type from tuple to dict in verl-project#4678. Update RobDataParallelPPOActor.compute_log_prob to match the new interface.

chenhaiq self-requested a review January 19, 2026 07:38

chenhaiq reviewed Jan 19, 2026

View reviewed changes

HanlinDu reviewed Jan 19, 2026

View reviewed changes

[recipe, VLA] feat: support isaac server mode for multitask libero config #4578

Are you sure you want to change the base?

[recipe, VLA] feat: support isaac server mode for multitask libero config #4578

Conversation

chenchaoxu7575 commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Test

API and Usage Example

Design & Code Changes

Design & Code Changes

Data Flow Example

Checklist Before Submitting

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CLAassistant commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chenchaoxu7575 commented Dec 23, 2025

Uh oh!

chenchaoxu7575 commented Jan 8, 2026

Uh oh!

chenhaiq left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chenchaoxu7575 commented Dec 17, 2025 •

edited

Loading

CLAassistant commented Dec 17, 2025 •

edited

Loading