
ActorDiedError during GRPO training with train_mcore_sglang_qwen3_next_grpo.sh (Ray worker exits with SYSTEM_ERROR / connection error code 2) #392

@yasu-nishi

Description

Describe the bug

While training with the provided script scripts/mcore_sglang/train_mcore_sglang_qwen3_next_grpo.sh, the run crashes with a Ray ActorDiedError. The error indicates that the worker process died with exit type SYSTEM_ERROR and "connection error code 2. End of file." The message lists possible root causes (an OOM kill, ray stop --force, or an unexpected crash such as SIGSEGV), but none of these were triggered by anything I did explicitly.

Full stack trace / logs

Traceback (most recent call last):
  File "/pai-mega/ChatLearn/chatlearn/entrypoint.py", line 121, in run
    self._run_algorithm(algo_args)
  File "/pai-mega/ChatLearn/chatlearn/entrypoint.py", line 103, in _run_algorithm
    instance.run()
  File "/pai-mega/ChatLearn/chatlearn/algorithm/grpo.py", line 351, in run
    engine.learn()
  File "/pai-mega/ChatLearn/chatlearn/runtime/engine.py", line 445, in learn
    self.setup()
  File "/pai-mega/ChatLearn/chatlearn/runtime/engine.py", line 295, in setup
    executor.update_models(self.remote_models)
  File "/pai-mega/ChatLearn/chatlearn/runtime/executor.py", line 106, in update_models
    dist_model.group_dist_actors_by_dp_rank()
  File "/pai-mega/ChatLearn/chatlearn/runtime/dist_actor.py", line 485, in group_dist_actors_by_dp_rank
    replica.group_dist_actors_by_dp_rank()
  File "/pai-mega/ChatLearn/chatlearn/runtime/dist_actor.py", line 147, in group_dist_actors_by_dp_rank
    self.data_parallel_size = future.get(self.all_actors[0].get_data_parallel_size.remote())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pai-mega/ChatLearn/chatlearn/utils/future.py", line 90, in get
    data = ray.get(data)
           ^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/workspace/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 102, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 433, in get
    res = self._get(to_get, op_timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 461, in _get
    raise err
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: MegatronPolicyTrainer
    actor_id: ZZZ
    pid: YYY
    name: ref_policy_replica
    namespace: ref_policy
    ip: XXX
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

How to reproduce

Launch with Slurm:
scripts/mcore_sglang/train_mcore_sglang_qwen3_next_grpo.sh

What I expected

Training to start and progress past engine setup without Ray actors dying.

What actually happened

During engine.setup() → executor.update_models() → group_dist_actors_by_dp_rank(), a Ray actor (MegatronPolicyTrainer) died and the job aborted with ActorDiedError.
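
For context, the failure boils down to a ray.get() on a method of an actor whose worker process has already exited. A minimal, self-contained sketch of that pattern (this is not the ChatLearn code; the Trainer class here is a placeholder, and it assumes a Ray version that exposes ray.exceptions.ActorDiedError, which the traceback above shows is the case):

```python
import ray

ray.init()


@ray.remote
class Trainer:
    # Placeholder standing in for MegatronPolicyTrainer.get_data_parallel_size()
    def get_data_parallel_size(self):
        return 1


actor = Trainer.remote()

try:
    # group_dist_actors_by_dp_rank() effectively blocks here: if the actor's
    # worker process has died (OOM kill, SIGSEGV, ...), ray.get() raises
    # ActorDiedError instead of returning the value.
    dp_size = ray.get(actor.get_data_parallel_size.remote())
except ray.exceptions.ActorDiedError as err:
    print(f"actor died before returning: {err}")
```

In the failing run the actor is already dead by the time this first remote call is awaited, which is why the error surfaces during engine setup rather than during training itself.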

Things I checked

・No manual ray stop --force was issued.
・No kill signals were sent deliberately.
・System logs did not show an obvious OOM kill around the time of the crash (one further way to check, via Ray's own actor records, is sketched below).
・GPU memory utilization was high shortly before the crash, but I could not confirm the peak value.
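
To help narrow down causes (1)–(3), Ray itself can be asked how the actor died. A rough sketch, assuming Ray 2.x with the state API (ray.util.state) available and run against the same cluster after the failure; I have not verified this exact snippet, and the detail fields (death_cause in particular) vary by Ray version:

```python
from ray.util.state import list_actors

# List actors known to the cluster and print the ones whose worker process
# died, together with whatever death information Ray recorded.
for actor in list_actors(detail=True):
    if actor.state == "DEAD":
        print(actor.class_name, actor.name, actor.state)
        # death_cause is only populated in detailed output and only for some
        # failure modes, hence the defensive getattr.
        print(getattr(actor, "death_cause", None))
```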
