Describe the bug
Training with the provided script scripts/mcore_sglang/train_mcore_sglang_qwen3_next_grpo.sh crashes during engine setup with a Ray ActorDiedError. The error reports that the worker process died with exit type SYSTEM_ERROR and "connection error code 2. End of file." The message lists three possible root causes (SIGKILL from the OOM killer, ray stop --force, or an unexpected crash such as SIGSEGV), but I did not explicitly trigger any of them.
Full stack trace / logs
Traceback (most recent call last):
  File "/pai-mega/ChatLearn/chatlearn/entrypoint.py", line 121, in run
    self._run_algorithm(algo_args)
  File "/pai-mega/ChatLearn/chatlearn/entrypoint.py", line 103, in _run_algorithm
    instance.run()
  File "/pai-mega/ChatLearn/chatlearn/algorithm/grpo.py", line 351, in run
    engine.learn()
  File "/pai-mega/ChatLearn/chatlearn/runtime/engine.py", line 445, in learn
    self.setup()
  File "/pai-mega/ChatLearn/chatlearn/runtime/engine.py", line 295, in setup
    executor.update_models(self.remote_models)
  File "/pai-mega/ChatLearn/chatlearn/runtime/executor.py", line 106, in update_models
    dist_model.group_dist_actors_by_dp_rank()
  File "/pai-mega/ChatLearn/chatlearn/runtime/dist_actor.py", line 485, in group_dist_actors_by_dp_rank
    replica.group_dist_actors_by_dp_rank()
  File "/pai-mega/ChatLearn/chatlearn/runtime/dist_actor.py", line 147, in group_dist_actors_by_dp_rank
    self.data_parallel_size = future.get(self.all_actors[0].get_data_parallel_size.remote())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pai-mega/ChatLearn/chatlearn/utils/future.py", line 90, in get
    data = ray.get(data)
           ^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/workspace/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 102, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 433, in get
    res = self._get(to_get, op_timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 461, in _get
    raise err
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: MegatronPolicyTrainer
    actor_id: ZZZ
    pid: YYY
    name: ref_policy_replica
    namespace: ref_policy
    ip: XXX
The actor is dead because its worker process has died.
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
There are some potential root causes.
(1) The process is killed by SIGKILL by OOM killer due to high memory usage.
(2) ray stop --force is called.
(3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
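To check whether other actors died at the same time, something like the following can be run against the still-live cluster. This is only a sketch: it assumes the Ray version here ships the state API (ray.util.state.list_actors) and that death_cause is populated when detail=True.

import ray
from ray.util.state import list_actors

# Connect to the existing cluster (assumes it is still up after the crash).
ray.init(address="auto")

# detail=True is required for death_cause in the state API.
for actor in list_actors(detail=True, limit=1000):
    if actor.state == "DEAD":
        print(actor.ray_namespace, actor.name, actor.class_name, actor.pid)
        print("  death_cause:", actor.death_cause)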
Launch with Slurm
scripts/mcore_sglang/train_mcore_sglang_qwen3_next_grpo.sh
What I expected
Training to start and progress past engine setup without Ray actors dying.
What actually happened
During engine.setup() → executor.update_models() → group_dist_actors_by_dp_rank(), the MegatronPolicyTrainer actor backing ref_policy_replica died, and the job aborted with ActorDiedError.
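A minimal way to isolate that call is sketched below. It assumes the actor name and namespace from the error message (ref_policy_replica / ref_policy) are resolvable via ray.get_actor; get_data_parallel_size is the same remote method the setup path invokes in dist_actor.py.

import ray
from ray.exceptions import ActorDiedError

ray.init(address="auto", namespace="ref_policy")

# Actor name and namespace taken from the ActorDiedError metadata above.
actor = ray.get_actor("ref_policy_replica")
try:
    # Same remote method that group_dist_actors_by_dp_rank calls during setup.
    dp_size = ray.get(actor.get_data_parallel_size.remote(), timeout=60)
    print("data_parallel_size:", dp_size)
except ActorDiedError as err:
    print("actor already dead before the call:", err)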
Things I checked
・No manual ray stop --force.
・No deliberate kill signals sent.
・System logs didn’t show an obvious OOM kill at the time.
・GPU memory utilization was close to full shortly before the crash, but I could not capture the peak (see the memory-logging sketch below).
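For the next attempt I plan to log memory alongside training with something like the sketch below, so the host/GPU peak right before the actor dies can be confirmed. It assumes psutil and pynvml are installed; neither is part of ChatLearn.

import time
import psutil
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    host = psutil.virtual_memory()
    # Per-GPU used memory in GiB.
    gpu_used = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30 for h in handles]
    print(f"host {host.used / 2**30:.1f}/{host.total / 2**30:.1f} GiB | "
          f"gpu GiB: {[round(g, 1) for g in gpu_used]}", flush=True)
    time.sleep(1)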