Describe the bug
Training with the provided script scripts/mcore_sglang/train_mcore_sglang_qwen3_next_grpo.sh crashes during engine setup with a Ray ActorDiedError. The error reports that the worker process died with exit type SYSTEM_ERROR and "connection error code 2. End of file." The message lists three possible root causes (SIGKILL from the OOM killer, ray stop --force, or an unexpected crash such as SIGSEGV), but I did not explicitly trigger any of them.
Full stack trace / logs
Traceback (most recent call last):
  File "/pai-mega/ChatLearn/chatlearn/entrypoint.py", line 121, in run
    self._run_algorithm(algo_args)
  File "/pai-mega/ChatLearn/chatlearn/entrypoint.py", line 103, in _run_algorithm
    instance.run()
  File "/pai-mega/ChatLearn/chatlearn/algorithm/grpo.py", line 351, in run
    engine.learn()
  File "/pai-mega/ChatLearn/chatlearn/runtime/engine.py", line 445, in learn
    self.setup()
  File "/pai-mega/ChatLearn/chatlearn/runtime/engine.py", line 295, in setup
    executor.update_models(self.remote_models)
  File "/pai-mega/ChatLearn/chatlearn/runtime/executor.py", line 106, in update_models
    dist_model.group_dist_actors_by_dp_rank()
  File "/pai-mega/ChatLearn/chatlearn/runtime/dist_actor.py", line 485, in group_dist_actors_by_dp_rank
    replica.group_dist_actors_by_dp_rank()
  File "/pai-mega/ChatLearn/chatlearn/runtime/dist_actor.py", line 147, in group_dist_actors_by_dp_rank
    self.data_parallel_size = future.get(self.all_actors[0].get_data_parallel_size.remote())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pai-mega/ChatLearn/chatlearn/utils/future.py", line 90, in get
    data = ray.get(data)
           ^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/workspace/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 102, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 433, in get
    res = self._get(to_get, op_timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.12/site-packages/ray/util/client/worker.py", line 461, in _get
    raise err
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: MegatronPolicyTrainer
    actor_id: ZZZ
    pid: YYY
    name: ref_policy_replica
    namespace: ref_policy
    ip: XXX
The actor is dead because its worker process has died.
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
There are some potential root causes.
(1) The process is killed by SIGKILL by OOM killer due to high memory usage.
(2) ray stop --force is called.
(3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
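To check whether other actors died at the same time, something like the following can be run against the still-live cluster. This is only a sketch: it assumes the Ray version here ships the state API (ray.util.state.list_actors) and that death_cause is populated when detail=True.

import ray
from ray.util.state import list_actors

# Connect to the existing cluster (assumes it is still up after the crash).
ray.init(address="auto")

# detail=True is required for death_cause in the state API.
for actor in list_actors(detail=True, limit=1000):
    if actor.state == "DEAD":
        print(actor.ray_namespace, actor.name, actor.class_name, actor.pid)
        print("  death_cause:", actor.death_cause)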
Launch with Slurm
scripts/mcore_sglang/train_mcore_sglang_qwen3_next_grpo.sh
What I expected
Training to start and progress past engine setup without Ray actors dying.
What actually happened
During engine.setup() → executor.update_models() → group_dist_actors_by_dp_rank(), the MegatronPolicyTrainer actor backing ref_policy_replica died, and the job aborted with ActorDiedError.
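A minimal way to isolate that call is sketched below. It assumes the actor name and namespace from the error message (ref_policy_replica / ref_policy) are resolvable via ray.get_actor; get_data_parallel_size is the same remote method the setup path invokes in dist_actor.py.

import ray
from ray.exceptions import ActorDiedError

ray.init(address="auto", namespace="ref_policy")

# Actor name and namespace taken from the ActorDiedError metadata above.
actor = ray.get_actor("ref_policy_replica")
try:
    # Same remote method that group_dist_actors_by_dp_rank calls during setup.
    dp_size = ray.get(actor.get_data_parallel_size.remote(), timeout=60)
    print("data_parallel_size:", dp_size)
except ActorDiedError as err:
    print("actor already dead before the call:", err)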
Things I checked
・No manual ray stop --force.
・No deliberate kill signals sent.
・System logs didn’t show an obvious OOM kill at the time.
・GPU memory utilization was close to full shortly before the crash, but I could not capture the peak (see the memory-logging sketch below).
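For the next attempt I plan to log memory alongside training with something like the sketch below, so the host/GPU peak right before the actor dies can be confirmed. It assumes psutil and pynvml are installed; neither is part of ChatLearn.

import time
import psutil
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    host = psutil.virtual_memory()
    # Per-GPU used memory in GiB.
    gpu_used = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30 for h in handles]
    print(f"host {host.used / 2**30:.1f}/{host.total / 2**30:.1f} GiB | "
          f"gpu GiB: {[round(g, 1) for g in gpu_used]}", flush=True)
    time.sleep(1)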