CUDA Out of Memory After 160 Epochs #17

@ashwin296

Description

Hi there, thanks for open-sourcing the code!

Summary of the Issue:
I encountered a CUDA out-of-memory error while running bash main_grpo.sh on the code-r1-2k-leetcode2k-taco dataset, after 160 successful epochs. Training was stable until then, but it suddenly crashed during loss.backward(). Do you have any insight into why this might happen after so many stable epochs?

Command:

bash main_grpo.sh

The only change I made to main_grpo.sh is switching DATASET=code-r1-12k to DATASET=code-r1-2k-leetcode2k-taco, as shown below.
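
For clarity, the entire diff to the script is this one assignment (variable name exactly as it appears in main_grpo.sh; nothing else was touched):

    # main_grpo.sh
    # was: DATASET=code-r1-12k
    DATASET=code-r1-2k-leetcode2k-taco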

Error Message:

Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=data/code-r1-2k-leetcode2k-taco/train.parquet', 'data.val_files=data/code-r1-2k-leetcode2k-taco/test.parquet', 'data.train_batch_size=16', 'data.max_prompt_length=2048', 'data.max_response_length=4096', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct-1M', 'actor_rollout_ref.actor.optim.lr=5e-7', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=False', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=256', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.rollout.n=16', 'actor_rollout_ref.ref.log_prob_micro_batch_size=256', 'actor_rollout_ref.ref.fsdp_config.param_offload=False', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[wandb]', 'trainer.project_name=code-r1', 'trainer.experiment_name=code-r1-2k-leetcode2k-taco-grpo', 'trainer.nnodes=1', 'trainer.default_local_dir=./models/code-r1-2k-leetcode2k-taco-grpo', 'trainer.n_gpus_per_node=8', 'trainer.save_freq=64', 'trainer.test_freq=16', 'trainer.total_epochs=8', 'reward_model.reward_manager=prime']
Traceback (most recent call last):
  File "/home/***/github/code-r1/verl/trainer/main_ppo.py", line 25, in main
    run_ppo(config)
  File "/home/***/github/code-r1/verl/trainer/main_ppo.py", line 33, in run_ppo
    ray.get(main_task.remote(config, compute_score))
  File "/opt/conda/envs/code/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/envs/code/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/code/lib/python3.10/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/opt/conda/envs/code/lib/python3.10/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::main_task() (pid=399789, ip=10.128.0.9)
  File "/home/***/github/code-r1/verl/trainer/main_ppo.py", line 128, in main_task
    trainer.fit()
  File "/home/***/github/code-r1/verl/trainer/ppo/ray_trainer.py", line 1004, in fit
    actor_output = self.actor_rollout_wg.update_actor(batch)
  File "/home/***/github/code-r1/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(OutOfMemoryError): ray::WorkerDict.actor_rollout_update_actor() (pid=400862, ip=10.128.0.9, actor_id=87523c02ec964b128fbc710001000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7edde69d7850>)
  File "/home/***/github/code-r1/verl/single_controller/ray/base.py", line 399, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/home/***/github/code-r1/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
  File "/home/***/github/code-r1/verl/workers/fsdp_workers.py", line 435, in update_actor
    metrics = self.actor.update_policy(data=data)
  File "/home/***/github/code-r1/verl/workers/actor/dp_actor.py", line 313, in update_policy
    loss.backward()
  File "/opt/conda/envs/code/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/code/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
    _engine_run_backward(
  File "/opt/conda/envs/code/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.16 GiB. GPU 0 has a total capacity of 79.10 GiB of which 1.63 GiB is free. Including non-PyTorch memory, this process has 77.42 GiB memory in use. Of the allocated memory 68.16 GiB is allocated by PyTorch, and 3.49 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
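
For context, the settings in the command above that most affect actor memory at loss.backward() are data.train_batch_size=16, actor_rollout_ref.rollout.n=16, actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8, data.max_prompt_length=2048, data.max_response_length=4096, and actor_rollout_ref.model.enable_gradient_checkpointing=False. As a rough back-of-the-envelope figure (my arithmetic, not profiler output), each backward micro-batch can hold up to 8 × (2048 + 4096) = 49,152 tokens of activations with checkpointing disabled, so peak memory depends heavily on how long the sampled responses actually get at a given step.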

Additional Information:

  • Hardware: 8×H100 GPUs with 80 GB of memory each.
  • Software: I mostly followed the installation script, but I ran into issues with the vllm version and ended up using torch==2.4.0 and vllm==0.6.3.
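
Based on the allocator hint in the error message, these are the mitigations I plan to try next, sketched below against the same settings that appear in the command above (I have not yet verified that they resolve the OOM):

    # reduce allocator fragmentation, as the error message itself suggests
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

    # trade compute for activation memory and shrink the backward micro-batch
    # (edit these values in main_grpo.sh, or pass them as extra Hydra overrides
    #  if the script forwards additional arguments)
    actor_rollout_ref.model.enable_gradient_checkpointing=True
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4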

Any suggestions on troubleshooting would be greatly appreciated!!
