Facing issues when trying to reproduce the same run with 4xH200s #27

@harisarang

Description

I followed the installation instructions in the repository's README.md
and attempted to run the SDPO generalization experiment. However,
training fails during model initialization with an ImportError raised
while importing flash_attn (full log under Error Logs).

Environment

  • Python version: 3.12
  • CUDA version: 12.8
  • PyTorch version: 2.8.0
  • GPU: 2 × NVIDIA H200

Steps to Reproduce

  1. Follow the installation instructions from README.md.
  2. Run the experiment script:
     bash experiments/generalization/run_sdpo_all.sh
  3. The process fails during model initialization.

Error Logs

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(TaskRunner pid=37717) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_ref_init_model() (pid=38309, ip=172.17.0.3, actor_id=d6503e2665e2d40fc6795be001000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7114d32adb20>)
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=37717)     return self.__get_result()
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=37717)     raise self._exception
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/single_controller/ray/base.py", line 844, in func
(TaskRunner pid=37717)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/single_controller/base/decorator.py", line 462, in inner
(TaskRunner pid=37717)     return func(*args, **kwargs)
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
(TaskRunner pid=37717)     output = func(*args, **kwargs)
(TaskRunner pid=37717)              ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/workers/fsdp_workers.py", line 812, in init_model
(TaskRunner pid=37717)     ) = self._build_model_optimizer(
(TaskRunner pid=37717)         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/workspace/verl/workers/fsdp_workers.py", line 400, in _build_model_optimizer
(TaskRunner pid=37717)     actor_module = actor_module_class.from_pretrained(
(TaskRunner pid=37717)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
(TaskRunner pid=37717)     return model_class.from_pretrained(
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 288, in _wrapper
(TaskRunner pid=37717)     return func(*args, **kwargs)
(TaskRunner pid=37717)            ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5103, in from_pretrained
(TaskRunner pid=37717)     model = cls(config, *model_args, **model_kwargs)
(TaskRunner pid=37717)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 435, in __init__
(TaskRunner pid=37717)     super().__init__(config)
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2197, in __init__
(TaskRunner pid=37717)     self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
(TaskRunner pid=37717)                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2812, in _check_and_adjust_attn_implementation
(TaskRunner pid=37717)     lazy_import_flash_attention(applicable_attn_implementation)
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 136, in lazy_import_flash_attention
(TaskRunner pid=37717)     _flash_fn, _flash_varlen_fn, _pad_fn, _unpad_fn = _lazy_imports(implementation)
(TaskRunner pid=37717)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 83, in _lazy_imports
(TaskRunner pid=37717)     from flash_attn import flash_attn_func, flash_attn_varlen_func
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(TaskRunner pid=37717)     from flash_attn.flash_attn_interface import (
(TaskRunner pid=37717)   File "/venv/main/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(TaskRunner pid=37717)     import flash_attn_2_cuda as flash_attn_gpu
(TaskRunner pid=37717) ImportError: /venv/main/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb
(WorkerDict pid=38309) /workspace/verl/utils/tokenizer.py:107: UserWarning: Failed to create processor: Unsupported processor type: Qwen2TokenizerFast. This may affect multimodal processing [repeated 3x across cluster]
(WorkerDict pid=38309)   warnings.warn(f"Failed to create processor: {e}. This may affect multimodal processing", stacklevel=1) [repeated 3x across cluster]
(WorkerDict pid=38309) `torch_dtype` is deprecated! Use `dtype` instead! [repeated 3x across cluster]
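
The undefined symbol in the ImportError demangles to c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool), a function from PyTorch's c10 CUDA library. That suggests, though I have not confirmed it, that the installed flash_attn binary was built against a different PyTorch build than the 2.8.0 in this environment. Below is a minimal sketch for reproducing the import failure outside Ray and verl. The helper names (package_version, try_import, collect_report) are my own, and the distribution name "flash-attn" is assumed to match the installed package.

```python
import importlib
import importlib.metadata as md
import sys


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return md.version(name)
    except md.PackageNotFoundError:
        return "not installed"


def try_import(module_name: str) -> str:
    """Import a module and return 'ok' or a description of the exception."""
    try:
        importlib.import_module(module_name)
        return "ok"
    except Exception as exc:  # covers missing modules and undefined-symbol ImportErrors
        return f"{type(exc).__name__}: {exc}"


def collect_report() -> dict:
    """Gather versions and import results relevant to the failure above."""
    report = {
        "python": sys.version.split()[0],
        "torch": package_version("torch"),
        "flash-attn": package_version("flash-attn"),  # assumed distribution name
    }
    # Import torch first so its shared libraries are loaded before the
    # flash_attn extension module tries to resolve symbols against them.
    report["import torch"] = try_import("torch")
    report["import flash_attn_2_cuda"] = try_import("flash_attn_2_cuda")
    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

If "import torch" reports ok and "import flash_attn_2_cuda" reports the same undefined-symbol ImportError, the problem is reproducible without the training stack, which would point at the flash_attn installation and not at the experiment script.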

Any guidance would be appreciated. Thanks!
