Debug ERROR in DeepSpeedFloat16Optimizer  #60

@gajagajago

Description

When omp >= 2, this error occurs stochastically. It only happens when SSO is not used. It is related to the all_reduce in _unscale_main_grads_and_check_for_nan() in DeepSpeedFloat16Optimizer.
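
For orientation, the failing call site follows the usual Megatron-LM mixed-precision pattern: unscale the fp32 main grads into a found-inf flag, then all_reduce that flag across the group. The sketch below is distilled from upstream Megatron-LM's Float16 optimizer, not copied from megatron/spiral/optimizer/optimizer.py; the helper name and arguments here are assumptions, only the all_reduce on the found-inf tensor (cuda_found_inf in the traceback) is the real failure point.

import torch
import torch.distributed as dist

def unscale_and_check_for_nan(main_grads, found_inf, inv_scale, group):
    # main_grads: list of fp32 grad tensors; found_inf / inv_scale: 1-element CUDA float tensors.
    found_inf.fill_(0.0)
    # Unscale grads in place; writes 1.0 into found_inf if any grad contains inf/NaN.
    torch._amp_foreach_non_finite_check_and_unscale_(main_grads, found_inf, inv_scale)
    # Reduce the flag across the group. This is the all_reduce (optimizer.py:146)
    # that raises ncclUnhandledCudaError in the traceback below.
    dist.all_reduce(found_inf, op=dist.ReduceOp.MAX, group=group)
    return found_inf.item() > 0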

The failing script configuration was:

===========Script Configuration===========
JOB_TYPE=mobius
JOB_NAME=llama2
HOSTS=v01:4
NSYS_ENABLE=NO
MODEL_SIZE=19
MBS=1
GBS=8
TRAIN_ITER=10
LOG_ITER=5(skip0=YES)
EVAL_ITER=0
SPIRAL_STAGE_OPTIMIZER=NO(omp=2,pool=1)
SPIRAL_ACTV_P2P=YES
SPIRAL_CROSS_MAPPING=NO
SPIRAL_SYNC_CKPT_COMMUNICATION=NO
SPIRAL_FWD=2
SPIRAL_BWD=2
INTERLEAVE_VIRTUAL_SIZE=2
===========================================

And the error is:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/pretrain_gpt.py", line 116, in <module>
[rank0]:     pretrain(train_valid_test_datasets_provider,
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 212, in pretrain
[rank0]:     iteration = train(forward_step_func,
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 1247, in train
[rank0]:     train_step(forward_step_func,
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 892, in train_step
[rank0]:     update_successful, grad_norm, num_zeros_in_grad = optimizer.step(args, timers)
[rank0]:   File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/optimizer/optimizer.py", line 422, in step
[rank0]:     found_inf_flag = self._unscale_main_grads_and_check_for_nan()
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/spiral/optimizer/optimizer.py", line 146, in _unscale_main_grads_and_check_for_nan
[rank0]:     torch.distributed.all_reduce(cuda_found_inf,
[rank0]:   File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1724789116784/work/torch/csrc/distributed/c10d/NCCLUtils.cpp:76, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Failed to CUDA calloc async 72 bytes
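
The last line, "Failed to CUDA calloc async 72 bytes", means NCCL could not allocate even a tiny device buffer at the time the collective ran, which would point at GPU memory pressure at that moment rather than at the collective itself (my assumption, not verified). Rerunning with NCCL_DEBUG=INFO, as the message suggests, plus logging free memory right before the all_reduce should confirm or rule this out. A minimal sketch (hypothetical helper, not part of the codebase):

import torch
import torch.distributed as dist

def log_gpu_headroom(tag="before found_inf all_reduce"):
    # Call immediately before the all_reduce in _unscale_main_grads_and_check_for_nan().
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"[rank {dist.get_rank()}] {tag}: "
          f"free={free_b / 2**20:.1f} MiB / total={total_b / 2**20:.1f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB",
          flush=True)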
