Debug ERROR in DeepSpeedFloat16Optimizer  #60

@gajagajago

Description

When omp >= 2, this error occurs stochastically. It only happens when SSO is not used. It is related to the all_reduce in _unscale_main_grads_and_check_for_nan() in DeepSpeedFloat16Optimizer.
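
For orientation, the failing call site follows the usual Megatron-LM mixed-precision pattern: unscale the fp32 main grads into a found-inf flag, then all_reduce that flag across the group. The sketch below is distilled from upstream Megatron-LM's Float16 optimizer, not copied from megatron/spiral/optimizer/optimizer.py; the helper name and arguments here are assumptions, only the all_reduce on the found-inf tensor (cuda_found_inf in the traceback) is the real failure point.

import torch
import torch.distributed as dist

def unscale_and_check_for_nan(main_grads, found_inf, inv_scale, group):
    # main_grads: list of fp32 grad tensors; found_inf / inv_scale: 1-element CUDA float tensors.
    found_inf.fill_(0.0)
    # Unscale grads in place; writes 1.0 into found_inf if any grad contains inf/NaN.
    torch._amp_foreach_non_finite_check_and_unscale_(main_grads, found_inf, inv_scale)
    # Reduce the flag across the group. This is the all_reduce (optimizer.py:146)
    # that raises ncclUnhandledCudaError in the traceback below.
    dist.all_reduce(found_inf, op=dist.ReduceOp.MAX, group=group)
    return found_inf.item() > 0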

The failing script configuration was:

===========Script Configuration===========
JOB_TYPE=mobius
JOB_NAME=llama2
HOSTS=v01:4
NSYS_ENABLE=NO
MODEL_SIZE=19
MBS=1
GBS=8
TRAIN_ITER=10
LOG_ITER=5(skip0=YES)
EVAL_ITER=0
SPIRAL_STAGE_OPTIMIZER=NO(omp=2,pool=1)
SPIRAL_ACTV_P2P=YES
SPIRAL_CROSS_MAPPING=NO
SPIRAL_SYNC_CKPT_COMMUNICATION=NO
SPIRAL_FWD=2
SPIRAL_BWD=2
INTERLEAVE_VIRTUAL_SIZE=2
===========================================

And the error is:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/pretrain_gpt.py", line 116, in <module>
[rank0]:     pretrain(train_valid_test_datasets_provider,
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 212, in pretrain
[rank0]:     iteration = train(forward_step_func,
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 1247, in train
[rank0]:     train_step(forward_step_func,
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 892, in train_step
[rank0]:     update_successful, grad_norm, num_zeros_in_grad = optimizer.step(args, timers)
[rank0]:   File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/optimizer/optimizer.py", line 422, in step
[rank0]:     found_inf_flag = self._unscale_main_grads_and_check_for_nan()
[rank0]:   File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/spiral/optimizer/optimizer.py", line 146, in _unscale_main_grads_and_check_for_nan
[rank0]:     torch.distributed.all_reduce(cuda_found_inf,
[rank0]:   File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1724789116784/work/torch/csrc/distributed/c10d/NCCLUtils.cpp:76, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Failed to CUDA calloc async 72 bytes
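
The last line, "Failed to CUDA calloc async 72 bytes", means NCCL could not allocate even a tiny device buffer at the time the collective ran, which would point at GPU memory pressure at that moment rather than at the collective itself (my assumption, not verified). Rerunning with NCCL_DEBUG=INFO, as the message suggests, plus logging free memory right before the all_reduce should confirm or rule this out. A minimal sketch (hypothetical helper, not part of the codebase):

import torch
import torch.distributed as dist

def log_gpu_headroom(tag="before found_inf all_reduce"):
    # Call immediately before the all_reduce in _unscale_main_grads_and_check_for_nan().
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"[rank {dist.get_rank()}] {tag}: "
          f"free={free_b / 2**20:.1f} MiB / total={total_b / 2**20:.1f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB",
          flush=True)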
