When omp >= 2, this error occurs stochastically, and only when SSO is not used. It appears to be related to the all_reduce in _unscale_main_grads_and_check_for_nan() in DeepSpeedFloat16Optimizer.
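For context, here is a minimal sketch of the step that fails, assuming the usual Megatron-LM fp16 optimizer flow; the names main_grads, inv_scale, and group are illustrative placeholders, not the actual signature in megatron/spiral/optimizer/optimizer.py:

```python
import torch
import torch.distributed as dist

def unscale_main_grads_and_check_for_nan(main_grads, inv_scale, group):
    """Illustrative sketch only, not the actual spiral optimizer code."""
    # Unscale the fp32 main grads and record whether any inf/NaN was seen.
    # inv_scale is a 1-element float tensor (the inverse loss scale).
    found_inf = torch.zeros(1, dtype=torch.float, device="cuda")
    torch._amp_foreach_non_finite_check_and_unscale_(
        main_grads, found_inf, inv_scale
    )
    # This is the collective the traceback points at: all ranks must agree
    # on whether an inf/NaN was found so they skip the step together.
    # The "Failed to CUDA calloc async 72 bytes" error comes out of this call.
    dist.all_reduce(found_inf, op=dist.ReduceOp.MAX, group=group)
    return found_inf.item() > 0
```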
The script configuration that produced the error was:
===========Script Configuration===========
JOB_TYPE=mobius
JOB_NAME=llama2
HOSTS=v01:4
NSYS_ENABLE=NO
MODEL_SIZE=19
MBS=1
GBS=8
TRAIN_ITER=10
LOG_ITER=5(skip0=YES)
EVAL_ITER=0
SPIRAL_STAGE_OPTIMIZER=NO(omp=2,pool=1)
SPIRAL_ACTV_P2P=YES
SPIRAL_CROSS_MAPPING=NO
SPIRAL_SYNC_CKPT_COMMUNICATION=NO
SPIRAL_FWD=2
SPIRAL_BWD=2
INTERLEAVE_VIRTUAL_SIZE=2
===========================================
And the error was:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/pretrain_gpt.py", line 116, in <module>
[rank0]: pretrain(train_valid_test_datasets_provider,
[rank0]: File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 212, in pretrain
[rank0]: iteration = train(forward_step_func,
[rank0]: File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 1247, in train
[rank0]: train_step(forward_step_func,
[rank0]: File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/training.py", line 892, in train_step
[rank0]: update_successful, grad_norm, num_zeros_in_grad = optimizer.step(args, timers)
[rank0]: File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/optimizer/optimizer.py", line 422, in step
[rank0]: found_inf_flag = self._unscale_main_grads_and_check_for_nan()
[rank0]: File "/home/s6/junyeol/spipe/Megatron-LM-mcrl/megatron/spiral/optimizer/optimizer.py", line 146, in _unscale_main_grads_and_check_for_nan
[rank0]: torch.distributed.all_reduce(cuda_found_inf,
[rank0]: File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1724789116784/work/torch/csrc/distributed/c10d/NCCLUtils.cpp:76, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Failed to CUDA calloc async 72 bytes
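As the message suggests, rerunning with NCCL_DEBUG=INFO should surface the underlying CUDA failure. A minimal way to enable it, assuming it is set in the launcher process before torch.distributed is initialized (exporting it in the launch script works as well):

```python
import os

# NCCL reads these at communicator creation, so they must be set before the
# first torch.distributed call.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"  # optional: narrow the log output
```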