RuntimeError: CUDA error: an illegal memory access was encountered #156

@jpainam

Hi,
Has anyone managed to train on multiple GPUs? I'm using this command:
python train_3d.py --outdir=./outdir --data=shapenet_get3d/img/03790512 --camera_path shapenet_get3d/camera --gpus=8 --batch=32 --gamma=40 --data_camera_mode shapenet_motorbike --dmtet_scale 1.0 --use_shapenet_split 1 --one_3d_generator 0 --img_res=256 --kimg=200 --workers 1

Constructing networks...
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Setting up augmentation...
Distributing across 8 GPUs...
Traceback (most recent call last):
  File "train_3d.py", line 339, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_3d.py", line 333, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "~/GET3D/train_3d.py", line 51, in subprocess_fn
    training_loop_3d.training_loop(rank=rank, **c)
  File "~/GET3D/training/training_loop_3d.py", line 159, in training_loop
    G = dnnlib.util.construct_class_by_name(**G_kwargs, **common_kwargs).train().requires_grad_(False).to(
  File "~/GET3D/dnnlib/util.py", line 306, in construct_class_by_name
    return call_func_by_name(*args, func_name=class_name, **kwargs)
  File "~/GET3D/dnnlib/util.py", line 301, in call_func_by_name
    return func_obj(*args, **kwargs)
  File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
    super().__init__(*args, **kwargs)
  File "~/GET3D/training/networks_get3d.py", line 599, in __init__
    self.synthesis = DMTETSynthesisNetwork(
  File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
    super().__init__(*args, **kwargs)
  File "~/GET3D/training/networks_get3d.py", line 81, in __init__
    self.dmtet_geometry = DMTetGeometry(
  File "~/GET3D/uni_rep/rep_3d/dmtet.py", line 423, in __init__
    all_edges_sorted = torch.sort(all_edges, dim=1)[0]
RuntimeError: CUDA error: an illegal memory access was encountered
