Description
While running the multi-GPU PyTorch tests, test_all_reduce_coalesced_nccl fails in pytorch/test/distributed/test_c10d_nccl.py. The failure appears to come from allreduce producing incorrect results. The NCCL log output is as follows:
171495ffc000000:237471:237471 [0] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237471:237471 [0] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237471:237471 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO Using network Socket
NCCL version 2.12.12.MSCCL.0.1+cuda11.3
171495ffc000000:237471:237535 [0] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x203400000
171495ffc000000:237472:237472 [1] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237472:237472 [1] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237472:237472 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO Using network Socket
171495ffc000000:237472:237536 [1] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x206800000
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===
Additional error info:
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
  File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 601, in run_test
    getattr(self, test_name)()
  File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 486, in wrapper
    fn()
  File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 3098, in wrapper
    return func(*args, **kwargs)
  File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 131, in wrapper
    return func(*args, **kwargs)
  File "/mnt/vss/_work/1/s/test/pytorch/test/distributed/test_c10d_nccl.py", line 2867, in test_all_reduce_coalesced_nccl
    self.assertEqual(t, torch.full_like(t, self.world_size * (i + (self.world_size + 1.) / 2.)))
  File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
    assert_equal(
  File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_comparison.py", line 1080, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!
Mismatched elements: 60 / 60 (100.0%)
Greatest absolute difference: 1.0 at index 0 (up to 1e-05 allowed)
Greatest relative difference: 0.3333333333333333 at index 0 (up to 1.3e-06 allowed)
exiting process 1 with exit code: 10
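
To help interpret the numbers in the assertion message, here is a small sketch that evaluates the expected value from the formula in the failing assertEqual call. It assumes world_size = 2 (the log shows two ranks, [0] and [1]) and i = 0, which is the tensor index for which the formula matches the reported relative difference of 1/3. The per-rank input values (rank + 1 + i) are also an assumption: they are inferred from the formula, since summing them over all ranks gives the same expression. The helper names are hypothetical and not part of the PyTorch test suite.

```python
def expected_allreduce_value(world_size: int, i: int) -> float:
    # Same expression as in the failing assertEqual call.
    return world_size * (i + (world_size + 1.0) / 2.0)


def assumed_rank_inputs(world_size: int, i: int) -> list:
    # Assumption: rank r fills tensor i with the value r + 1 + i.
    return [float(r + 1 + i) for r in range(world_size)]


world_size = 2  # assumption: two ranks, as seen in the NCCL log
i = 0           # assumption: first tensor in the coalesced list

expected = expected_allreduce_value(world_size, i)
inputs = assumed_rank_inputs(world_size, i)

# The formula equals the sum of the assumed per-rank inputs.
assert sum(inputs) == expected

# The message reports an absolute difference of 1.0 at index 0.
abs_diff = 1.0
rel_diff = abs_diff / abs(expected)

# Values the failing rank could have observed, given that absolute difference.
candidates = [expected - abs_diff, expected + abs_diff]

print("expected:", expected)
print("assumed per-rank inputs:", inputs)
print("relative difference:", rel_diff)
print("possible observed values:", candidates)
```

Under these assumptions the expected value is 3.0, so the failing rank observed either 2.0 or 4.0 in every element. The value 2.0 would coincide with rank 1's own assumed input, which would be consistent with the reduction not being applied on that rank, but the log does not print the observed tensor, so this is only a hypothesis.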