Description
Hello,
I've been trying to run distributed Galois for quite some time. I've tried all of the provided apps, and every one of them hits a segmentation fault.
Command used:
./sssp-pull --startNode=0 $graphPath
Error observed:
[0] Master distribution time : 0.239983 seconds to read 168 bytes in 20 seeks (0.00070005 MBPS)
[0] Starting graph reading.
[0] Reading graph complete.
[0] Edge inspection time: 0.246308 seconds to read 148615096 bytes (603.371 MBPS)
Loading edge-data while creating edges
[0] Edge loading time: 0.529808 seconds to read 271105352 bytes (511.705 MBPS)
[0] Graph construction complete.
[0] InitializeGraph::go called
[0] SSSP::go run 0 called
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6dc8604 in PMPI_Iallreduce () from /lfs/sware/openmpi411/lib/libmpi.so.40
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64
Here are some of my observations from loading the binary into gdb.
The segfault occurs on a PMPI_Iallreduce call made from the library libdist/libgalois_dist_async.a.
After observing the segfault, I opened gdb and noticed that a constant address outside the process's memory bounds was being accessed.
This address is loaded into r9 in the preamble to the MPI_Iallreduce call and is then moved into rbp inside the callee; it never appears to be accessible.
As per the System V AMD64 ABI, r9 holds the sixth argument passed to a function, which in our case is the communicator, MPI_COMM_WORLD.
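For reference, here is the standard MPI-3 prototype of MPI_Iallreduce with the System V AMD64 argument registers annotated (my annotation; the constants in the preamble further down line up with this mapping):

/* Standard MPI-3 prototype; register notes follow the System V AMD64 calling convention. */
int MPI_Iallreduce(const void *sendbuf,   /* rdi */
                   void *recvbuf,         /* rsi */
                   int count,             /* rdx: edx = 0x1 in the preamble */
                   MPI_Datatype datatype, /* rcx: ecx = 0x4c000808 */
                   MPI_Op op,             /* r8:  r8d = 0x58000001 */
                   MPI_Comm comm,         /* r9:  r9d = 0x44000000, the inaccessible address */
                   MPI_Request *request); /* seventh argument, pushed on the stack */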
This could happen if MPI_COMM_WORLD was not initialised, which would suggest that this code path never calls MPI_Init().
Also, gdb could only set a pending (future) breakpoint on MPI_Init, and the segfault in MPI_Iallreduce occurred before the MPI_Init breakpoint was ever hit. I don't see any boost_mpi libraries being loaded.
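One way to rule out the MPI installation itself would be a minimal standalone check like the following (a sketch of my own, not from the Galois tree), which does an Iallreduce on MPI_COMM_WORLD with MPI_Init properly in place:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* With MPI_Init called, MPI_COMM_WORLD should be a valid handle
       and the Iallreduce below should not fault. */
    MPI_Init(&argc, &argv);

    int in = 1, out = 0;
    MPI_Request req;
    MPI_Iallreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("iallreduce result: %d\n", out);
    MPI_Finalize();
    return 0;
}

If this also crashes when built and run against the same openmpi411 install, the problem is in the MPI setup itself; if it runs cleanly, the fault is specific to how the Galois binary was built or linked.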
This is the preamble to the PMPI_Iallreduce call:
0x478441 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+145>: mov $0x44000000,%r9d
0x478447 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+151>: movb $0x0,-0xe8(%rbp)
0x47844e <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+158>: mov $0x58000001,%r8d
0x478454 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+164>: mov $0x4c000808,%ecx
0x478459 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+169>: mov $0x1,%edx
0x47845e <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+174>: mov %rsi,-0x3a8(%rbp)
0x478465 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+181>: mov %rdi,-0x3a0(%rbp)
0x47846c <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+188>: mov %rax,-0x388(%rbp)
0x478473 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+195>: vmovdqa %xmm3,-0x100(%rbp)
0x47847b <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+203>: push %rax
0x47847c <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+204>: callq 0x413230 <MPI_Iallreduce@plt>
This is where the segfault happens, inside PMPI_Iallreduce:
0x00007ffff6dc85c8 <+24>: mov %r9,%rbp
0x00007ffff6dc85cb <+27>: push %rbx
0x00007ffff6dc85cc <+28>: sub $0x28,%rsp
0x00007ffff6dc85d0 <+32>: mov 0x29e041(%rip),%rax # 0x7ffff7066618
0x00007ffff6dc85d7 <+39>: mov 0x60(%rsp),%rbx
0x00007ffff6dc85dc <+44>: cmpb $0x0,(%rax)
0x00007ffff6dc85df <+47>: je 0x7ffff6dc8648 <PMPI_Iallreduce+152>
0x00007ffff6dc85e1 <+49>: mov 0x29e8d0(%rip),%rax # 0x7ffff7066eb8
0x00007ffff6dc85e8 <+56>: mov (%rax),%eax
0x00007ffff6dc85ea <+58>: sub $0x2,%eax
0x00007ffff6dc85ed <+61>: cmp $0x2,%eax
0x00007ffff6dc85f0 <+64>: ja 0x7ffff6dc8750 <PMPI_Iallreduce+416>
0x00007ffff6dc85f6 <+70>: test %rbp,%rbp
0x00007ffff6dc85f9 <+73>: je 0x7ffff6dc8612 <PMPI_Iallreduce+98>
0x00007ffff6dc85fb <+75>: cmp 0x29e1c6(%rip),%rbp # 0x7ffff70667c8
0x00007ffff6dc8602 <+82>: je 0x7ffff6dc8612 <PMPI_Iallreduce+98>
=> 0x00007ffff6dc8604 <+84>: mov 0xe8(%rbp),%eax
Note the address 0x44000000; it does not appear to be accessible:
(gdb) p/x *0x44000000
Cannot access memory at address 0x44000000
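As a further check, a tiny program like this (again just a sketch) would print the value that mpi.h bakes into a binary for MPI_COMM_WORLD at compile time, for comparison against the 0x44000000 seen in r9:

#include <mpi.h>
#include <stdio.h>

int main(void) {
    /* MPI_COMM_WORLD is a compile-time constant from mpi.h; no MPI_Init
       is needed just to look at the raw handle value. */
    printf("MPI_COMM_WORLD = %p\n", (void *)(unsigned long)MPI_COMM_WORLD);
    return 0;
}

If it prints 0x44000000, then the bad value in r9 is simply the constant this binary was compiled with, which should help narrow down where it comes from.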