Skip to content

[c++] fix: fail fast if distributed connection retries exhausted#7138

Open
wagner-austin wants to merge 3 commits intolightgbm-org:masterfrom
wagner-austin:fix-null-linker-crash
Open

[c++] fix: fail fast if distributed connection retries exhausted#7138
wagner-austin wants to merge 3 commits intolightgbm-org:masterfrom
wagner-austin:fix-null-linker-crash

Conversation

@wagner-austin
Copy link
Contributor

Summary

Adds explicit null check after connection retry loop to fail fast with clear error message instead of crashing later with SIGSEGV.

Problem

In linkers_socket.cpp:196-216, if all connection retries fail:

  1. linkers_[out_rank] remains nullptr (initialized at line 56)
  2. No error is raised - code continues silently
  3. Later, Send()/Recv() in linkers.h:245,257 dereference without null check
  4. Result: SIGSEGV (null pointer dereference)

Evidence

Tested by reducing retry count and connecting to unreachable address (192.0.2.1):

Before fix:

[Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[Info] Finished initializing network
Segmentation fault (exit code 139)

After fix:

[Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[Fatal] Failed to connect to rank 1 after 2 retries
(exit code 127 - normal error exit)

Changes

  • src/network/linkers_socket.cpp: Add null check after retry loop (3 lines)

Test Plan

  • Pre-commit passes
  • Windows build compiles (MSVC)
  • Linux build compiles (GCC in WSL)
  • C++ unit tests pass (31/31)
  • Verified crash before fix (SIGSEGV)
  • Verified clean error after fix

Related

If all connection retries fail during distributed training setup,
linkers_[rank] remains nullptr. Later Send/Recv operations would
dereference nullptr causing SIGSEGV. This adds an explicit check
to fail immediately with a clear error message instead of crashing
later during training.
Copy link
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me and a direct raised exception would definitely be preferable to a segfault and might help with some of the issues lightgbm.dask users have reported.

But I'm not familiar enough with this part of the codebase to be totally confident about the implications of raising an exception there.

@shiyu1994 or @guolinke do you have time to review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants