[c++] fix: fail fast if distributed connection retries exhausted #7138
Open
wagner-austin wants to merge 3 commits into lightgbm-org:master
If all connection retries fail during distributed training setup, linkers_[rank] remains nullptr. Later Send/Recv operations would then dereference that null pointer and crash with SIGSEGV. This change adds an explicit check so that setup fails immediately with a clear error message instead of crashing later during training.
jameslamb (Member) reviewed on Jan 23, 2026 and left a comment:
This makes sense to me. A directly raised exception would definitely be preferable to a segfault, and it might help with some of the issues lightgbm.dask users have reported.
But I'm not familiar enough with this part of the codebase to be totally confident about the implications of raising an exception there.
@shiyu1994 or @guolinke do you have time to review?
Summary
Adds an explicit null check after the connection retry loop so that setup fails fast with a clear error message instead of crashing later with SIGSEGV.
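The pattern can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: `FakeLink`, `ConnectOrFail`, and the `try_connect` callback are hypothetical names, and throwing `std::runtime_error` is an assumption made to keep the example self-contained (the actual patch may report the failure through the project's own logging utilities).

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a connected peer socket; not LightGBM's class.
struct FakeLink {
  int peer_rank;
};

// Tries to connect to `out_rank` up to `num_retries` times using the
// caller-supplied `try_connect` callback, which returns nullptr on failure.
// Throws instead of returning a null link, so later send/receive code
// never sees an unconnected slot.
std::unique_ptr<FakeLink> ConnectOrFail(
    int out_rank, int num_retries,
    const std::function<std::unique_ptr<FakeLink>(int)>& try_connect) {
  std::unique_ptr<FakeLink> link;  // starts out as nullptr
  for (int attempt = 0; attempt < num_retries && !link; ++attempt) {
    link = try_connect(out_rank);
  }
  // The fail-fast check: runs once, right after the retry loop.
  if (!link) {
    throw std::runtime_error("Failed to connect to rank " +
                             std::to_string(out_rank) + " after " +
                             std::to_string(num_retries) + " attempts");
  }
  return link;
}
```

With a check like this, a caller that stores the result in its table of links either gets a usable connection or an error that names the unreachable rank at setup time, rather than a null entry that is only discovered on the first send or receive.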
Problem
In linkers_socket.cpp:196-216, if all connection retries fail:
- linkers_[out_rank] remains nullptr (initialized at line 56)
- Send()/Recv() in linkers.h:245,257 dereference without null check
Evidence
Tested by reducing retry count and connecting to unreachable address (192.0.2.1):
Before fix:
After fix:
Changes
- src/network/linkers_socket.cpp: Add null check after retry loop (3 lines)
Test Plan
Related