Bug fixes for some resource leaks and access to freed resources, etc #122

Open

ston3lu wants to merge 4 commits into vitalif:master from ston3lu:ston3lu-patch-0127

Conversation

@ston3lu (Contributor) commented Jan 27, 2026

Hi. While testing issue #120, which caused frequent OSD restarts, I discovered several resource leaks and accesses to freed resources.

Hope this is helpful.

@ston3lu changed the title from "Fixed some bugs encountered during frequent OSD restarts" to "Bug fixes for some resource leaks and access to freed resources, etc" on Jan 27, 2026
@vitalif (Owner) commented Jan 28, 2026

Hi, thank you very much for your fixes. I'm only doubtful about the "handle missing peer clients" change: did you actually see crashes at that .at()?

The problem is that just checking by peer_fd isn't really correct: if a peer_fd is freed and closed, the FD number may be reused. Checking the peer by its osd_client_t* pointer isn't correct either, because the pointer can also be reused; allocators like to reuse chunks of the same size.

So msgr_receive.cpp tries to guarantee that exec_op() is never called on destroyed clients. It now even clears postponed operations from the "immediate callback queue" when a client is destroyed...

And if you've seen crashes where the client reference was absent for a peer_fd, it probably means this protection didn't work, in which case the same code path could just as well be handed an invalid, reused peer_fd.

So we'd better find and fix the real bug instead of replacing .at() with .find() != .end().
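The difference between the two lookups matters here. A minimal sketch (identifiers like osd_client_t are stand-ins, not the actual Vitastor definitions) of why the .at() crash is the more useful behavior:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the OSD's client map (illustrative only).
struct osd_client_t { std::string name; };
static std::map<int, osd_client_t> clients;

// .at() on a missing key throws std::out_of_range: a loud, debuggable
// failure that points straight at the broken lifecycle invariant.
bool at_throws_for_missing(int peer_fd)
{
    try {
        clients.at(peer_fd);
        return false;
    } catch (const std::out_of_range &) {
        return true;
    }
}

// find() != end() silently skips the missing entry, hiding the bug; worse,
// if the FD number was already reused, find() would succeed and return the
// WRONG client instead of failing at all.
bool find_silently_skips(int peer_fd)
{
    return clients.find(peer_fd) == clients.end();
}
```

So the .find() guard doesn't make the code safe, it only converts a crash you can diagnose into a silent skip (or a wrong-client lookup) you can't.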

From looking at the code it seems that this protection may only break when a recovery operation is postponed by recovery_target_sleep_us...

@vitalif (Owner) commented Jan 28, 2026

"Fix peer connection handling by checking for existing entries" is also similar - did you really observe crashes in on_connect_peer? It shouldn't be possible, it's invoked either from connect_timeout or handle_connect_epoll and it shouldn't be possible to run both for the same FD or to run one of them multiple times.

@ston3lu (Contributor) commented Jan 29, 2026

Yes, the crash indeed occurs in the on_connect_peer function. All of this happened while the OSD was being restarted frequently.

[screenshot: crash backtrace]

> So we'd better find and fix the real bug instead of replacing .at() with .find() != .end().

Understood, I will conduct a deeper analysis to identify the root cause.

@vitalif (Owner) commented Jan 29, 2026

> Yes, the crash indeed occurs in the on_connect_peer function. All of this happened while the OSD was being restarted frequently.

Okay... this one occurred in the check_peer_config() callback.

wanted_peers.erase() is only called in on_connect_peer(). It means that on_connect_peer() was already called for that OSD with a successful connection!

I suspect it's also related to "handle missing peer_fd", i.e. maybe it was a recreated client or something like that...
