Bug fixes for some resource leaks and access to freed resources, etc #122

Open

ston3lu wants to merge 4 commits into vitalif:master from ston3lu:ston3lu-patch-0127

Conversation

@ston3lu (Contributor) commented Jan 27, 2026

Hi. While testing issue #120, which caused frequent OSD restarts, I discovered several resource leaks and accesses to freed resources.

Hope this is helpful.

@ston3lu changed the title from "Fixed some bugs encountered during frequent OSD restarts" to "Bug fixes for some resource leaks and access to freed resources, etc" on Jan 27, 2026
@vitalif (Owner) commented Jan 28, 2026

Hi, thank you very much for your fixes. I'm only doubtful about the "handle missing peer clients" change: did you actually see crashes at that .at()?

The problem is that just checking by peer_fd isn't really correct: if a peer_fd is freed and closed, the FD number may be reused. Checking the peer by its osd_client_t* pointer isn't correct either, because the pointer can also be reused; allocators like to reuse chunks of the same size.

So msgr_receive.cpp tries to guarantee that exec_op() is never called on destroyed clients. It now even clears postponed operations from the "immediate callback queue" when a client is destroyed...

And if you've seen crashes where the client reference was absent for a peer_fd, it probably means this protection didn't work, in which case the same code path could just as well be handed an invalid, reused peer_fd.

So we'd better find and fix the real bug instead of replacing .at() with .find() != .end().
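The difference between the two lookups matters here. A minimal sketch (identifiers like osd_client_t are stand-ins, not the actual Vitastor definitions) of why the .at() crash is the more useful behavior:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the OSD's client map (illustrative only).
struct osd_client_t { std::string name; };
static std::map<int, osd_client_t> clients;

// .at() on a missing key throws std::out_of_range: a loud, debuggable
// failure that points straight at the broken lifecycle invariant.
bool at_throws_for_missing(int peer_fd)
{
    try {
        clients.at(peer_fd);
        return false;
    } catch (const std::out_of_range &) {
        return true;
    }
}

// find() != end() silently skips the missing entry, hiding the bug; worse,
// if the FD number was already reused, find() would succeed and return the
// WRONG client instead of failing at all.
bool find_silently_skips(int peer_fd)
{
    return clients.find(peer_fd) == clients.end();
}
```

So the .find() guard doesn't make the code safe, it only converts a crash you can diagnose into a silent skip (or a wrong-client lookup) you can't.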

From looking at the code it seems that this protection may only break when a recovery operation is postponed by recovery_target_sleep_us...

@vitalif (Owner) commented Jan 28, 2026

"Fix peer connection handling by checking for existing entries" is also similar - did you really observe crashes in on_connect_peer? It shouldn't be possible, it's invoked either from connect_timeout or handle_connect_epoll and it shouldn't be possible to run both for the same FD or to run one of them multiple times.

@ston3lu (Contributor) commented Jan 29, 2026

Yes, the crash indeed occurs in the on_connect_peer function. All of this happened while the OSD was being restarted frequently.

[screenshot: crash backtrace]

> So we'd better find and fix the real bug instead of replacing .at() with .find() != .end().

Understood, I will conduct a deeper analysis to identify the root cause.

@vitalif (Owner) commented Jan 29, 2026

> Yes, the crash indeed occurs in the on_connect_peer function. All of this happened while the OSD was being restarted frequently.

Okay... this one occurred in the check_peer_config() callback.

wanted_peers.erase() is only called in on_connect_peer(). It means that on_connect_peer() was already called for that OSD with a successful connection!

I suspect it's also related to "handle missing peer_fd", i.e. maybe it was a recreated client or something like that...
