
librdmacm: extend rsocket for Redis, iperf3, memcached and more Linux APIs Rsocket upstream #1702

Open
BatshevaBlack wants to merge 14 commits into linux-rdma:master from BatshevaBlack:rsocket_upstream


@BatshevaBlack

Summary

Extend the rsocket implementation in librdmacm so that applications such as Redis, iperf3, and memcached can use rsocket transparently via LD_PRELOAD (librspreload), and so rsocket aligns with more standard Linux socket and I/O behavior.

Motivation

The rsocket library did not fully support several POSIX/Linux interfaces (epoll, select, accept4, sendfile, fcntl64, and various socket options). Applications that rely on these either failed or fell back to TCP. This change extends the rsocket implementation to implement or fix those interfaces, so the preload can intercept them and route traffic over RDMA.

Changes

  • add/fix epoll (epoll_create, epoll_create1, epoll_ctl, epoll_wait);
  • fix rpoll timeout handling and select;
  • fix cm_svc_run;
  • add accept4, dup, fcntl64, sendfile64;
  • fix rfcntl;
  • extend getsockopt/setsockopt;
  • fix SOCK_STREAM/SOCK_DGRAM handling and connect service/TCP behavior;
  • adjust the rpoll wake-up timeout.

@BatshevaBlack BatshevaBlack marked this pull request as draft February 17, 2026 12:13
@BatshevaBlack BatshevaBlack force-pushed the rsocket_upstream branch 3 times, most recently from 227fd60 to 959cb7e Compare February 23, 2026 10:28
This commit introduces epoll_create functionality to support a
centralized thread for managing all epoll instances.

The epoll_create call creates an epoll_inst struct and two epoll file
descriptors: a "regular epfd" for handling real file descriptors, and
an outer epfd to which the "regular epfd" is added using epoll_ctl.
The outer epfd is the one returned from the epoll_create function.
Additionally, the new epoll instance is registered with a global thread
that processes all instances in a round-robin fashion, efficiently
handling events for both regular and rsocket file descriptors.

The global thread manages polling in two steps for each epoll instance.
First, it iterates through the list of rsocket fds in the epoll struct,
polling each one to check for events. Second, it calls epoll_wait on
the "regular epfd" to gather events from the real file descriptors.
The thread keeps the events in the struct, and proceeds to the next
epoll instance.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
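
The two-descriptor layout described in this commit message relies on a standard Linux property: an epoll fd can itself be registered in another epoll instance and becomes readable when it has pending events. The following is a minimal sketch of that nesting only. The struct and function names are hypothetical and are not taken from the patch; the global thread and the rsocket fd list are left out.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical stand-in for the epoll_inst struct named in the commit. */
struct epoll_inst_sketch {
	int regular_epfd; /* watches real (kernel) file descriptors */
	int outer_epfd;   /* handed back to the caller; contains regular_epfd */
};

/*
 * Create both epoll fds and nest the regular one inside the outer one.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
static int epoll_inst_sketch_init(struct epoll_inst_sketch *inst)
{
	struct epoll_event ev = { 0 };

	inst->regular_epfd = epoll_create1(0);
	if (inst->regular_epfd < 0)
		return -1;

	inst->outer_epfd = epoll_create1(0);
	if (inst->outer_epfd < 0) {
		close(inst->regular_epfd);
		return -1;
	}

	/* The outer epfd reports EPOLLIN once the regular epfd has events. */
	ev.events = EPOLLIN;
	ev.data.fd = inst->regular_epfd;
	if (epoll_ctl(inst->outer_epfd, EPOLL_CTL_ADD,
		      inst->regular_epfd, &ev) < 0) {
		close(inst->outer_epfd);
		close(inst->regular_epfd);
		return -1;
	}
	return 0;
}
```

A caller that adds a real fd to `regular_epfd` and then waits on `outer_epfd` is woken when that real fd becomes ready, which is the behaviour the commit depends on for non-rsocket descriptors.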
This commit implements epoll_ctl with tailored handling for real and
rsocket file descriptors.

For regular file descriptors, epoll_ctl directly operates on the
"regular epfd". For rsocket file descriptors, they are added to a
dedicated list maintained in the epoll instance struct. This list
ensures that the global thread can handle these file descriptors
during its polling cycle.

epoll_ctl then triggers the thread to reprocess the epoll instance so that
the ready list reflects any events on the newly added file descriptors.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
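
The split between real and rsocket file descriptors can be pictured roughly as follows. This is a sketch under stated assumptions, not the patch's code: the struct layout, the fixed-size array, the `fd_is_rsocket` parameter and the errno choices are all hypothetical, and the step that notifies the global thread is omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/epoll.h>

#define SKETCH_MAX_RFDS 64 /* arbitrary limit for the sketch */

struct rfd_entry_sketch {
	int fd;
	struct epoll_event ev;
};

/* Hypothetical per-instance state: one real epfd plus an rsocket list. */
struct ctl_inst_sketch {
	int regular_epfd;
	struct rfd_entry_sketch rfds[SKETCH_MAX_RFDS];
	size_t nrfds;
};

static int epoll_ctl_sketch(struct ctl_inst_sketch *inst, int op, int fd,
			    struct epoll_event *ev, bool fd_is_rsocket)
{
	size_t i;

	/* Real fds go straight to the kernel through the regular epfd. */
	if (!fd_is_rsocket)
		return epoll_ctl(inst->regular_epfd, op, fd, ev);

	/* rsocket fds are tracked in the list only. */
	for (i = 0; i < inst->nrfds; i++)
		if (inst->rfds[i].fd == fd)
			break;

	switch (op) {
	case EPOLL_CTL_ADD:
		if (!ev) { errno = EFAULT; return -1; }
		if (i < inst->nrfds) { errno = EEXIST; return -1; }
		if (inst->nrfds == SKETCH_MAX_RFDS) { errno = ENOSPC; return -1; }
		inst->rfds[inst->nrfds].fd = fd;
		inst->rfds[inst->nrfds].ev = *ev;
		inst->nrfds++;
		return 0;
	case EPOLL_CTL_MOD:
		if (!ev) { errno = EFAULT; return -1; }
		if (i == inst->nrfds) { errno = ENOENT; return -1; }
		inst->rfds[i].ev = *ev;
		return 0;
	case EPOLL_CTL_DEL:
		if (i == inst->nrfds) { errno = ENOENT; return -1; }
		inst->nrfds--;
		inst->rfds[i] = inst->rfds[inst->nrfds]; /* swap-remove */
		return 0;
	default:
		errno = EINVAL;
		return -1;
	}
}
```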
@BatshevaBlack BatshevaBlack force-pushed the rsocket_upstream branch 12 times, most recently from bee544e to 7e7f2b6 Compare February 24, 2026 08:28
This commit implements epoll_wait to retrieve events processed by the
centralized thread for an epoll instance.

When epoll_wait is called, it copies the events collected by the global
thread from the ready list in the epoll instance to the user-provided
events buffer.

If the `revents` field holds no events, the function triggers the thread
to recheck for events. epoll_wait returns the total number of ready
events.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
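
The copy step from the ready list into the caller's buffer might look like the sketch below. It assumes a simple array-backed ready list; the type and function names are hypothetical, and the locking against the global thread as well as the "trigger a recheck" path are deliberately left out.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

#define SKETCH_MAX_READY 64 /* arbitrary capacity for the sketch */

/* Hypothetical ready list filled by the polling thread. */
struct ready_list_sketch {
	struct epoll_event events[SKETCH_MAX_READY];
	int count;
};

/*
 * Copy up to maxevents ready events to the caller and keep the rest
 * queued for the next call. Returns the number of events copied, or
 * -1 with errno set to EINVAL if maxevents is not positive.
 * No locking is shown; a real implementation shares this list with
 * another thread and would need to synchronize access.
 */
static int drain_ready_sketch(struct ready_list_sketch *rl,
			      struct epoll_event *out, int maxevents)
{
	int n;

	if (maxevents <= 0) {
		errno = EINVAL;
		return -1;
	}

	n = rl->count < maxevents ? rl->count : maxevents;
	memcpy(out, rl->events, (size_t)n * sizeof(*out));

	/* Shift any events that did not fit to the front of the list. */
	rl->count -= n;
	memmove(rl->events, rl->events + n,
		(size_t)rl->count * sizeof(*out));
	return n;
}
```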
If poll returns because of a timeout, call rs_poll_exit to clear any
signals that arrived in the meantime.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
Keep the list of fds that are passed to poll, so that when the revents
are copied back to the fds list it is known which fd belongs to each
rfd.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
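
The general idea of remembering which fd sits in each poll slot can be illustrated with a select-over-poll conversion. The sketch below uses plain poll() on kernel fds as a stand-in for the rsocket path and handles only the read and write sets; the function name is hypothetical and this is not the code from the patch.

```c
#include <errno.h>
#include <poll.h>
#include <sys/select.h>

/*
 * Emulate select() for read/write sets on top of poll().
 * pfds[i].fd records which fd each poll slot belongs to, so results
 * can be written back to the right bit in the fd_sets afterwards.
 * Returns the number of set bits, 0 on timeout, or -1 on error.
 */
static int select_via_poll_sketch(int nfds, fd_set *readfds,
				  fd_set *writefds, int timeout_ms)
{
	struct pollfd pfds[FD_SETSIZE];
	int fd, i, n = 0, ready = 0;

	if (nfds < 0 || nfds > FD_SETSIZE) {
		errno = EINVAL;
		return -1;
	}

	/* Build a compact poll array from the sparse fd_sets. */
	for (fd = 0; fd < nfds; fd++) {
		short events = 0;

		if (readfds && FD_ISSET(fd, readfds))
			events |= POLLIN;
		if (writefds && FD_ISSET(fd, writefds))
			events |= POLLOUT;
		if (!events)
			continue;
		pfds[n].fd = fd;
		pfds[n].events = events;
		pfds[n].revents = 0;
		n++;
	}

	if (poll(pfds, (nfds_t)n, timeout_ms) < 0)
		return -1;

	if (readfds)
		FD_ZERO(readfds);
	if (writefds)
		FD_ZERO(writefds);

	/* Map each slot's revents back to the fd it was built from. */
	for (i = 0; i < n; i++) {
		if ((pfds[i].events & POLLIN) &&
		    (pfds[i].revents & (POLLIN | POLLHUP | POLLERR))) {
			FD_SET(pfds[i].fd, readfds);
			ready++;
		}
		if ((pfds[i].events & POLLOUT) &&
		    (pfds[i].revents & (POLLOUT | POLLERR))) {
			FD_SET(pfds[i].fd, writefds);
			ready++;
		}
	}
	return ready;
}
```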
The accept4 implementation extends accept to support the additional
atomic flag-setting functionality provided by accept4.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
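
For readers unfamiliar with accept4, its extra `flags` argument accepts SOCK_NONBLOCK and SOCK_CLOEXEC. One way to layer that on top of an accept-style call is to apply the flags to the new descriptor afterwards, as in the sketch below. The helper name is hypothetical, it uses plain fcntl() on a kernel fd rather than the rsocket equivalents, and unlike the kernel's accept4 this two-step approach is not atomic.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

/*
 * Apply accept4-style flags to an already accepted descriptor.
 * Returns 0 on success, -1 on error with errno set.
 * Not atomic: another thread could fork/exec between the accept
 * and the FD_CLOEXEC update.
 */
static int apply_accept4_flags_sketch(int fd, int flags)
{
	int cur;

	/* accept4 only defines these two flags. */
	if (flags & ~(SOCK_NONBLOCK | SOCK_CLOEXEC)) {
		errno = EINVAL;
		return -1;
	}

	if (flags & SOCK_NONBLOCK) {
		cur = fcntl(fd, F_GETFL);
		if (cur < 0 || fcntl(fd, F_SETFL, cur | O_NONBLOCK) < 0)
			return -1;
	}

	if (flags & SOCK_CLOEXEC) {
		cur = fcntl(fd, F_GETFD);
		if (cur < 0 || fcntl(fd, F_SETFD, cur | FD_CLOEXEC) < 0)
			return -1;
	}
	return 0;
}
```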
@BatshevaBlack BatshevaBlack force-pushed the rsocket_upstream branch 2 times, most recently from 0db3217 to fbb8d04 Compare February 24, 2026 08:45
@BatshevaBlack BatshevaBlack marked this pull request as ready for review February 24, 2026 08:59
@BatshevaBlack BatshevaBlack force-pushed the rsocket_upstream branch 5 times, most recently from 6b5d3bd to 1febaee Compare February 24, 2026 13:20
@BatshevaBlack BatshevaBlack force-pushed the rsocket_upstream branch 6 times, most recently from 7e6559d to e3a3c6d Compare February 25, 2026 08:29
Add preload interception for fcntl64 so rsocket file descriptors
support the same flag semantics as the glibc fcntl64 API.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
getsockopt: add TCP_INFO, TCP_CONGESTION, SO_BROADCAST and IP_TOS.
setsockopt: add IP_TOS and TCP_CONGESTION.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
rfcntl currently keeps all of the file flags together in fd_flags.
Adding the new field fs_flags to the rs struct allows the
fcntl function to keep the file status flags separately
from the file descriptor flags.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
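
The reason for the split is that descriptor flags (F_GETFD/F_SETFD) and file status flags (F_GETFL/F_SETFL) are separate namespaces whose values overlap; on Linux, FD_CLOEXEC and O_WRONLY are both 1. A minimal sketch of keeping them apart follows. The field names fd_flags and fs_flags come from the commit message; the struct, function name and command handling are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical subset of the rs struct: two independent flag words. */
struct rs_flags_sketch {
	int fd_flags; /* file descriptor flags, e.g. FD_CLOEXEC */
	int fs_flags; /* file status flags, e.g. O_NONBLOCK */
};

/* Route each fcntl command to the flag word it actually refers to. */
static int rfcntl_flags_sketch(struct rs_flags_sketch *rs, int cmd, int arg)
{
	switch (cmd) {
	case F_GETFD:
		return rs->fd_flags;
	case F_SETFD:
		rs->fd_flags = arg;
		return 0;
	case F_GETFL:
		return rs->fs_flags;
	case F_SETFL:
		rs->fs_flags = arg;
		return 0;
	default:
		errno = EINVAL;
		return -1;
	}
}
```

With a single shared word, setting FD_CLOEXEC would be indistinguishable from a status flag with the same numeric value; with two words, F_GETFL never reports a descriptor flag and vice versa.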
Add preload interception for sendfile64 so applications using the
64-bit offset sendfile64 API work correctly with rsocket file descriptors.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
Add preload interception for dup so that duplicating an rsocket file
descriptor produces another rsocket fd that refers to the same connection.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
To allow us to respond to disconnect events initiated by
the peer kernel CM, always run the connect service with the TCP
protocol, including when the connection succeeds.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
The changes to rpoll that use a signaling fd to wake up blocked threads,
combined with suspending polling while rsocket states may be changing,
_should_ prevent any thread from blocking indefinitely in rpoll()
when a desired state change occurs.

We periodically wake up any polling thread so that it can recheck its
rsocket states. The sleeping interval was set to an arbitrary value of
5 seconds. That interval is too long for apps that request a connection
and depend on the thread waking up, so it is now 0.5 seconds. The value
can still be overridden using config files.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
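
A default with a file-based override, as described above, could be read along these lines. This is only a sketch: the function name is hypothetical, the file location is passed in by the caller rather than assumed, and the file is assumed to hold a single integer in milliseconds.

```c
#include <stdio.h>

#define SKETCH_DEFAULT_WAKE_UP_MS 500 /* the 0.5 s default from the commit */

/*
 * Return the wake-up interval in milliseconds. If the file at path
 * exists and starts with a positive integer, that value is used;
 * otherwise the default applies.
 */
static int read_wake_up_interval_sketch(const char *path)
{
	int ms = SKETCH_DEFAULT_WAKE_UP_MS;
	int value;
	FILE *f = path ? fopen(path, "r") : NULL;

	if (f) {
		if (fscanf(f, "%d", &value) == 1 && value > 0)
			ms = value;
		fclose(f);
	}
	return ms;
}
```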
Updated type checks to identify socket types even when additional flags
are present in the type field. Changed the comparison to use bitwise AND
for more accurate detection.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
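
On Linux the `type` argument of socket() may carry SOCK_NONBLOCK and SOCK_CLOEXEC ORed into it, so a plain `type == SOCK_STREAM` comparison fails for such callers. One way to handle this, shown here as an illustration rather than as the patch's exact expression, is to mask the flag bits off before comparing. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <sys/socket.h>

/* Flag bits that Linux allows in the socket() type argument. */
#define SKETCH_TYPE_FLAGS (SOCK_NONBLOCK | SOCK_CLOEXEC)

/* Strip the flag bits, leaving only the base socket type. */
static int base_sock_type_sketch(int type)
{
	return type & ~SKETCH_TYPE_FLAGS;
}

static bool is_stream_sketch(int type)
{
	return base_sock_type_sketch(type) == SOCK_STREAM;
}

static bool is_dgram_sketch(int type)
{
	return base_sock_type_sketch(type) == SOCK_DGRAM;
}
```

Masking and then comparing for equality avoids a pitfall of testing `type & SOCK_STREAM` directly: the base types are small integers rather than independent bits (SOCK_RAW is 3 on Linux), so a bare AND would also match SOCK_RAW.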