@hkadayam hkadayam commented Aug 7, 2025

Merged with the main fork, and also made replication an optional feature that can be conditionally compiled out.

sanebay and others added 30 commits August 5, 2025 07:42
When replacing a member, add the new member, sync the raft log
for the replace, and finally remove the old member. Once we add the
new member, baseline or incremental resync will start.
Removing the old member will cause nuraft_mesg to exit
the group, and we periodically GC the destroyed group.
Made the repl dev base test common so that both test files
can use it. Tests by default create a repl group with num_replicas.
Dynamic tests create additional spare replicas which can be
added to the test dynamically by calling replace member.
Sealer is a special consumer that reports where the CP is up to.
It is the first consumer during CP switchover, acting as a conservative marker: everything
at or before this point should be in the current CP, though some consumers may be above this point, which is fine.
Sealer is also the last consumer during CP flush, after all other services have flushed successfully.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
The previous code could overflow the io_size, i.e.

remaining_io_size -= sub_io_size;

where sub_io_size > remaining_io_size; since
remaining_io_size is unsigned, it wraps around to
a huge number and takes ages to finish.
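A minimal sketch of the fix, assuming the loop structure described above (the function name and signature are illustrative, not the actual HomeStore code): clamp the per-iteration size so the unsigned counter can never wrap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative sketch: before the fix, `remaining_io_size -= sub_io_size`
// with sub_io_size > remaining_io_size wraps the unsigned value to ~2^64.
// Clamping the subtrahend guarantees the counter decreases monotonically
// to zero.
uint64_t consume_io(uint64_t remaining_io_size, uint64_t sub_io_size) {
    sub_io_size = std::min(sub_io_size, remaining_io_size);
    return remaining_io_size - sub_io_size;
}
```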

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
We see a no-space error in the write_to_full UT; this might be
because when the space left == max_wrt_sz we take max_wrt_sz,
even though two extra blks are needed.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add replica member info with name, priority and id.
Use replica member info for the replace member API and listener callbacks.
Signed-off-by: Jilong Kou <jkou@ebay.com>
Concurrent writes to m_down_buffers may cause data inconsistency.
Add a mutex lock to IndexBuffer and extract add/remove
operations into member functions to make the vector thread-safe.
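A minimal sketch of the pattern described above, assuming a hypothetical simplified IndexBuffer (the real class has many more members): the vector is private, and all access goes through mutex-guarded member functions.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical, simplified IndexBuffer: callers never touch m_down_buffers
// directly; add/remove/count are member functions that take the mutex,
// making the vector safe under concurrent writers.
struct IndexBuffer {
    void add_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.push_back(buf);
    }

    void remove_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.erase(
            std::remove(m_down_buffers.begin(), m_down_buffers.end(), buf),
            m_down_buffers.end());
    }

    size_t down_buffer_count() {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        return m_down_buffers.size();
    }

private:
    std::mutex m_down_buffers_mtx;
    std::vector<std::shared_ptr<IndexBuffer>> m_down_buffers;
};
```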

Signed-off-by: Jilong Kou <jkou@ebay.com>
* Implement GC_REPL_REQ Based on DSN to Prevent Resource Leaks

This commit introduces a mechanism to garbage collect (GC) replication requests
(rreqs) that may hang indefinitely, thereby consuming memory and disk resources
unnecessarily. These rreqs can enter a hanging state under several
circumstances, as outlined below:

1. Scenario with Delayed Commit:
   - Follower F1 receives LSN 100 and DSN 104 from Leader L1 and takes longer
     than the raft timeout to precommit/commit it.
   - L1 resends LSN 100, causing F1 to fetch the data again. Since LSN 100 was
     committed in a previous attempt, this log entry is skipped, leaving the
     rreq hanging indefinitely.

2. Scenario with Leader Failure Before Data Completion:
   - Follower F1 receives LSN 100 from L1, but before all data is fetched/pushed,
     L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 with L2 as the new originator. F1 proceeds with the new
     rreq and commits it, but the initial rreq from L1 hangs indefinitely as it
     cannot fetch data from the new leader L2.

3. Scenario with Leader Failure After Data Write:
   - Follower F1 receives data (DSN 104) from L1 and writes it. Before the log of
     LSN 100 reaches F1, L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 to F1, and F1 fetches DSN 104 from L2, leaving the
     original rreq hanging.

This garbage collection process cleans up based on DSN. Any rreqs in
`m_repl_key_req_map`, whose DSN is already committed (`rreq->dsn <
repl_dev->m_next_dsn`), will be GC'd. This is safe on the follower side, as the
follower updates `m_next_dsn` during commit. Any DSN below `cur_dsn` should
already be committed, implying that the rreq should already be removed from
`m_repl_key_req_map`.

On the leader side, since `m_next_dsn` is updated when sending out the proposal,
it is not safe to clean up based on `m_next_dsn`. Therefore, we explicitly skip
the leader in this GC process.
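The GC rule above can be sketched as follows, using hypothetical simplified types (the real repl_dev and rreq carry much more state): erase any rreq whose DSN is below `m_next_dsn`, and skip the leader entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical, simplified types illustrating the DSN-based GC. On a
// follower, any rreq with dsn < m_next_dsn is already committed, so its
// entry in the request map is a leak and can be erased. The leader bumps
// m_next_dsn at proposal time, not commit time, so it is skipped.
struct repl_req { uint64_t dsn; };

struct ReplDev {
    bool is_leader{false};
    uint64_t m_next_dsn{0};
    std::map<uint64_t, std::shared_ptr<repl_req>> m_repl_key_req_map; // keyed by dsn

    size_t gc_repl_reqs() {
        if (is_leader) return 0; // not safe to GC by m_next_dsn on the leader
        size_t removed{0};
        for (auto it = m_repl_key_req_map.begin(); it != m_repl_key_req_map.end();) {
            if (it->second->dsn < m_next_dsn) {
                it = m_repl_key_req_map.erase(it); // hanging rreq: already committed
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }
};
```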

Skip localizing raft logs we have already committed.

The leader may send duplicate raft logs; if we localize them
unconditionally, duplicate data will be written to the chunk during
fetch_data.

It is safe for us to skip logs that are already committed, since
there is no way those LSNs can be overwritten.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Data buffers persist in memory until the rreq is committed or rolled back.

This approach poses issues during recovery: as new data arrives via
push_data and is written to disk, it remains in memory for an extended
period until the replica catches up and commits the rreq.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
* add rollback on state machine

---------

Signed-off-by: yawzhang <yawzhang@ebay.com>
* PushData only pushes to active followers.

If a follower is lagging too far behind, do not flood it with data
from new IOs (new rreqs, new LSNs); reserve that capacity for
catching up, so the lagging follower can request data via FetchData.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
When a follower hits an error before appending log entries, it sets batch_size_hint_in_bytes to -1 to ask the leader not to send more log entries in the next append_log_req.
https://github.com/eBay/NuRaft/blob/eabdeeda538a27370943f79a2b08b5738b697ac3/src/handle_append_entries.cxx#L760

In the nuobject case, if a new member is added to a raft group, it may try to append a create_shard log entry, which tries to allocate a block from the chunks of the pg, before the create_pg log (which allocates chunks to this pg) is committed. An error then occurs, the whole log batch containing the create_shard log entry is rejected, and batch_size_hint_in_bytes is set to -1 in the response to the leader.

This PR aims to set the log count of the next batch sent to the follower to 1, so that:

If create_pg and create_shard are in the same log batch, the follower will first reject this log batch and the leader will send only create_pg in the next batch, which will be accepted by the follower, since it only creates the pg.

If create_pg and create_shard are not in the same log batch, and create_shard tries to allocate blocks before its pg is created (i.e., before the chunks of this pg are allocated), then with this PR the follower will reject the batch, giving the pg creation more time. The create_shard log will be resent in the next batch, by which time the pg has probably already been created successfully.
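The hint policy above can be sketched as follows. This is an illustrative model only (the struct and function names are hypothetical, not NuRaft's actual API): on a rejected batch, reply with a hint of one entry instead of -1, so a dependent entry is never bundled with its prerequisite.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the follower-side hint policy described above.
// Convention assumed here: 0 = no hint, -1 = send nothing more,
// n > 0 = send at most n log entries in the next append_log_req.
struct BatchHint {
    int64_t next_batch_size_hint{0};
};

BatchHint on_batch_rejected(bool follower_rejected) {
    BatchHint h;
    if (follower_rejected) {
        // Instead of -1 (stop sending entirely), ask the leader to send
        // exactly one log entry, so create_pg can commit before the
        // dependent create_shard entry arrives.
        h.next_batch_size_hint = 1;
    }
    return h;
}
```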
We don't need to panic in this case; fetchData can handle it.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add application_hint to the blk_alloc_hints structure. This change addresses the need for certain users of homestore, such as homeobject, to pass additional hints. The application_hint can be used to specify behavior in the select_chunk interface.
1. Consume the nuraft::cb_func::Type::RemovedFromCluster callback.
2. Add a reset function to allocator/vchunk as preparation for implementing m_listener->on_destroy().
* release data before setting m_data_written_promise
authored-by: yawzhang <yawzhang@ebay.com>
…e` with num_chunks or chunk_size.

Prioritize `num_chunks` over `chunk_size` if both are provided.
* Support Baseline resync

For NuRaft baseline resync, we separate the process into two layers: the HomeStore layer and the Application layer.
We use the first bit of the obj_id to indicate the message type: 0 is for HS, 1 is for Application.

In the HomeStore layer, the leader needs to transmit the DSN to the follower. This is intended to handle the following case:

1. Leader sends a snapshot at LSN T1 to follower F1.
2. F1 fully receives the snapshot and is now at T1.
3. Leader yields its leadership, and F1 is elected as leader.

In this sequence, incremental resync will not have kicked in to update m_next_dsn, and as a result, duplication may occur.
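The obj_id tagging can be sketched like this. Assumptions are labeled in the code: the exact bit position and field layout are not stated in the PR, so the most significant bit of a 64-bit obj_id is used here purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative encoding only: the PR says "the first bit of the obj_id"
// distinguishes HomeStore-layer (0) from Application-layer (1) messages;
// the MSB of a 64-bit id is assumed here, which may differ from the
// actual layout.
constexpr uint64_t APP_BIT = 1ULL << 63;

uint64_t make_obj_id(bool is_application, uint64_t seq) {
    return (is_application ? APP_BIT : 0ULL) | (seq & ~APP_BIT);
}

bool is_application_msg(uint64_t obj_id) { return (obj_id & APP_BIT) != 0; }

uint64_t obj_seq(uint64_t obj_id) { return obj_id & ~APP_BIT; }
```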
yuwmao and others added 27 commits August 5, 2025 09:11
Add support for async write data, journal, and alloc blks for the solo
repl dev. Raft repl dev doesn't support these operations.
This is needed for nublocks, which needs to write
free blkids to the journal as well. Free blocks are obtained
after writing the new blkids to the index. Add APIs for allocation
and write for a vector of blkids. Raft repl dev currently uses only
a single blkid. Test solo repl dev changes to support a vector of blkids.
…/destroy ra… (eBay#715)

* Fix the periodic log cancel_timer issue and the solo repl dev init/destroy race issue
Use submit_io_batch when part_of_batch is set to true for read/write.
This PR has the following big changes:

1. Introduce multiple index support, so that HomeStore can actually have
different types of index stores.

2. Introduce a new btree called CopyOnWrite btree: unlike the in-place btree,
btree pages are not written in place but to a different location, while a
map is maintained to track them.

3. Make the public interfaces very concise (having a BtreeBase and putting
the rest in the implementation).

4. Simplified the btree APIs.

5. Used the latest sisl 13.x with REGISTER_LOG_MODS.

6. Added a COW btree crash test; updated other tests to ensure they pass.
This PR has the following big changes:
* COWBtree recovery test cases with variable CPs, and fixes

* Added a COW btree crash test; updated other tests to ensure they pass

* Btree node allocators and variants

* Multiple BtreeNode fixes