@hkadayam hkadayam commented Aug 7, 2025

Merged with the main fork, and also made replication an optional feature that can be conditionally compiled out.

sanebay and others added 30 commits August 5, 2025 07:42
When replacing a member, add the new member, sync the raft log
for the replace, and finally remove the old member. Once we add the
new member, baseline or incremental resync will start.
Removing the old member will cause nuraft_mesg to exit
the group, and we periodically GC the destroyed group.
Made the repl dev base test common so that both test files
can use it. Tests by default create a repl group with num_replicas.
Dynamic tests create additional spare replicas which can be
added to the test dynamically by calling replace member.
Sealer is a special consumer that reports where the CP is up to.
It is the first consumer during CP switchover, acting as a conservative marker: everything
at or before this point should be in the current CP, though some consumers may be above this point, which is fine.
Sealer is also the last consumer during CP flush, after all other services have flushed successfully.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
The previous code could overflow the io_size, i.e.

remaining_io_size -= sub_io_size;

where sub_io_size > remaining_io_size; since
remaining_io_size is unsigned, it wraps around to
a huge number and takes ages to finish.
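A minimal sketch of the fix, assuming the loop structure described above (the function name and signature are illustrative, not the actual HomeStore code): clamp the per-iteration size so the unsigned counter can never wrap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative sketch: before the fix, `remaining_io_size -= sub_io_size`
// with sub_io_size > remaining_io_size wraps the unsigned value to ~2^64.
// Clamping the subtrahend guarantees the counter decreases monotonically
// to zero.
uint64_t consume_io(uint64_t remaining_io_size, uint64_t sub_io_size) {
    sub_io_size = std::min(sub_io_size, remaining_io_size);
    return remaining_io_size - sub_io_size;
}
```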

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
We see a no-space error in the write_to_full UT; this might be
because when the space left == max_wrt_sz we take max_wrt_sz,
even though two extra blks are needed.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add replica member info with name, priority and id.
Use replica member info for the replace member API and listener callbacks.
Signed-off-by: Jilong Kou <jkou@ebay.com>
Concurrent writes to m_down_buffers may cause data inconsistency.
Add a mutex lock to IndexBuffer and extract add/remove
operations into member functions to make the vector thread-safe.
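A minimal sketch of the pattern described above, assuming a hypothetical simplified IndexBuffer (the real class has many more members): the vector is private, and all access goes through mutex-guarded member functions.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical, simplified IndexBuffer: callers never touch m_down_buffers
// directly; add/remove/count are member functions that take the mutex,
// making the vector safe under concurrent writers.
struct IndexBuffer {
    void add_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.push_back(buf);
    }

    void remove_down_buffer(const std::shared_ptr<IndexBuffer>& buf) {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        m_down_buffers.erase(
            std::remove(m_down_buffers.begin(), m_down_buffers.end(), buf),
            m_down_buffers.end());
    }

    size_t down_buffer_count() {
        std::lock_guard<std::mutex> lg{m_down_buffers_mtx};
        return m_down_buffers.size();
    }

private:
    std::mutex m_down_buffers_mtx;
    std::vector<std::shared_ptr<IndexBuffer>> m_down_buffers;
};
```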

Signed-off-by: Jilong Kou <jkou@ebay.com>
* Implement GC_REPL_REQ Based on DSN to Prevent Resource Leaks

This commit introduces a mechanism to garbage collect (GC) replication requests
(rreqs) that may hang indefinitely, thereby consuming memory and disk resources
unnecessarily. These rreqs can enter a hanging state under several
circumstances, as outlined below:

1. Scenario with Delayed Commit:
   - Follower F1 receives LSN 100 and DSN 104 from Leader L1 and takes longer
     than the raft timeout to precommit/commit it.
   - L1 resends LSN 100, causing F1 to fetch the data again. Since LSN 100 was
     committed in a previous attempt, this log entry is skipped, leaving the
     rreq hanging indefinitely.

2. Scenario with Leader Failure Before Data Completion:
   - Follower F1 receives LSN 100 from L1, but before all data is fetched/pushed,
     L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 with L2 as the new originator. F1 proceeds with the new
     rreq and commits it, but the initial rreq from L1 hangs indefinitely as it
     cannot fetch data from the new leader L2.

3. Scenario with Leader Failure After Data Write:
   - Follower F1 receives data (DSN 104) from L1 and writes it. Before the log of
     LSN 100 reaches F1, L1 fails and L2 becomes the new leader.
   - L2 resends LSN 100 to F1, and F1 fetches DSN 104 from L2, leaving the
     original rreq hanging.

This garbage collection process cleans up based on DSN. Any rreqs in
`m_repl_key_req_map`, whose DSN is already committed (`rreq->dsn <
repl_dev->m_next_dsn`), will be GC'd. This is safe on the follower side, as the
follower updates `m_next_dsn` during commit. Any DSN below `cur_dsn` should
already be committed, implying that the rreq should already be removed from
`m_repl_key_req_map`.

On the leader side, since `m_next_dsn` is updated when sending out the proposal,
it is not safe to clean up based on `m_next_dsn`. Therefore, we explicitly skip
the leader in this GC process.
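The GC rule above can be sketched as follows, using hypothetical simplified types (the real repl_dev and rreq carry much more state): erase any rreq whose DSN is below `m_next_dsn`, and skip the leader entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical, simplified types illustrating the DSN-based GC. On a
// follower, any rreq with dsn < m_next_dsn is already committed, so its
// entry in the request map is a leak and can be erased. The leader bumps
// m_next_dsn at proposal time, not commit time, so it is skipped.
struct repl_req { uint64_t dsn; };

struct ReplDev {
    bool is_leader{false};
    uint64_t m_next_dsn{0};
    std::map<uint64_t, std::shared_ptr<repl_req>> m_repl_key_req_map; // keyed by dsn

    size_t gc_repl_reqs() {
        if (is_leader) return 0; // not safe to GC by m_next_dsn on the leader
        size_t removed{0};
        for (auto it = m_repl_key_req_map.begin(); it != m_repl_key_req_map.end();) {
            if (it->second->dsn < m_next_dsn) {
                it = m_repl_key_req_map.erase(it); // hanging rreq: already committed
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }
};
```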

Skip localizing raft logs we have already committed.

The leader may send duplicate raft logs; if we localize them
unconditionally, duplicate data will be written to the chunk during
fetch_data.

It is safe for us to skip logs that are already committed, since
there is no way those LSNs can be overwritten.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Data buffers persist in memory until the rreq is committed or rolled back.

This approach poses issues during recovery: as new data arrives via
push_data and is written to disk, it remains in memory for an extended
period until the replica catches up and commits the rreq.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
* add rollback on state machine

---------

Signed-off-by: yawzhang <yawzhang@ebay.com>
* PushData only pushes to active followers.

If a follower is lagging too far behind, do not flood it with data
from new IOs (new rreqs, new LSNs); reserve that capacity for
catching up, so the lagging follower can request data via FetchData.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
When a follower hits an error before appending log entries, it sets batch_size_hint_in_bytes to -1 to ask the leader not to send more log entries in the next append_log_req.
https://github.com/eBay/NuRaft/blob/eabdeeda538a27370943f79a2b08b5738b697ac3/src/handle_append_entries.cxx#L760

In the nuobject case, if a new member is added to a raft group, it may try to append a create_shard log entry, which tries to allocate a block from the chunks of the pg, before the create_pg log (which allocates chunks to this pg) is committed. An error then occurs, the whole log batch containing the create_shard log entry is rejected, and batch_size_hint_in_bytes is set to -1 in the response to the leader.

This PR aims to set the log count of the next batch sent to the follower to 1, so that:

If create_pg and create_shard are in the same log batch, the follower will first reject this log batch and the leader will send only create_pg in the next batch, which will be accepted by the follower, since it only creates the pg.

If create_pg and create_shard are not in the same log batch, and create_shard tries to allocate blocks before its pg is created (i.e., before the chunks of this pg are allocated), then with this PR the follower will reject the batch, giving the pg creation more time. The create_shard log will be resent in the next batch, by which time the pg has probably already been created successfully.
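The hint policy above can be sketched as follows. This is an illustrative model only (the struct and function names are hypothetical, not NuRaft's actual API): on a rejected batch, reply with a hint of one entry instead of -1, so a dependent entry is never bundled with its prerequisite.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the follower-side hint policy described above.
// Convention assumed here: 0 = no hint, -1 = send nothing more,
// n > 0 = send at most n log entries in the next append_log_req.
struct BatchHint {
    int64_t next_batch_size_hint{0};
};

BatchHint on_batch_rejected(bool follower_rejected) {
    BatchHint h;
    if (follower_rejected) {
        // Instead of -1 (stop sending entirely), ask the leader to send
        // exactly one log entry, so create_pg can commit before the
        // dependent create_shard entry arrives.
        h.next_batch_size_hint = 1;
    }
    return h;
}
```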
We don't need to panic in this case; fetchData can handle it.

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add application_hint to the blk_alloc_hints structure. This change addresses the need for certain users of homestore, such as homeobject, to pass additional hints. The application_hint can be used to specify behavior in the select_chunk interface.
1. Consume the nuraft::cb_func::Type::RemovedFromCluster callback.
2. Add a reset function to allocator/vchunk as preparation for implementing m_listener->on_destroy().
* release data before setting m_data_written_promise
authored-by: yawzhang <yawzhang@ebay.com>
…e` with num_chunks or chunk_size.

Prioritize `num_chunks` over `chunk_size` if both are provided.
* Support Baseline resync

For NuRaft baseline resync, we separate the process into two layers: the HomeStore layer and the Application layer.
We use the first bit of the obj_id to indicate the message type: 0 is for HS, 1 is for Application.

In the HomeStore layer, the leader needs to transmit the DSN to the follower. This is intended to handle the following case:

1. Leader sends a snapshot at LSN T1 to follower F1.
2. F1 fully receives the snapshot and is now at T1.
3. Leader yields its leadership, and F1 is elected as leader.

In this sequence, incremental resync will not have kicked in to update m_next_dsn, and as a result, duplication may occur.
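The obj_id tagging can be sketched like this. Assumptions are labeled in the code: the exact bit position and field layout are not stated in the PR, so the most significant bit of a 64-bit obj_id is used here purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative encoding only: the PR says "the first bit of the obj_id"
// distinguishes HomeStore-layer (0) from Application-layer (1) messages;
// the MSB of a 64-bit id is assumed here, which may differ from the
// actual layout.
constexpr uint64_t APP_BIT = 1ULL << 63;

uint64_t make_obj_id(bool is_application, uint64_t seq) {
    return (is_application ? APP_BIT : 0ULL) | (seq & ~APP_BIT);
}

bool is_application_msg(uint64_t obj_id) { return (obj_id & APP_BIT) != 0; }

uint64_t obj_seq(uint64_t obj_id) { return obj_id & ~APP_BIT; }
```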
yuwmao and others added 27 commits August 5, 2025 09:11
Add support for async write data, journal, and alloc blks for the solo
repl dev. Raft repl dev doesn't support these operations.
This is needed for nublocks, which needs to write
free blkids to the journal as well. Free blocks are obtained
after writing the new blkids to the index. Add APIs for allocation
and write for a vector of blkids. Raft repl dev currently uses only
a single blkid. Test solo repl dev changes to support a vector of blkids.
…/destroy ra… (eBay#715)

* Fix the periodic log cancel_timer issue and the solo repl dev init/destroy race issue
Use submit_io_batch when part_of_batch is set to true for read/write.
This PR has the following big changes:

1. Introduce multiple index support, so that HomeStore can actually have
different types of index stores.

2. Introduce a new btree called CopyOnWrite btree: unlike the in-place btree,
btree pages are not written in place but to a different location, while a
map is maintained to track them.

3. Make the public interfaces very concise (having a BtreeBase and putting
the rest in the implementation).

4. Simplified the btree APIs.

5. Used the latest sisl 13.x with REGISTER_LOG_MODS.

6. Added a COW btree crash test; updated other tests to ensure they pass.
This PR has the following big changes:
* COWBtree recovery test cases with variable CPs, and fixes

* Added a COW btree crash test; updated other tests to ensure they pass

* Btree node allocators and variants

* Multiple BtreeNode fixes