forked from eBay/HomeStore
Merge #7
Open: hkadayam wants to merge 130 commits into master from merge
Conversation
When replacing a member, add the new member, sync the raft log for the replace, and finally remove the old member. Once the new member is added, a baseline or incremental resync starts. Removing the old member causes nuraft mesg to exit the group, and we periodically garbage collect the destroyed group. Made the repl dev base test common so that both test files can use it. By default, tests create a repl group with num_replicas members; dynamic tests create additional spare replicas that can be added to the test dynamically by calling replace member.
Sealer is a special consumer that reports where the CP is up to. It is the first consumer during CP switch-over, acting as a conservative marker: everything at or before this point should be in the current CP, while some consumers may be above this point, which is fine. Sealer is also the last consumer during CP flush, running after all other services have flushed successfully. Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
The previous code could underflow the io_size: in `remaining_io_size -= sub_io_size;`, when sub_io_size > remaining_io_size, the unsigned remaining_io_size wraps around to a huge number and takes ages to finish. Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
We see a no-space error in the write_to_full UT, likely because when the space left == max_wrt_sz we take max_wrt_sz, yet two extra blks are needed. Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add replica member info with name, priority, and id. Use replica member info for the replace member API and listener callbacks.
Signed-off-by: Jilong Kou <jkou@ebay.com>
Concurrent writes to m_down_buffers may cause data inconsistency. Add a mutex lock to IndexBuffer and extract the add/remove operations into member functions to make the vector thread-safe. Signed-off-by: Jilong Kou <jkou@ebay.com>
* Implement GC_REPL_REQ Based on DSN to Prevent Resource Leaks
This commit introduces a mechanism to garbage collect (GC) replication requests
(rreqs) that may hang indefinitely, thereby consuming memory and disk resources
unnecessarily. These rreqs can enter a hanging state under several
circumstances, as outlined below:
1. Scenario with Delayed Commit:
- Follower F1 receives LSN 100 and DSN 104 from Leader L1 and takes longer
than the raft timeout to precommit/commit it.
- L1 resends LSN 100, causing F1 to fetch the data again. Since LSN 100 was
committed in a previous attempt, this log entry is skipped, leaving the
rreq hanging indefinitely.
2. Scenario with Leader Failure Before Data Completion:
- Follower F1 receives LSN 100 from L1, but before all data is fetched/pushed,
L1 fails and L2 becomes the new leader.
- L2 resends LSN 100 with L2 as the new originator. F1 proceeds with the new
rreq and commits it, but the initial rreq from L1 hangs indefinitely as it
cannot fetch data from the new leader L2.
3. Scenario with Leader Failure After Data Write:
- Follower F1 receives data (DSN 104) from L1 and writes it. Before the log of
LSN 100 reaches F1, L1 fails and L2 becomes the new leader.
- L2 resends LSN 100 to F1, and F1 fetches DSN 104 from L2, leaving the
original rreq hanging.
This garbage collection process cleans up based on DSN. Any rreqs in
`m_repl_key_req_map`, whose DSN is already committed (`rreq->dsn <
repl_dev->m_next_dsn`), will be GC'd. This is safe on the follower side, as the
follower updates `m_next_dsn` during commit. Any DSN below `cur_dsn` should
already be committed, implying that the rreq should already be removed from
`m_repl_key_req_map`.
On the leader side, since `m_next_dsn` is updated when sending out the proposal,
it is not safe to clean up based on `m_next_dsn`. Therefore, we explicitly skip
the leader in this GC process.
Skip localizing raft logs we have already committed.
The leader may send duplicate raft logs; if we localize them
unconditionally, duplicate data will be written to the chunk during
fetch_data.
It is safe to skip logs that are already committed,
as there is no way those LSNs can be overwritten.
Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
The data buffer persists in memory until the rreq is committed or rolled back. This approach poses issues during recovery: as new data arrives via push_data and is written to disk, it remains in memory for an extended period until the replica catches up and commits the rreq. Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
* Add rollback on the state machine. Signed-off-by: yawzhang <yawzhang@ebay.com>
* PushData only pushes to active followers. If a follower is lagging too far, do not flood it with data from new IOs (new rreqs, new LSNs); reserve that capacity for catching up, and let the follower request data via FetchData. Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
When a follower hits an error before appending log entries, it sets batch_size_hint_in_bytes to -1 to ask the leader not to send more log entries in the next append_log_req (see https://github.com/eBay/NuRaft/blob/eabdeeda538a27370943f79a2b08b5738b697ac3/src/handle_append_entries.cxx#L760).
In the nuobject case, if a new member is added to a raft group and tries to append a create_shard log entry, it will try to allocate a block from the chunks of the pg before the create_pg log (which allocates chunks to this pg) is committed. An error occurs, the log batch containing the create_shard entry is rejected as a whole, and batch_size_hint_in_bytes is set to -1 in the response to the leader.
This PR sets the log count of the next batch sent to the follower to 1, so that: if create_pg and create_shard are in the same log batch, the follower first rejects the batch and the leader sends only create_pg in the next batch, which the follower accepts since it only creates the pg; if create_pg and create_shard are not in the same log batch, and create_shard tries to allocate a block before its pg is created (i.e. before chunks are allocated to the pg), the follower rejects the batch, giving more time for pg creation. The create_shard log is resent in the next batch, by which time the pg has probably already been created successfully.
We don't need to panic in this case; FetchData can handle it. Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
Add application_hint to the blk_alloc_hints structure. This change addresses the need for certain users of homestore, such as homeobject, to pass additional hints. The application_hint can be used to specify behavior in the select_chunk interface.
1. Consume the nuraft::cb_func::Type::RemovedFromCluster callback. 2. Add a reset function to allocator/vchunk as preparation for implementing m_listener->on_destroy().
* Release data before setting m_data_written_promise. Authored-by: yawzhang <yawzhang@ebay.com>
…e` with num_chunks or chunk_size. Prioritize `num_chunks` over `chunk_size` if both are provided.
* Support baseline resync. For NuRaft baseline resync, we separate the process into two layers: the HomeStore layer and the Application layer. We use the first bit of the obj_id to indicate the message type: 0 for HS, 1 for Application. In the HomeStore layer, the leader needs to transmit the DSN to the follower. This handles the following case: 1. The leader sends a snapshot at LSN T1 to follower F1. 2. F1 fully receives the snapshot and is now at T1. 3. The leader yields its leadership and F1 is elected leader. In this sequence, incremental resync does not kick in to update m_next_dsn, and as a result, duplication may occur.
…he latest committed lsn to the upper layer (eBay#703)
Add support for async write data, journal, and alloc blks for the solo repl dev. The raft repl dev doesn't support these operations. This is needed for nublocks, which needs to write freed blkids to the journal as well; free blocks are obtained after writing the new blkids to the index. Add APIs for allocation and write for a vector of blkids (the raft repl dev currently uses only a single blkid). Test the solo repl dev changes to support a vector of blkids.
…/destroy ra… (eBay#715) * Fix log periodic cancel_timer issue and solo repl dev init/destroy race issue
* Wait on cancel_timer during stop logdev
Use submit_io_batch when part_of_batch is set to true for read/write.
This PR has the following big changes:
1. Introduce multiple index support, so that HomeStore can actually have different types of index stores.
2. Introduce a new btree called the CopyOnWrite btree: unlike the in-place btree, pages are not written in place but to a different location, with a map maintained between them.
3. Make the public interfaces very concise (a BtreeBase, with the details in the implementation).
4. Simplified the btree APIs.
5. Used the latest sisl 13.x with REGISTER_LOG_MODS.
6. Added a COW btree crash test and updated other tests to ensure they pass.
This PR has the following big changes:
* COWBtree recovery test cases with variable CPs, and fixes
* Added COW btree crash test, updated other tests to ensure they pass
* Btree node allocators and variants
* Multiple BtreeNode fixes
Merged with the main fork, and also made replication an optional feature that can be conditionally compiled out.