HDDS-14239. Simplify Prepare Batch looping for DeleteRange Op processing in RDBBatchOperation #9553

swamirishi wants to merge 4 commits into apache:master from
Conversation
Force-pushed 891711f to 78e1c8d
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@swamirishi, are you working on this?
…nges and simple logic

Change-Id: I7a06f5a427ba7d1c4f760ff70fcb4093abaa3cbd

Force-pushed 78e1c8d to 8ac6358
@szetszwo @jojochuang Can you take a look at this change?

@swamirishi , in our current usage, is it the case that DeleteRange won't mix with other

No, that is not the case @szetszwo; it completely depends on the transactions that happen together in the same double buffer. This sort of intersection can happen even within the same DirectoryPurgeRequest; see #9423: a DirectoryDeletingService purge-path request could be opening up a deleted sub-directory (moving it from DirectoryTable to deletedDirectoryTable) and also deleting the same entry from deletedDirectoryTable, because of optimizeDirDeleteAndSubmitRequest, all in the same transaction.

This PR #9690 improves the efficiency of finding the intersection from O(n) to O(log n).
@swamirishi , I agree. I mean that the current code does not have such transactions and we are not going to support such code.

Let's talk about correctness before talking about efficiency. As you mentioned in the description, the previous implementation (quoted below) indeed has a bug. So this change is very risky.

It was not a bug, just inefficient. We are good as long as the order of operations is correct.

We had done it the earlier way consciously, since DeleteRange was only being used for snapshot create, and there was going to be no intersection of a DeleteRange with any operation beyond one transaction, because snapshot create splits the double buffer batch. The change proposed here is for more generic use cases, like @aryangupta1998's PR change, which is also a safe change. The earlier implementation was O(n + k), and with this PR we make it O(n * k), where n is the number of single-key ops and k is the number of delete-range operations in a batch (P.S. we don't anticipate having many delete ranges in one batch; k should ideally be much smaller than n). The next PR brings this down to O(n log k).
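The O(log k) lookup mentioned above relies on the delete ranges being kept sorted and non-overlapping. A minimal standalone sketch of that idea, using String keys and `TreeMap.floorEntry` for illustration (the real Ozone code works on `byte[]` keys and is structured differently):

```java
import java.util.Map;
import java.util.TreeMap;

class RangeLookupSketch {
  // Sorted, non-overlapping delete ranges: startKey -> endKey (end exclusive).
  private final TreeMap<String, String> ranges = new TreeMap<>();

  void addRange(String start, String end) {
    ranges.put(start, end);
  }

  // O(log k): find the greatest range start <= key, then check the end bound.
  boolean isCovered(String key) {
    Map.Entry<String, String> e = ranges.floorEntry(key);
    return e != null && key.compareTo(e.getValue()) < 0;
  }

  public static void main(String[] args) {
    RangeLookupSketch s = new RangeLookupSketch();
    s.addRange("key2", "key5");
    System.out.println(s.isCovered("key3")); // true: key2 <= key3 < key5
    System.out.println(s.isCovered("key5")); // false: end key is exclusive
  }
}
```

With n single-key ops each doing one such lookup, the overall cost becomes O(n log k) instead of the O(n * k) linear scan.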
@swamirishi , sorry, this does not look like a safe change since it dramatically changes RDBBatchOperation. In particular, this PR adds the new

That is not even possible. It would mean the double buffer has to understand how the DBResponse interface is going to flush the transaction to a db. One transaction can contain put, delete, and delete-range operations. We can never guarantee one transaction is going to have only delete-range operations. All of them need to be flushed in the same RocksDB batch for atomicity guarantees.

Today, we don't have batch DeleteRange and everything works. If we add pure batch DeleteRange tomorrow, would anything stop working?
We don't have batch deleteRange because we reverted #8774, and snapshot create was using the delete-range operation. Without this patch we are going to kill snapshot create performance and the optimization that improves RocksDB performance in the case of huge deletes.

I see your point. How about extending RDBBatchOperation in a subclass, say RDBBatchOperationWithDeleteRange? The new code goes into RDBBatchOperationWithDeleteRange while keeping RDBBatchOperation unchanged. In case there is a bug, people not using OM snapshot won't be affected.

Are you saying we should have a separate implementation and make it configurable? Then it would mean we would have to branch the logic in OmResponse, where it would have to understand which implementation has been passed.

Configurable is a good idea:

No. We could add a RDBBatchOperationFactory.
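A minimal sketch of that factory idea; the class names follow the discussion above, but the bodies here are illustrative stand-ins, not the real Ozone classes:

```java
// Stand-in for the existing batch operation; the real class wraps a RocksDB
// write batch with put/delete caching.
class RDBBatchOperation {
  public void delete(String key) { /* enqueue a single-key delete */ }
}

// Hypothetical subclass holding all the new delete-range logic, so callers
// that never use delete ranges keep the unchanged base-class behavior.
class RDBBatchOperationWithDeleteRange extends RDBBatchOperation {
  public void deleteRange(String start, String end) { /* enqueue a range delete */ }
}

final class RDBBatchOperationFactory {
  private RDBBatchOperationFactory() { }

  // Callers pick the implementation once, here, instead of branching in
  // response-handling code.
  static RDBBatchOperation create(boolean withDeleteRange) {
    return withDeleteRange
        ? new RDBBatchOperationWithDeleteRange()
        : new RDBBatchOperation();
  }
}
```

The point of the factory is that only the construction site decides which implementation is used; downstream code handles the base type.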
```diff
 <ranger.version>2.7.0</ranger.version>
 <!-- versions included in ratis-thirdparty, update in sync -->
-<ratis-thirdparty.grpc.version>1.75.0</ratis-thirdparty.grpc.version>
+<ratis-thirdparty.grpc.version>1.77.0</ratis-thirdparty.grpc.version>
```
The bump doesn't look to be required in this PR?
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@swamirishi Is this ready for review? Is it marked as draft only for CI reasons?
Pull request overview

This PR simplifies and optimizes RocksDB batch preparation when DeleteRange operations are present, and exposes deleteRangeWithBatch through the Table API so callers (e.g., OM snapshot cleanup) can use native RocksDB range deletes instead of iterating keys.

Changes:
- Add Table#deleteRangeWithBatch(...) and implement it across TypedTable, RDBTable, and wrappers.
- Extend RDBBatchOperation to support deleteRange(...) and to discard earlier single-key ops that are made redundant by later delete-range ops.
- Update OM snapshot bucket-prefix cleanup to use a batch delete-range; add/adjust unit tests for delete-range behavior and op ordering.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| pom.xml | Bumps ratis-thirdparty gRPC version. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java | Switches prefix deletions to deleteRangeWithBatch instead of iterating keys. |
| hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/Table.java | Adds new deleteRangeWithBatch API to Table. |
| hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/TypedTable.java | Implements deleteRangeWithBatch by delegating to raw table with encoded keys. |
| hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBTable.java | Implements deleteRangeWithBatch via RDBBatchOperation.deleteRange. |
| hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RocksDatabase.java | Adds ColumnFamily.batchDeleteRange(...) helper. |
| hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBBatchOperation.java | Adds delete-range op type and updates batch preparation to account for delete-range redundancy. |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/metadata/DatanodeTable.java | Wires through the new deleteRangeWithBatch API in the wrapper. |
| hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/InMemoryTestTable.java | Adds the new method (unsupported in this in-memory test table). |
| hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBTableStore.java | Adds tests validating delete-range batch semantics and ordering. |
| hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBBatchOperation.java | Updates batch-operation tests to include delete-range behavior and randomized coverage. |
```diff
 String getCommitString() {
   int putCount = 0;
   int delCount = 0;
   int opSize = 0;
   int discardedCount = 0;
   int discardedSize = 0;
   int delRangeCount = 0;

   for (FamilyCache f : name2cache.values()) {
     putCount += f.putCount;
     delCount += f.delCount;
     opSize += f.batchSize;
     discardedCount += f.discardedCount;
     discardedSize += f.discardedSize;
     delRangeCount += f.delRangeCount;
   }

   final int opCount = putCount + delCount;
   return String.format(
-      "#put=%s, #del=%s, batchSize: %s, discarded: %s, committed: %s",
-      putCount, delCount,
+      "#put=%s, #del=%s, #delRange=%s, batchSize: %s, discarded: %s, committed: %s",
+      putCount, delCount, delRangeCount,
       countSize2String(opCount, opSize),
       countSize2String(discardedCount, discardedSize),
       countSize2String(opCount - discardedCount, opSize - discardedSize));
```
FamilyCache accounting/logging does not include delete-range ops: deleteRange(...) doesn't update batchSize, and getCommitString() computes opCount as putCount + delCount (excluding delRangeCount). This makes commit stats (and the discarded-changes warning) inaccurate, especially for batches containing only delete-range operations. Track delete-range sizes in batchSize and include delRangeCount in the operation counts used for logging.
```java
void deleteRange(byte[] startKey, byte[] endKey) {
  delRangeCount++;
  ops.put(opIndex.getAndIncrement(), new DeleteRangeOperation(startKey, endKey));
```
deleteRange(...) increments delRangeCount but does not update batchSize/discard accounting. This makes batchSizeDiscardedString() and the “discarding changes” warning unreliable for batches that include delete-range ops. Consider adding the range op’s totalLength() to batchSize (and/or tracking delete-range bytes separately) for consistent accounting.
Suggested change:
```diff
-ops.put(opIndex.getAndIncrement(), new DeleteRangeOperation(startKey, endKey));
+DeleteRangeOperation op = new DeleteRangeOperation(startKey, endKey);
+batchSize += op.totalLength();
+ops.put(opIndex.getAndIncrement(), op);
```
```java
 * Test class for verifying batch operations with delete ranges using the
 * RDBBatchOperation and MockedConstruction of ManagedWriteBatch.
 *
 * This test class includes:
 * - Mocking and tracking of operations including put, delete, and delete range
 *   within a batch operation.
 * - Validation of committed operations using assertions on collected data.
 * - Ensures that the batch operation interacts correctly with the
 *   RocksDatabase and ColumnFamilyHandle components.
 *
 * The test method includes:
 * 1. Setup of mocked ColumnFamilyHandle and RocksDatabase.ColumnFamily.
 * 2. Mocking of methods to track operations performed on*/
```
The class-level Javadoc is incomplete/inaccurate (it ends mid-sentence at “performed on” and mentions “MockedConstruction of ManagedWriteBatch” even though the test uses TrackingUtilManagedWriteBatchForTesting). Please fix/trim the Javadoc so it accurately reflects the current test setup.
Suggested change:
```diff
- * Test class for verifying batch operations with delete ranges using the
- * RDBBatchOperation and MockedConstruction of ManagedWriteBatch.
- *
- * This test class includes:
- * - Mocking and tracking of operations including put, delete, and delete range
- *   within a batch operation.
- * - Validation of committed operations using assertions on collected data.
- * - Ensures that the batch operation interacts correctly with the
- *   RocksDatabase and ColumnFamilyHandle components.
- *
- * The test method includes:
- * 1. Setup of mocked ColumnFamilyHandle and RocksDatabase.ColumnFamily.
- * 2. Mocking of methods to track operations performed on*/
+ * Tests batch operations (including delete ranges) using {@link RDBBatchOperation}
+ * together with {@link TrackingUtilManagedWriteBatchForTesting} to track and
+ * verify the puts, deletes, and delete-range operations applied to RocksDB
+ * column families.
+ */
```
```java
// ops map key should be <<startKey, endKey>, 1>
testTable.deleteRangeWithBatch(batch, startKey, endKey);
testTable.putWithBatch(batch, startKey, value2);
testTable.putWithBatch(batch, keyInRange1, value2);
// ops map key is <<startKey, keyInRange1>, 2>.
testTable.deleteRangeWithBatch(batch, startKey, keyInRange1);
testTable.putWithBatch(batch, endKey, value1);
testTable.putWithBatch(batch, endKey, value2);
// ops map key is <<startKey, endKey>, 3>.
```
The comments about the internal “ops map key” (e.g., <<startKey, endKey>, 1>) no longer match the current RDBBatchOperation implementation (ops are keyed by an integer index and delete-ranges are separate ops). These comments are misleading and should be updated or removed to avoid coupling the test to stale internal details.
Suggested change:
```diff
-// ops map key should be <<startKey, endKey>, 1>
 testTable.deleteRangeWithBatch(batch, startKey, endKey);
 testTable.putWithBatch(batch, startKey, value2);
 testTable.putWithBatch(batch, keyInRange1, value2);
-// ops map key is <<startKey, keyInRange1>, 2>.
 testTable.deleteRangeWithBatch(batch, startKey, keyInRange1);
 testTable.putWithBatch(batch, endKey, value1);
 testTable.putWithBatch(batch, endKey, value2);
-// ops map key is <<startKey, endKey>, 3>.
```
```java
this.startKey = Objects.requireNonNull(startKey, "startKey == null");
this.endKey = Objects.requireNonNull(endKey, "endKey == null");
this.startKeyBytes = new Bytes(startKey);
this.endKeyBytes = new Bytes(endKey);
```
DeleteRangeOperation keeps references to the caller-provided startKey/endKey byte arrays. If those arrays are reused or mutated by the caller after enqueueing, the batch contents and range matching (contains) can become incorrect. Consider defensively copying the arrays (and using the copies for both the byte[] fields and the Bytes wrappers).
Suggested change:
```diff
-this.startKey = Objects.requireNonNull(startKey, "startKey == null");
-this.endKey = Objects.requireNonNull(endKey, "endKey == null");
-this.startKeyBytes = new Bytes(startKey);
-this.endKeyBytes = new Bytes(endKey);
+Objects.requireNonNull(startKey, "startKey == null");
+Objects.requireNonNull(endKey, "endKey == null");
+this.startKey = startKey.clone();
+this.endKey = endKey.clone();
+this.startKeyBytes = new Bytes(this.startKey);
+this.endKeyBytes = new Bytes(this.endKey);
```
```java
  overwriteIfExists(new DeleteOp(key));
}

void deleteRange(byte[] startKey, byte[] endKey) {
```
FamilyCache.deleteRange(...) bypasses the isCommit guard used by put/delete (via overwriteIfExists). If deleteRange is called after prepareBatchWrite() has started, the operation can be silently dropped (added after opList is built). Add the same Preconditions.checkState(!isCommit, ...) protection here.
Suggested change:
```diff
 void deleteRange(byte[] startKey, byte[] endKey) {
+  Preconditions.checkState(!isCommit, "%s is already committed.", this);
```
What changes were proposed in this pull request?
The PrepareBatchOperation looping in the HDDS-13415 (#8774) implementation was very convoluted. It also misses a case where a putKey/deleteKey can still get added even though a later deleteRange in the batch, after the preceding run of continuous deleteRange ops, covers it. The following example explains the scenario better.
1. Put Key1
2. DeleteRange Key2 - Key5
3. Put Key2
4. DeleteRange Key1 - Key4
Here, Op1 should ideally be cancelled by Op4. Currently both Op1 and Op4 get executed; however, it would be more optimal to execute only Op4, since Op1 is redundant.
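The redundancy rule above can be sketched as a pass that drops any single-key op covered by a LATER delete-range. This is a minimal standalone sketch with String keys and an illustrative `Op` type, not the actual RDBBatchOperation code; it uses the same O(n * k) scan this PR proposes. Note that under this rule, Op3 in the example above (Put Key2) is also covered by Op4 and dropped:

```java
import java.util.ArrayList;
import java.util.List;

class RedundancySketch {
  // Minimal op model for the sketch; the real RDBBatchOperation keeps more state.
  static final class Op {
    final String type;    // "put", "delete", or "deleteRange"
    final String key;     // single key, or range start for deleteRange
    final String endKey;  // range end (exclusive) for deleteRange, else null

    Op(String type, String key, String endKey) {
      this.type = type;
      this.key = key;
      this.endKey = endKey;
    }
  }

  // O(n * k): a single-key op is redundant if any LATER delete-range covers its key.
  static List<Op> dropRedundant(List<Op> ops) {
    List<Op> kept = new ArrayList<>();
    for (int i = 0; i < ops.size(); i++) {
      Op op = ops.get(i);
      boolean redundant = false;
      if (!op.type.equals("deleteRange")) {
        for (int j = i + 1; j < ops.size(); j++) {
          Op later = ops.get(j);
          if (later.type.equals("deleteRange")
              && op.key.compareTo(later.key) >= 0
              && op.key.compareTo(later.endKey) < 0) {
            redundant = true; // a later range delete wipes this key anyway
            break;
          }
        }
      }
      if (!redundant) {
        kept.add(op);
      }
    }
    return kept;
  }
}
```

Running this over the four ops in the example keeps only the two DeleteRange ops: Op1 (Key1) and Op3 (Key2) both fall inside Op4's range [Key1, Key4).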
Initially, while implementing deleteRangeWithBatch, I had planned to optimize the deleteRange search by sorting the continuous ranges and doing a binary search to find the range matching an entry, to reduce overall complexity. However, we saw the code was getting more complex without much return, since deleteRange was going to be an infrequent op, used only by snapshot create.

But deleteRange is also going to be used by DirectoryPurgeRequest for removing entries from the file table and directory table. This optimization would help reduce the complexity of RocksDB compactions when a key is committed and also deleted within the same double buffer batch.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14239
How was this patch tested?
Updated unit tests. I intend to add another unit test suggested in this comment: #8774 (comment)