
HDDS-14239. Simplify Prepare Batch looping for DeleteRange Op processing in RDBBatchOperation #9553

Draft
swamirishi wants to merge 4 commits into apache:master from swamirishi:HDDS-14239

Conversation

@swamirishi
Contributor

@swamirishi swamirishi commented Dec 24, 2025

What changes were proposed in this pull request?

The prepare-batch looping in the HDDS-13415 (#8774) implementation was very convoluted. It also missed a case: a putKey/deleteKey could still be added to the batch even though a later deleteRange, beyond the immediately following run of continuous deleteRanges, covers the same key. The following example illustrates the scenario.

  1. Put Key1

  2. DeleteRange Key2 - Key5

  3. Put Key2

  4. DeleteRange Key1 - Key4

Here Op1 should ideally be cancelled by Op4. Currently both Op1 and Op4 get executed; it would be more optimal to execute only Op4, since Op1 is redundant.
Initially, while implementing deleteRangeWithBatch, I had planned to optimize the deleteRange search by sorting the continuous ranges and doing a binary search to find the range matching an entry, reducing the overall complexity. However, the code was getting more complex without much return, since deleteRange was expected to be infrequent and used only by snapshot create.
But deleteRange is also going to be used by DirectoryPurgeRequest to remove entries from the file table and directory table. This optimization would help reduce RocksDB compaction work when a key is committed and also deleted within the same double-buffer batch.
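The redundancy rule described above can be sketched in plain Java. This is a toy model with invented names, not the actual RDBBatchOperation code: a single-key op is droppable when any later delete-range covers its key. Note that in this model the final DELETE RANGE Key1-Key4 also covers Key2, so Put Key2 is dropped as well.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the redundancy check; NOT the real RDBBatchOperation code. */
public class BatchRedundancySketch {

  static final class Op {
    final String kind;   // "PUT", "DEL", or "RANGE"
    final String start;  // key for PUT/DEL, range start for RANGE
    final String end;    // exclusive range end for RANGE, null otherwise

    Op(String kind, String start, String end) {
      this.kind = kind;
      this.start = start;
      this.end = end;
    }
  }

  /** Drop a single-key op when ANY later delete-range covers its key
   *  (RocksDB deleteRange semantics: [start, end)). */
  public static List<Op> prune(List<Op> ops) {
    List<Op> kept = new ArrayList<>();
    for (int i = 0; i < ops.size(); i++) {
      Op op = ops.get(i);
      boolean redundant = false;
      if (!op.kind.equals("RANGE")) {
        for (int j = i + 1; j < ops.size(); j++) {
          Op later = ops.get(j);
          if (later.kind.equals("RANGE")
              && later.start.compareTo(op.start) <= 0
              && op.start.compareTo(later.end) < 0) {
            redundant = true;  // a later delete-range wipes this key anyway
            break;
          }
        }
      }
      if (!redundant) {
        kept.add(op);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Op> ops = new ArrayList<>();
    ops.add(new Op("PUT", "Key1", null));     // 1. Put Key1
    ops.add(new Op("RANGE", "Key2", "Key5")); // 2. DeleteRange Key2 - Key5
    ops.add(new Op("PUT", "Key2", null));     // 3. Put Key2
    ops.add(new Op("RANGE", "Key1", "Key4")); // 4. DeleteRange Key1 - Key4
    // Op4 covers Key1 (and Key2), so both puts are redundant in this model.
    System.out.println(prune(ops).size() + " ops kept");  // 2 ops kept
  }
}
```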

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14239

How was this patch tested?

Updated unit tests. I intend to add another unit test suggested in this comment
#8774 (comment)

@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Jan 17, 2026
@aryangupta1998
Contributor

@swamirishi, are you working on this?

@swamirishi
Contributor Author

@swamirishi, are you working on this?
I have a branch ready locally. I will push the code shortly, after resolving some merge conflicts.

…nges and simple logic

Change-Id: I7a06f5a427ba7d1c4f760ff70fcb4093abaa3cbd
@swamirishi
Contributor Author

@szetszwo @jojochuang Can you take a look at this change?

@swamirishi
Contributor Author

@smengcl this needs to be reviewed before #9690

@szetszwo
Contributor

szetszwo commented Feb 6, 2026

@swamirishi , in our current usage, is it the case that DeleteRange won't mix with other SingleKeyOp? If so, we only have to support batch DeleteRange and batch SingleKeyOp separately, and don't have to support mixing them together.

@swamirishi
Contributor Author

swamirishi commented Feb 6, 2026

@swamirishi , in our current usage, is it the case that DeleteRange won't mix with other SingleKeyOp? If so, we only have to support batch DeleteRange and batch SingleKeyOp separately, and don't have to support mixing them together.

No, that is not the case @szetszwo; it completely depends on the transactions that happen together in the same double buffer. This sort of intersection can happen even within the same DirectoryPurgeRequest. Look at #9423: the DirectoryDeletingService purge-path request could be opening up a deleted sub-directory (moving it from the DirectoryTable to the deletedDirectoryTable) and also deleting the same entry from the deletedDirectoryTable, because of optimizeDirDeleteAndSubmitRequest, in the same transaction.
This would unnecessarily increase RocksDB writes and compactions.

@swamirishi
Contributor Author

This PR #9690 improves the efficiency of finding the intersection from O(n) to O(log n)

@szetszwo
Contributor

szetszwo commented Feb 6, 2026

... it completely depends on the transactions that happen together in the same double buffer ...

@swamirishi , I agree. I mean that the current code does not have such transactions and we are not going to support such code.

... improves the efficiency ...

Let's talk about the correctness before talking about efficiency. As you mentioned in the description, the previous implementation (quoted below) indeed has a bug. So this change is very risky.

... It also misses a case where a putKey/deleteKey can get added even though a deleteRange has been executed in the next batch after the following continuousDeleteRange batch. ...

@swamirishi
Contributor Author

swamirishi commented Feb 7, 2026

... it completely depends on the transactions that happen together in the same double buffer ...

@swamirishi , I agree. I mean that the current code does not have such transactions and we are not going to support such code.

... improves the efficiency ...

Let's talk about the correctness before talking about efficiency. As you mentioned in the description, the previous implementation (quoted below) indeed has a bug. So this change is very risky.

... It also misses a case where a putKey/deleteKey can get added even though a deleteRange has been executed in the next batch after the following continuousDeleteRange batch. ...

It was not a bug, just inefficient. We are good as long as the order of operations is correct.
For instance:
[PUT K1, DELETE RANGE K5-K6, PUT K5, DELETE RANGE K1-K4] is equivalent to [DELETE RANGE K5-K6, PUT K5, DELETE RANGE K1-K4]
For the single-key op PUT K1, the previous implementation only looked at the very next DeleteRange operation, i.e. DELETE RANGE K5-K6, and performed the first set of operations when there was no intersection. The implementation proposed in this PR optimizes further and peeks at all the DeleteRanges that come after: after PUT K1 there are two, DELETE RANGE K5-K6 and DELETE RANGE K1-K4, and K1 intersects DELETE RANGE K1-K4, so PUT K1 can be skipped. Even if it is performed it is not an issue, just redundant. So the issue was never about correctness and doesn't constitute a bug.


We had consciously done it the earlier way, since DELETE RANGE was only being used by snapshot CREATE, and there would be no intersection of a DELETE RANGE with any operation beyond one transaction, because SNAPSHOT CREATE splits the double-buffer batch. The change proposed here is for more generic use cases like @aryangupta1998's PR, and is also a safe change. The earlier implementation was O(n + k); with this PR it becomes O(n * k), where n is the number of single-key ops and k is the number of DeleteRange operations in a batch (P.S. we don't anticipate many delete-ranges in one batch; k should ideally be much smaller than n). The next PR brings this down to O(n log k).
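The O(n log k) bound comes from replacing the linear scan over delete-ranges with a binary search. A minimal sketch of the idea, assuming the k ranges are kept disjoint and sorted by start key (class and method names are invented; this is not the #9690 code):

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy index over k DISJOINT delete-ranges sorted by start key; membership
 *  is a single floorEntry lookup, i.e. O(log k) per single-key op instead
 *  of a linear scan over all ranges. Invented names; not the #9690 code. */
public class RangeIndexSketch {
  // start key -> exclusive end key; the caller must keep ranges disjoint
  private final TreeMap<String, String> ranges = new TreeMap<>();

  public void addRange(String start, String end) {
    ranges.put(start, end);
  }

  /** True if some range [start, end) contains the key. */
  public boolean covers(String key) {
    Map.Entry<String, String> e = ranges.floorEntry(key);
    return e != null && key.compareTo(e.getValue()) < 0;
  }

  public static void main(String[] args) {
    RangeIndexSketch idx = new RangeIndexSketch();
    idx.addRange("Key1", "Key4");
    idx.addRange("Key5", "Key7");
    System.out.println(idx.covers("Key2"));  // true: inside [Key1, Key4)
    System.out.println(idx.covers("Key4"));  // false: end key is exclusive
  }
}
```

Overlapping ranges would first have to be merged into disjoint intervals for a single floorEntry probe to be sufficient.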

@szetszwo
Contributor

szetszwo commented Feb 7, 2026

@swamirishi , sorry, this does not look like a safe change since it dramatically changes RDBBatchOperation. In particular, this PR adds a new Map<Integer, Op> ops to support an unneeded case: mixing DeleteRange and SingleKeyOp in a batch.

  • I propose supporting pure batch DeleteRange (but not mixing in other SingleKeyOp). The code would be much simpler.

@swamirishi
Contributor Author

swamirishi commented Feb 7, 2026

@swamirishi , sorry, this does not look like a safe change since it dramatically changes RDBBatchOperation. In particular, this PR adds a new Map<Integer, Op> ops to support an unneeded case: mixing DeleteRange and SingleKeyOp in a batch.

  • I propose supporting pure batch DeleteRange (but not mixing in other SingleKeyOp). The code would be much simpler.

That is not even possible. It would mean the double buffer would have to understand how the DBResponse interface is going to flush the transaction to the db. One transaction can do puts, deletes, and delete ranges. We can never guarantee that a transaction will have only DeleteRange operations, and all of them need to be flushed in the same RocksDB batch for atomicity guarantees.
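The point about keeping a transaction's ops together can be illustrated with a toy model (invented names; a TreeMap stands in for a RocksDB column family): puts, deletes, and delete-ranges from one transaction apply together and in order, which is why they cannot be split across separate batches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

/** Toy model: one transaction's puts, deletes, and delete-ranges are applied
 *  together and in order; the TreeMap stands in for a RocksDB column family.
 *  Splitting them into separate batches would break this ordering. */
public class MixedBatchSketch {

  /** Each op is {"PUT", key, value}, {"DEL", key}, or {"RANGE", start, end}. */
  public static void apply(TreeMap<String, String> cf, List<String[]> ops) {
    for (String[] op : ops) {
      switch (op[0]) {
        case "PUT":   cf.put(op[1], op[2]); break;
        case "DEL":   cf.remove(op[1]); break;
        case "RANGE": cf.subMap(op[1], op[2]).clear(); break;  // [start, end)
        default: throw new IllegalArgumentException(op[0]);
      }
    }
  }

  public static void main(String[] args) {
    TreeMap<String, String> cf = new TreeMap<>();
    apply(cf, Arrays.asList(
        new String[] {"PUT", "Key1", "v1"},
        new String[] {"RANGE", "Key1", "Key3"},
        new String[] {"PUT", "Key2", "v2"}));  // after the range, so it survives
    System.out.println(cf.keySet());  // [Key2]
  }
}
```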

@szetszwo
Contributor

szetszwo commented Feb 7, 2026

That is not even possible. ...

Today, we don't have batch DeleteRange and everything works. If we add pure batch DeleteRange tomorrow, would anything stop working?

Change-Id: Ia3db219dab70a7a2faf82009e8f0dac59c665938
Change-Id: Icae0e1299fcb7d37bd5a0f3c2f405537b1d68f3e
@swamirishi
Contributor Author

swamirishi commented Feb 7, 2026

That is not even possible. ...

Today, we don't have batch DeleteRange and everything works. If we add pure batch DeleteRange tomorrow, would anything stop working?

We don't have deleteRange because we reverted #8774, and snapshot create was using the delete-range operation. Without this patch we are going to kill snapshot create performance and lose the optimization that improves RocksDB performance for huge deletes.

@szetszwo
Contributor

szetszwo commented Feb 8, 2026

... snapshot create was using delete range operation. ...

I see your point. How about extending RDBBatchOperation in a subclass, say RDBBatchOperationWithDeleteRange? The new code goes into RDBBatchOperationWithDeleteRange while keeping RDBBatchOperation unchanged.

In case there is a bug, people not using OM snapshot won't be affected.

@swamirishi
Contributor Author

swamirishi commented Feb 8, 2026

I see your point. How about extending RDBBatchOperation in a subclass, say RDBBatchOperationWithDeleteRange? The new code goes into RDBBatchOperationWithDeleteRange while keeping RDBBatchOperation unchanged.

Are you saying we should have separate implementations and make it configurable? Then we would have to branch the logic in OmResponse, which would need to understand which implementation has been passed:

protected abstract void addToDBBatch(OMMetadataManager omMetadataManager,
    BatchOperation batchOperation) throws IOException;

@szetszwo
Contributor

szetszwo commented Feb 9, 2026

Are you saying we have separate implementation and make it configurable?

Configurable is a good idea:

  • People not using snapshot can use RDBBatchOperation.
  • Even people using snapshot could choose to use RDBBatchOperation (DeleteRange can be implemented by iterating Delete). This is very useful in case there is a bug in the new code.
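The fallback mentioned in the second bullet, implementing DeleteRange by iterating Delete, might look like this sketch (a TreeMap stands in for the table; real code would walk a table iterator and issue a single delete per key):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy fallback: emulate deleteRange(start, end) with single deletes over
 *  [start, end). The TreeMap stands in for the table; the real code would
 *  drive a table iterator and delete each key individually in the batch. */
public class DeleteRangeFallbackSketch {

  /** Returns the number of keys deleted. */
  public static int deleteRange(NavigableMap<String, String> table,
      String start, String end) {
    NavigableMap<String, String> view = table.subMap(start, true, end, false);
    int deleted = view.size();
    view.clear();  // removes the covered entries from the backing map
    return deleted;
  }

  public static void main(String[] args) {
    TreeMap<String, String> table = new TreeMap<>();
    for (String k : new String[] {"Key1", "Key2", "Key3", "Key4", "Key5"}) {
      table.put(k, "v");
    }
    System.out.println(deleteRange(table, "Key2", "Key5"));  // 3
    System.out.println(table.keySet());                      // [Key1, Key5]
  }
}
```

This keeps the RDBBatchOperation write path unchanged at the cost of one delete per covered key instead of a single native range tombstone.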

Then it would mean we would have to branch the logic in OmResponse, which would have to understand which implementation has been passed

No. We could add a RDBBatchOperationFactory.
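A factory along those lines might look like the following (an entirely hypothetical sketch; the real RDBBatchOperationFactory, interfaces, and configuration wiring would differ):

```java
/** Hypothetical sketch of the factory idea: a flag (e.g. from configuration)
 *  picks the delete-range-aware batch implementation, so callers such as the
 *  OM response classes never branch on the concrete type. Invented names. */
public class RDBBatchOperationFactorySketch {

  public interface BatchOperation {
    String name();
  }

  static final class PlainBatch implements BatchOperation {
    public String name() { return "RDBBatchOperation"; }
  }

  static final class DeleteRangeBatch implements BatchOperation {
    public String name() { return "RDBBatchOperationWithDeleteRange"; }
  }

  /** The only decision point; everything else codes against BatchOperation. */
  public static BatchOperation newBatchOperation(boolean deleteRangeEnabled) {
    return deleteRangeEnabled ? new DeleteRangeBatch() : new PlainBatch();
  }

  public static void main(String[] args) {
    System.out.println(newBatchOperation(true).name());
    System.out.println(newBatchOperation(false).name());
  }
}
```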

  <ranger.version>2.7.0</ranger.version>
  <!-- versions included in ratis-thirdparty, update in sync -->
- <ratis-thirdparty.grpc.version>1.75.0</ratis-thirdparty.grpc.version>
+ <ratis-thirdparty.grpc.version>1.77.0</ratis-thirdparty.grpc.version>
Contributor

@smengcl smengcl Feb 12, 2026

The bump doesn't look to be required in this PR?

@github-actions

github-actions bot commented Mar 6, 2026

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Mar 6, 2026
@jojochuang jojochuang added snapshot https://issues.apache.org/jira/browse/HDDS-6517 and removed stale labels Mar 9, 2026
@smengcl smengcl requested a review from Copilot March 24, 2026 19:57
@smengcl
Contributor

smengcl commented Mar 24, 2026

@swamirishi Is this ready for review? Is this marked as draft only for CI reason?

Contributor

Copilot AI left a comment

Pull request overview

This PR simplifies and optimizes RocksDB batch preparation when DeleteRange operations are present, and exposes deleteRangeWithBatch through the Table API so callers (e.g., OM snapshot cleanup) can use native RocksDB range deletes instead of iterating keys.

Changes:

  • Add Table#deleteRangeWithBatch(...) and implement it across TypedTable, RDBTable, and wrappers.
  • Extend RDBBatchOperation to support deleteRange(...) and to discard earlier single-key ops that are made redundant by later delete-range ops.
  • Update OM snapshot bucket-prefix cleanup to use a batch delete-range; add/adjust unit tests for delete-range behavior and op ordering.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pom.xml Bumps ratis-thirdparty gRPC version.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java Switches prefix deletions to deleteRangeWithBatch instead of iterating keys.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/Table.java Adds new deleteRangeWithBatch API to Table.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/TypedTable.java Implements deleteRangeWithBatch by delegating to raw table with encoded keys.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBTable.java Implements deleteRangeWithBatch via RDBBatchOperation.deleteRange.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RocksDatabase.java Adds ColumnFamily.batchDeleteRange(...) helper.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBBatchOperation.java Adds delete-range op type and updates batch preparation to account for delete-range redundancy.
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/metadata/DatanodeTable.java Wires through the new deleteRangeWithBatch API in the wrapper.
hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/InMemoryTestTable.java Adds the new method (unsupported in this in-memory test table).
hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBTableStore.java Adds tests validating delete-range batch semantics and ordering.
hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBBatchOperation.java Updates batch-operation tests to include delete-range behavior and randomized coverage.


Comment on lines 535 to 558
  String getCommitString() {
    int putCount = 0;
    int delCount = 0;
    int opSize = 0;
    int discardedCount = 0;
    int discardedSize = 0;
    int delRangeCount = 0;

    for (FamilyCache f : name2cache.values()) {
      putCount += f.putCount;
      delCount += f.delCount;
      opSize += f.batchSize;
      discardedCount += f.discardedCount;
      discardedSize += f.discardedSize;
      delRangeCount += f.delRangeCount;
    }

    final int opCount = putCount + delCount;
    return String.format(
-       "#put=%s, #del=%s, batchSize: %s, discarded: %s, committed: %s",
-       putCount, delCount,
+       "#put=%s, #del=%s, #delRange=%s, batchSize: %s, discarded: %s, committed: %s",
+       putCount, delCount, delRangeCount,
        countSize2String(opCount, opSize),
        countSize2String(discardedCount, discardedSize),
        countSize2String(opCount - discardedCount, opSize - discardedSize));
Copilot AI Mar 24, 2026

FamilyCache accounting/logging does not include delete-range ops: deleteRange(...) doesn't update batchSize, and getCommitString() computes opCount as putCount + delCount (excluding delRangeCount). This makes commit stats (and the discarded-changes warning) inaccurate, especially for batches containing only delete-range operations. Track delete-range sizes in batchSize and include delRangeCount in the operation counts used for logging.


  void deleteRange(byte[] startKey, byte[] endKey) {
    delRangeCount++;
    ops.put(opIndex.getAndIncrement(), new DeleteRangeOperation(startKey, endKey));
Copilot AI Mar 24, 2026

deleteRange(...) increments delRangeCount but does not update batchSize/discard accounting. This makes batchSizeDiscardedString() and the “discarding changes” warning unreliable for batches that include delete-range ops. Consider adding the range op’s totalLength() to batchSize (and/or tracking delete-range bytes separately) for consistent accounting.

Suggested change
- ops.put(opIndex.getAndIncrement(), new DeleteRangeOperation(startKey, endKey));
+ DeleteRangeOperation op = new DeleteRangeOperation(startKey, endKey);
+ batchSize += op.totalLength();
+ ops.put(opIndex.getAndIncrement(), op);

Comment on lines +60 to +72
* Test class for verifying batch operations with delete ranges using the
* RDBBatchOperation and MockedConstruction of ManagedWriteBatch.
*
* This test class includes:
* - Mocking and tracking of operations including put, delete, and delete range
* within a batch operation.
* - Validation of committed operations using assertions on collected data.
* - Ensures that the batch operation interacts correctly with the
* RocksDatabase and ColumnFamilyHandle components.
*
* The test method includes:
* 1. Setup of mocked ColumnFamilyHandle and RocksDatabase.ColumnFamily.
* 2. Mocking of methods to track operations performed on*/
Copilot AI Mar 24, 2026

The class-level Javadoc is incomplete/inaccurate (it ends mid-sentence at “performed on” and mentions “MockedConstruction of ManagedWriteBatch” even though the test uses TrackingUtilManagedWriteBatchForTesting). Please fix/trim the Javadoc so it accurately reflects the current test setup.

Suggested change
- * Test class for verifying batch operations with delete ranges using the
- * RDBBatchOperation and MockedConstruction of ManagedWriteBatch.
- *
- * This test class includes:
- * - Mocking and tracking of operations including put, delete, and delete range
- *   within a batch operation.
- * - Validation of committed operations using assertions on collected data.
- * - Ensures that the batch operation interacts correctly with the
- *   RocksDatabase and ColumnFamilyHandle components.
- *
- * The test method includes:
- * 1. Setup of mocked ColumnFamilyHandle and RocksDatabase.ColumnFamily.
- * 2. Mocking of methods to track operations performed on*/
+ * Tests batch operations (including delete ranges) using {@link RDBBatchOperation}
+ * together with {@link TrackingUtilManagedWriteBatchForTesting} to track and
+ * verify the puts, deletes, and delete-range operations applied to RocksDB
+ * column families.
+ */

Comment on lines +832 to +840
// ops map key should be <<startKey, endKey>, 1>
testTable.deleteRangeWithBatch(batch, startKey, endKey);
testTable.putWithBatch(batch, startKey, value2);
testTable.putWithBatch(batch, keyInRange1, value2);
// ops map key is <<startKey, keyInRange1>, 2>.
testTable.deleteRangeWithBatch(batch, startKey, keyInRange1);
testTable.putWithBatch(batch, endKey, value1);
testTable.putWithBatch(batch, endKey, value2);
// ops map key is <<startKey, endKey>, 3>.
Copilot AI Mar 24, 2026

The comments about the internal “ops map key” (e.g., <<startKey, endKey>, 1>) no longer match the current RDBBatchOperation implementation (ops are keyed by an integer index and delete-ranges are separate ops). These comments are misleading and should be updated or removed to avoid coupling the test to stale internal details.

Suggested change
- // ops map key should be <<startKey, endKey>, 1>
  testTable.deleteRangeWithBatch(batch, startKey, endKey);
  testTable.putWithBatch(batch, startKey, value2);
  testTable.putWithBatch(batch, keyInRange1, value2);
- // ops map key is <<startKey, keyInRange1>, 2>.
  testTable.deleteRangeWithBatch(batch, startKey, keyInRange1);
  testTable.putWithBatch(batch, endKey, value1);
  testTable.putWithBatch(batch, endKey, value2);
- // ops map key is <<startKey, endKey>, 3>.

Comment on lines +263 to +266
this.startKey = Objects.requireNonNull(startKey, "startKey == null");
this.endKey = Objects.requireNonNull(endKey, "endKey == null");
this.startKeyBytes = new Bytes(startKey);
this.endKeyBytes = new Bytes(endKey);
Copilot AI Mar 24, 2026

DeleteRangeOperation keeps references to the caller-provided startKey/endKey byte arrays. If those arrays are reused or mutated by the caller after enqueueing, the batch contents and range matching (contains) can become incorrect. Consider defensively copying the arrays (and using the copies for both the byte[] fields and the Bytes wrappers).

Suggested change
- this.startKey = Objects.requireNonNull(startKey, "startKey == null");
- this.endKey = Objects.requireNonNull(endKey, "endKey == null");
- this.startKeyBytes = new Bytes(startKey);
- this.endKeyBytes = new Bytes(endKey);
+ Objects.requireNonNull(startKey, "startKey == null");
+ Objects.requireNonNull(endKey, "endKey == null");
+ this.startKey = startKey.clone();
+ this.endKey = endKey.clone();
+ this.startKeyBytes = new Bytes(this.startKey);
+ this.endKeyBytes = new Bytes(this.endKey);

overwriteIfExists(new DeleteOp(key));
}

void deleteRange(byte[] startKey, byte[] endKey) {
Copilot AI Mar 24, 2026

FamilyCache.deleteRange(...) bypasses the isCommit guard used by put/delete (via overwriteIfExists). If deleteRange is called after prepareBatchWrite() has started, the operation can be silently dropped (added after opList is built). Add the same Preconditions.checkState(!isCommit, ...) protection here.

Suggested change
  void deleteRange(byte[] startKey, byte[] endKey) {
+   Preconditions.checkState(!isCommit, "%s is already committed.", this);


Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517
