-
Notifications
You must be signed in to change notification settings - Fork 2.5k
fix: GI-18072 - Allow retries on marker file creation requests #18073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| // Idempotent retry: marker already created with same requestId | ||
| if (markerToRequestIdMap.containsKey(markerName)) { | ||
| String existingRequestId = markerToRequestIdMap.get(markerName); | ||
| String normalizedRequestId = requestId == null ? NULL_REQUEST_ID : requestId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the requestId is null we should just skip this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can incoming requestId could be null.
when this release goes out, any marker request to TLS should have a request Id set right?
| continue; | ||
| } | ||
| future.setIsSuccessful(!exists); | ||
| if (conflictDetectionStrategy.isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just simplify this to else if?
...-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
Outdated
Show resolved
Hide resolved
|
|
||
| /** | ||
| * Syncs all markers maintained in the underlying files under the marker directory in the file system. | ||
| * When TLS recovers from crash, we have no request IDs so store sentinel value for each marker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we cannot know what the requestID was, we can avoid backfilling this map with the NULL_REQUEST_ID and let the requests fail to be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but to fail the request, we end up looking up in markerToRequestIdMap right. So, to even fail a marker request call to an already existing marker, we need to populate entries in the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we will allow the request to succeed if they are both NULL_REQUEST_ID which is different from existing behavior where we fail in this case. I think it is easier to reason about if we don't backfill this map.
|
one minor correction in PR desc. we are fixing the markers retries for rollbacks and not cleaner. |
| // Idempotent retry: marker already created with same requestId | ||
| if (markerToRequestIdMap.containsKey(markerName)) { | ||
| String existingRequestId = markerToRequestIdMap.get(markerName); | ||
| String normalizedRequestId = requestId == null ? NULL_REQUEST_ID : requestId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can incoming requestId could be null.
when this release goes out, any marker request to TLS should have a request Id set right?
...e-service/src/main/java/org/apache/hudi/timeline/service/handlers/marker/MarkerDirState.java
Outdated
Show resolved
Hide resolved
|
|
||
| /** | ||
| * Syncs all markers maintained in the underlying files under the marker directory in the file system. | ||
| * When TLS recovers from crash, we have no request IDs so store sentinel value for each marker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but to fail the request, we end up looking up in markerToRequestIdMap right. So, to even fail a marker request call to an already existing marker, we need to populate entries in the map.
...e-service/src/main/java/org/apache/hudi/timeline/service/handlers/marker/MarkerDirState.java
Show resolved
Hide resolved
...e-service/src/main/java/org/apache/hudi/timeline/service/handlers/marker/MarkerDirState.java
Outdated
Show resolved
Hide resolved
hudi-timeline-service/src/test/java/org/apache/hudi/timeline/service/TestRequestHandler.java
Outdated
Show resolved
Hide resolved
hudi-timeline-service/src/test/java/org/apache/hudi/timeline/service/TestRequestHandler.java
Outdated
Show resolved
Hide resolved
...e-service/src/main/java/org/apache/hudi/timeline/service/handlers/marker/MarkerDirState.java
Outdated
Show resolved
Hide resolved
|
|
||
| /** | ||
| * Syncs all markers maintained in the underlying files under the marker directory in the file system. | ||
| * When TLS recovers from crash, we have no request IDs so store sentinel value for each marker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we will allow the request to succeed if they are both NULL_REQUEST_ID which is different from existing behavior where we fail in this case. I think it is easier to reason about if we don't backfill this map.
Describe the issue this Pull Request addresses
rollback gets triggered (internally) when clean() is called. Rollbacks are failed due to marker file creation retries. #18072
When using remote marker files, if a request times out but the marker is actually created successfully, the subsequent retry fails because the marker already exists. Retries in this scenario should succeed.
Summary and Changelog
This PR improves retry behavior for remote marker file creation by making marker operations idempotent.
Impact
Reduces rollback failures (via cleaner) caused by non-idempotent marker retries and improves robustness of remote marker handling under transient network issues.
Risk Level
low — change is scoped to retry/idempotency behavior for remote marker requests. Verification includes ensuring retries succeed when markers already exist due to previous timed-out requests.
Documentation Update
none
Contributor's checklist