Conversation

@PavithranRick
Contributor

@PavithranRick PavithranRick commented Feb 2, 2026

Describe the issue this Pull Request addresses

Rollback gets triggered (internally) when clean() is called. Rollbacks fail due to marker file creation retries. #18072

When using remote marker files, if a request times out but the marker is actually created successfully, the subsequent retry fails because the marker already exists. Retries in this scenario should succeed.

Summary and Changelog

This PR improves retry behavior for remote marker file creation by making marker operations idempotent.

  • Handles timeout scenarios where the marker was created but the client did not receive an acknowledgement.
  • Introduces a request ID (or similar tracking mechanism) to safely support idempotent retries without creating additional risks.
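The idempotent-retry idea described above can be sketched roughly as follows. This is an illustrative sketch only: the class and method names are hypothetical (not Hudi's actual API), and an in-memory map stands in for the real remote marker handling. A retry carrying the same request ID as the original (timed-out) attempt is treated as a success.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of idempotent marker creation keyed by request ID.
public class MarkerStore {
  private final Map<String, String> markerToRequestIdMap = new ConcurrentHashMap<>();

  /**
   * Returns true if the marker was newly created, or if it was already
   * created by an earlier attempt carrying the same requestId (i.e. a
   * retry of a request that timed out after the marker was written).
   */
  public boolean createMarker(String markerName, String requestId) {
    String existing = markerToRequestIdMap.putIfAbsent(markerName, requestId);
    if (existing == null) {
      return true;                     // first creation wins
    }
    return existing.equals(requestId); // same-request retry succeeds; others fail
  }
}
```

With this shape, a genuine conflict (a different writer creating the same marker) still fails, while a retry of the same logical request succeeds.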

Impact

Reduces rollback failures (via cleaner) caused by non-idempotent marker retries and improves robustness of remote marker handling under transient network issues.

Risk Level

Low. The change is scoped to retry/idempotency behavior for remote marker requests. Verification includes ensuring that retries succeed when markers already exist due to previously timed-out requests.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@PavithranRick PavithranRick changed the title GI-18072 - Allow retries on marker files fix: GI-18072 - Allow retries on marker files Feb 2, 2026
@PavithranRick PavithranRick changed the title fix: GI-18072 - Allow retries on marker files fix: GI-18072 - Allow retries on marker file creation requests Feb 2, 2026
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 2, 2026
// Idempotent retry: marker already created with same requestId
if (markerToRequestIdMap.containsKey(markerName)) {
String existingRequestId = markerToRequestIdMap.get(markerName);
String normalizedRequestId = requestId == null ? NULL_REQUEST_ID : requestId;

If the requestId is null we should just skip this check.


How can the incoming requestId be null?
When this release goes out, any marker request to the TLS should have a request ID set, right?

Comment on lines 236 to 238
continue;
}
future.setIsSuccessful(!exists);
if (conflictDetectionStrategy.isPresent()) {

Just simplify this to else if?


/**
* Syncs all markers maintained in the underlying files under the marker directory in the file system.
* When TLS recovers from crash, we have no request IDs so store sentinel value for each marker.

Since we cannot know what the requestID was, we can avoid backfilling this map with the NULL_REQUEST_ID and let the requests fail to be safe.


But to fail the request, we end up looking up markerToRequestIdMap, right? So, even to fail a marker request for an already existing marker, we need to populate entries in the map.


Right now we will allow the request to succeed if both are NULL_REQUEST_ID, which differs from the existing behavior, where we fail in this case. I think it is easier to reason about if we don't backfill this map.
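The trade-off debated in this thread can be sketched as follows. The names markerToRequestIdMap and NULL_REQUEST_ID come from the quoted diff; everything else here is hypothetical, meant only to show why backfilling the sentinel lets a null-requestId retry against a pre-crash marker succeed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the backfill-on-recovery policy discussed above.
public class BackfillDemo {
  static final String NULL_REQUEST_ID = "__NULL_REQUEST_ID__";
  final Map<String, String> markerToRequestIdMap = new HashMap<>();

  // Crash recovery with backfill: the original request IDs are unknown,
  // so every marker found on storage is mapped to the sentinel value.
  void backfill(Iterable<String> markersOnStorage) {
    for (String m : markersOnStorage) {
      markerToRequestIdMap.put(m, NULL_REQUEST_ID);
    }
  }

  // Creation check: succeeds when the marker is new, or when the stored
  // request ID matches the normalized incoming one.
  boolean create(String marker, String requestId) {
    String normalized = requestId == null ? NULL_REQUEST_ID : requestId;
    String existing = markerToRequestIdMap.putIfAbsent(marker, normalized);
    return existing == null || existing.equals(normalized);
  }
}
```

With backfilling, create("m", null) against a recovered marker matches the sentinel and succeeds, whereas the pre-change behavior rejected creation of an already existing marker; skipping the backfill preserves that fail-safe at the cost of failing all retries against pre-crash markers.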

@nsivabalan
Contributor

One minor correction in the PR description: we are fixing the marker retries for rollbacks, not the cleaner.
We do not have any markers for the cleaner at all.
It's just that rollback gets triggered (internally) when clean() is called.

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


4 participants