Skip to content

Commit 214c90b

Browse files
Replace async-disabling mechanism with retry backoff on refresh failure (#696)
## Summary Replace the cache's staleness tracking with a cached `staleAfter` timestamp and preserve the legacy `staleDuration` builder path. Async refresh failures now use a short retry backoff instead of suppressing background refresh until the token expires. ## Why `CachedTokenSource` previously mixed relative stale-window calculations with token expiry checks, which made the cache state harder to reason about and made it easier to regress callers that explicitly configure `staleDuration` through the builder. At the same time, a failed async refresh effectively disabled further async refreshes until a blocking refresh happened on expiry, which meant a brief transient failure could prevent the SDK from recovering proactively for the rest of the token lifetime. This PR moves the cache to an absolute `staleAfter` threshold that is computed whenever a token is stored. That makes the state model much clearer: callers now classify tokens by comparing the current time against `staleAfter` and the expiry buffer. It also preserves the legacy fixed-window behavior for callers that set `staleDuration`, and replaces the old async-failure suppression behavior with a one-minute retry backoff so transient failures can recover without waiting for full expiry. ## What changed ### Interface changes None. ### Behavioral changes Calls that set `staleDuration` through `CachedTokenSource.Builder#setStaleDuration(...)` continue to get the legacy fixed-window behavior. The cache now recomputes `staleAfter` from that configured duration each time a token is stored, so both the initial cached token and later refreshed tokens honor the same caller-provided setting. Failed async refreshes no longer block future async refresh attempts until a blocking refresh on expiry. Instead, the cache moves `staleAfter` one minute into the future, treats the token as fresh during that cooldown, and retries async refresh the next time the token becomes stale again. Older async refresh results are also ignored when the cache already holds a newer token. This prevents a late async refresh from overwriting a token with a later expiry that was installed by another refresh path. ### Internal changes `CachedTokenSource` now stores an absolute `staleAfter` instant rather than reasoning about staleness from relative durations at read time. When callers do not provide `staleDuration`, the stale threshold is derived from the token's remaining TTL at the moment the token is stored and capped at 20 minutes. This centralizes stale-threshold computation in `updateToken()` and reduces token-state checks to direct time comparisons. The implementation adds `_ASYNC_REFRESH_RETRY_BACKOFF`-equivalent behavior for Java via `ASYNC_REFRESH_RETRY_BACKOFF`, introduces `handleFailedAsyncRefresh()` to apply the retry cooldown, and adds `cachedTokenIsNewer()` to discard async refresh results that would otherwise roll the cache back to an older token. `CachedTokenSourceTest` was updated to replace the old async-failure fallback coverage with parameterized `staleAfter` initialization coverage and explicit retry-backoff tests. ## How is this tested? Ran the focused unit test suite for `CachedTokenSource` and the module formatting check locally: - `mvn --errors -pl databricks-sdk-java -Dtest=CachedTokenSourceTest test` - `mvn --errors -pl databricks-sdk-java spotless:check` Test coverage now includes: - Existing async refresh state coverage in `testAsyncRefreshParametrized` - `testStaleAfterComputationParametrized` for `staleAfter` initialization across legacy builder-provided `staleDuration`, default computed thresholds, null initial tokens, and expired tokens - `testGetTokenDoesNotRetryBeforeAsyncBackoffElapses` to verify repeated reads during the cooldown do not trigger additional refreshes - `testGetTokenRetriesAfterAsyncBackoffElapsesAndUpdatesToken` to verify a read after the cooldown elapses starts a new async refresh and updates the cached token --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f518c27 commit 214c90b

File tree

3 files changed

+391
-156
lines changed

3 files changed

+391
-156
lines changed

NEXT_CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,5 +11,6 @@
1111
### Documentation
1212

1313
### Internal Changes
14+
* Add retry with backoff to `CachedTokenSource` async refresh so that a failed background refresh no longer disables async until a blocking call succeeds.
1415

1516
### API Changes

databricks-sdk-java/src/main/java/com/databricks/sdk/core/oauth/CachedTokenSource.java

Lines changed: 109 additions & 70 deletions
Original file line numberDiff line numberDiff line change
@@ -38,21 +38,23 @@ private enum TokenState {
3838
// monthly downtime allowed by a 99.99% uptime SLA (~4.38 minutes) while increasing the likelihood
3939
// that the token is refreshed asynchronously if the auth server is down.
4040
private static final Duration MAX_STALE_DURATION = Duration.ofMinutes(20);
41+
// Delay before another async refresh may be attempted after an async refresh failure.
42+
private static final Duration ASYNC_REFRESH_RETRY_BACKOFF = Duration.ofMinutes(1);
4143
// Default additional buffer before expiry to consider a token as expired.
4244
// This is 40 seconds by default since Azure Databricks rejects tokens that are within 30 seconds
4345
// of expiry.
4446
private static final Duration DEFAULT_EXPIRY_BUFFER = Duration.ofSeconds(40);
4547

46-
// The token source to use for refreshing the token.
48+
// Underlying token source used to fetch replacement tokens.
4749
private final TokenSource tokenSource;
4850
// Whether asynchronous refresh is enabled.
4951
private boolean asyncDisabled = false;
5052
// The legacy duration before expiry to consider a token as 'stale'.
5153
private final Duration staticStaleDuration;
5254
// Whether to use the dynamic stale duration computation or defer to the legacy duration.
53-
private final boolean useDynamicStaleDuration;
54-
// The dynamically computed duration before expiry to consider a token as 'stale'.
55-
private volatile Duration dynamicStaleDuration;
55+
private final boolean useLegacyStaleDuration;
56+
// The earliest time at which the cached token should be considered stale.
57+
private volatile Instant staleAfter;
5658
// Additional buffer before expiry to consider a token as expired.
5759
private final Duration expiryBuffer;
5860
// Clock supplier for current time.
@@ -62,23 +64,16 @@ private enum TokenState {
6264
protected volatile Token token;
6365
// Whether a refresh is currently in progress (for async refresh).
6466
private boolean refreshInProgress = false;
65-
// Whether the last refresh attempt succeeded.
66-
private boolean lastRefreshSucceeded = true;
6767

6868
private CachedTokenSource(Builder builder) {
6969
this.tokenSource = builder.tokenSource;
7070
this.asyncDisabled = builder.asyncDisabled;
7171
this.staticStaleDuration = builder.staleDuration;
72-
this.useDynamicStaleDuration = builder.useDynamicStaleDuration;
72+
this.useLegacyStaleDuration = builder.useLegacyStaleDuration;
7373
this.expiryBuffer = builder.expiryBuffer;
7474
this.clockSupplier = builder.clockSupplier;
75-
this.token = builder.token;
7675

77-
if (this.useDynamicStaleDuration && this.token != null) {
78-
this.dynamicStaleDuration = computeStaleDuration(this.token);
79-
} else {
80-
this.dynamicStaleDuration = Duration.ofMinutes(0);
81-
}
76+
this.updateToken(builder.token);
8277
}
8378

8479
/**
@@ -91,7 +86,7 @@ public static class Builder {
9186
private final TokenSource tokenSource;
9287
private boolean asyncDisabled = false;
9388
private Duration staleDuration = DEFAULT_STALE_DURATION;
94-
private boolean useDynamicStaleDuration = true;
89+
private boolean useLegacyStaleDuration = false;
9590
private Duration expiryBuffer = DEFAULT_EXPIRY_BUFFER;
9691
private ClockSupplier clockSupplier = new UtcClockSupplier();
9792
private Token token;
@@ -139,15 +134,18 @@ public Builder setAsyncDisabled(boolean asyncDisabled) {
139134
* Sets the duration before token expiry at which the token is considered stale.
140135
*
141136
* <p>When asynchronous refresh is enabled, tokens that are stale but not yet expired will
142-
* trigger a background refresh while continuing to serve the current token.
137+
* trigger a background refresh while continuing to serve the current token. Calling this method
138+
* opts into the legacy fixed stale-duration behavior instead of the default dynamic stale
139+
* computation, preserving backward compatibility for callers that already provide a custom
140+
* stale duration.
143141
*
144142
* @param staleDuration The duration before expiry to consider a token stale. Must be greater
145143
* than the expiry buffer duration.
146144
* @return This builder instance for method chaining.
147145
*/
148146
public Builder setStaleDuration(Duration staleDuration) {
149147
this.staleDuration = staleDuration;
150-
this.useDynamicStaleDuration = false;
148+
this.useLegacyStaleDuration = true;
151149
return this;
152150
}
153151

@@ -190,6 +188,71 @@ public CachedTokenSource build() {
190188
}
191189
}
192190

191+
/**
192+
* Replaces the cached token and recomputes the time after which it should be treated as stale.
193+
*
194+
* <p>Legacy mode uses the configured fixed stale duration. Dynamic mode derives the stale window
195+
* from the token's remaining TTL and caps it at {@link #MAX_STALE_DURATION}. The stale threshold
196+
* is written before the volatile token write so readers that observe the new token also observe
197+
* the matching {@code staleAfter} value.
198+
*
199+
* @param t The token to cache. May be null.
200+
*/
201+
private void updateToken(Token t) {
202+
if (t == null || t.getExpiry() == null) {
203+
this.staleAfter = null;
204+
this.token = t;
205+
return;
206+
}
207+
208+
if (this.useLegacyStaleDuration) {
209+
this.staleAfter = t.getExpiry().minus(staticStaleDuration);
210+
} else {
211+
Duration ttl = Duration.between(Instant.now(clockSupplier.getClock()), t.getExpiry());
212+
Duration staleDuration = ttl.dividedBy(2);
213+
if (staleDuration.compareTo(MAX_STALE_DURATION) > 0) {
214+
staleDuration = MAX_STALE_DURATION;
215+
}
216+
if (staleDuration.compareTo(Duration.ZERO) <= 0) {
217+
staleDuration = Duration.ZERO;
218+
}
219+
220+
this.staleAfter = t.getExpiry().minus(staleDuration);
221+
}
222+
223+
// Publish the token after staleAfter so readers that observe the new token also observe the
224+
// stale threshold computed for that token. Note: handleFailedAsyncRefresh writes staleAfter
225+
// without a subsequent volatile token write, so a concurrent reader may briefly see a stale
226+
// staleAfter value; the only consequence is one extra async trigger, which is harmless.
227+
this.token = t;
228+
}
229+
230+
/**
231+
* Delays the next async refresh attempt after an async refresh failure.
232+
*
233+
* <p>The cached token remains usable until it becomes expired. Moving {@code staleAfter} into the
234+
* future prevents callers from immediately retrying async refresh on every stale read while the
235+
* auth service is unhealthy.
236+
*/
237+
private void handleFailedAsyncRefresh() {
238+
if (this.staleAfter != null) {
239+
Instant now = Instant.now(clockSupplier.getClock());
240+
this.staleAfter = now.plus(ASYNC_REFRESH_RETRY_BACKOFF);
241+
}
242+
}
243+
244+
/**
245+
* Returns {@code true} when the currently cached token has a later expiry than {@code candidate},
246+
* meaning the candidate should be discarded. This prevents an async refresh that was started
247+
* before a blocking refresh from overwriting the newer token obtained by the blocking path.
248+
*/
249+
private boolean cachedTokenIsNewer(Token candidate) {
250+
return token != null
251+
&& token.getExpiry() != null
252+
&& candidate.getExpiry() != null
253+
&& token.getExpiry().isAfter(candidate.getExpiry());
254+
}
255+
193256
/**
194257
* Gets the current token, refreshing if necessary. If async refresh is enabled, may return a
195258
* stale token while a refresh is in progress.
@@ -206,21 +269,6 @@ public Token getToken() {
206269
return getTokenAsync();
207270
}
208271

209-
private Duration computeStaleDuration(Token t) {
210-
if (t.getExpiry() == null) {
211-
return Duration.ZERO; // Tokens with no expiry are considered permanent.
212-
}
213-
214-
Duration ttl = Duration.between(Instant.now(clockSupplier.getClock()), t.getExpiry());
215-
216-
if (ttl.compareTo(Duration.ZERO) <= 0) {
217-
return Duration.ZERO;
218-
}
219-
220-
Duration halfTtl = ttl.dividedBy(2);
221-
return halfTtl.compareTo(MAX_STALE_DURATION) > 0 ? MAX_STALE_DURATION : halfTtl;
222-
}
223-
224272
/**
225273
* Determine the state of the current token (fresh, stale, or expired).
226274
*
@@ -234,12 +282,11 @@ protected TokenState getTokenState(Token t) {
234282
return TokenState.FRESH; // Tokens with no expiry are considered permanent.
235283
}
236284

237-
Duration lifeTime = Duration.between(Instant.now(clockSupplier.getClock()), t.getExpiry());
238-
if (lifeTime.compareTo(expiryBuffer) <= 0) {
285+
Instant now = Instant.now(clockSupplier.getClock());
286+
if (now.isAfter(t.getExpiry().minus(expiryBuffer))) {
239287
return TokenState.EXPIRED;
240288
}
241-
Duration staleDuration = useDynamicStaleDuration ? dynamicStaleDuration : staticStaleDuration;
242-
if (lifeTime.compareTo(staleDuration) <= 0) {
289+
if (now.isAfter(staleAfter)) {
243290
return TokenState.STALE;
244291
}
245292
return TokenState.FRESH;
@@ -265,23 +312,15 @@ protected Token getTokenBlocking() {
265312
if (getTokenState(token) != TokenState.EXPIRED) {
266313
return token;
267314
}
268-
lastRefreshSucceeded = false;
269315
Token newToken;
270316
try {
271317
newToken = tokenSource.getToken();
272318
} catch (Exception e) {
273319
logger.error("Failed to refresh token synchronously", e);
274320
throw e;
275321
}
276-
lastRefreshSucceeded = true;
277322

278-
// Write dynamicStaleDuration before publishing the new token via the volatile write,
279-
// so unsynchronized readers that see the new token are guaranteed to also see the
280-
// updated dynamicStaleDuration.
281-
if (useDynamicStaleDuration && newToken != null) {
282-
dynamicStaleDuration = computeStaleDuration(newToken);
283-
}
284-
token = newToken;
323+
updateToken(newToken);
285324
return token;
286325
}
287326
}
@@ -316,33 +355,33 @@ protected Token getTokenAsync() {
316355
* succeeded.
317356
*/
318357
private synchronized void triggerAsyncRefresh() {
319-
// Check token state again inside the synchronized block to avoid triggering a refresh if
320-
// another thread updated the token in the meantime.
321-
if (!refreshInProgress && lastRefreshSucceeded && getTokenState(token) != TokenState.FRESH) {
322-
refreshInProgress = true;
323-
CompletableFuture.runAsync(
324-
() -> {
325-
try {
326-
// Attempt to refresh the token in the background.
327-
Token newToken = tokenSource.getToken();
328-
synchronized (this) {
329-
// Write dynamicStaleDuration before publishing the new token via the volatile
330-
// write, so unsynchronized readers that see the new token are guaranteed to also
331-
// see the updated dynamicStaleDuration.
332-
if (useDynamicStaleDuration && newToken != null) {
333-
dynamicStaleDuration = computeStaleDuration(newToken);
334-
}
335-
token = newToken;
336-
refreshInProgress = false;
337-
}
338-
} catch (Exception e) {
339-
synchronized (this) {
340-
lastRefreshSucceeded = false;
341-
refreshInProgress = false;
342-
logger.error("Asynchronous token refresh failed", e);
358+
// Re-check inside the synchronized block: another thread may have updated the token.
359+
// Only STALE triggers async refresh; EXPIRED tokens are handled by getTokenBlocking, so
360+
// an async attempt is unnecessary and would race with the blocking path.
361+
if (refreshInProgress || getTokenState(token) != TokenState.STALE) {
362+
return;
363+
}
364+
365+
refreshInProgress = true;
366+
CompletableFuture.runAsync(
367+
() -> {
368+
try {
369+
Token newToken = tokenSource.getToken();
370+
synchronized (this) {
371+
if (newToken != null && !cachedTokenIsNewer(newToken)) {
372+
updateToken(newToken);
343373
}
344374
}
345-
});
346-
}
375+
} catch (Exception e) {
376+
synchronized (this) {
377+
handleFailedAsyncRefresh();
378+
logger.error("Asynchronous token refresh failed", e);
379+
}
380+
} finally {
381+
synchronized (this) {
382+
refreshInProgress = false;
383+
}
384+
}
385+
});
347386
}
348387
}

0 commit comments

Comments
 (0)