Fix rotating secret with TTL=0 to use exponential backoff retry instead of 1 sec #2134

Closed
aswanidutt wants to merge 14 commits into main from aswanidutt/VAULT-38829-Vault-Agent-does-not-back-off
aswanidutt (Collaborator) commented Feb 11, 2026

Issue: When a rotating secret has a rotation_period but ttl=0, it should not be treated as a rotating secret. Instead, it should wait and retry with exponential backoff, up to the original LeaseDuration, using the non-renewable lease logic (80-95% threshold).

  • This prevents the bug where the sleep time would be 0 seconds (plus 1 second of cushion), which caused constant polling every second to rotate

Changes:

  • Added maxJitter constant (50%)
  • Applied jitter to exponential backoff for rotating secrets with ttl=0:
  • Exponential backoff: 2s, 4s, 8s, 16s, 32s, 64s...
  • Max backoff: 300 seconds (when VaultDefaultLeaseDuration is not configured)
  • Jitter applied after capping to prevent thundering herd
  • Updated test to validate jittered values are within 90-100% of expected

Fixes: VAULT-38829 (https://hashicorp.atlassian.net/browse/VAULT-38829)

Before this change: the agent checks for rotation every second.

After the code change: the agent checks with exponential backoff (2s, 4s, 8s, 16s, 32s, 64s, ...), capped at the VaultDefaultLeaseDuration set on consul-template.

aswanidutt and others added 8 commits February 11, 2026 15:49
- When a rotating secret has rotation_period but ttl=0, it should not be
  treated as a rotating secret
- Instead, it should fall back to using the original LeaseDuration with
  non-renewable lease logic (80-95% threshold)
- This prevents the bug where sleep time would be 0 seconds

Changes:
- vault_common.go: Only set rotatingSecret=true when ttl > 0
- vault_common_test.go: Add test case for rotating secret with ttl=0

Fixes: VAULT-38829

tvoran (Member) commented Feb 12, 2026

Can you separate the unrelated fixes into a separate PR? Like the race condition, CWE-338, etc.

@aswanidutt aswanidutt marked this pull request as ready for review February 13, 2026 18:19
@aswanidutt aswanidutt requested review from a team as code owners February 13, 2026 18:19
aswanidutt (Collaborator, Author):

Need to merge #2135 first, before this PR.

aswanidutt (Collaborator, Author):

> Can you separate the unrelated fixes into a separate PR? Like the race condition, CWE-338, etc.

Added #2135 for the unrelated fixes mentioned above.

@aswanidutt aswanidutt changed the title from "Fix rotating secret with TTL=0 to use lease duration" to "Fix rotating secret with TTL=0 to use exponential back retry instead of 1 sec" Feb 13, 2026
@aswanidutt aswanidutt changed the title from "Fix rotating secret with TTL=0 to use exponential back retry instead of 1 sec" to "Fix rotating secret with TTL=0 to use exponential backoff retry instead of 1 sec" Feb 13, 2026

tvoran (Member) left a comment

Just a couple questions. It's also a little strange to me that the TestFileQuery_Fetch/fires_changes test is failing on this PR, when it seems to have passed for recent commits in this repo.

Comment on lines +155 to +156
// FIX: When TTL is 0, implement exponential backoff with jitter
// Start at 5s, double each time, max is VaultDefaultLeaseDuration or 300s

tvoran (Member):

It would be good to mention something to the effect that Vault responds with a TTL of 0 when there's been an error rotating the secret on the Vault side, so this is effectively retrying an error condition.

aswanidutt (Collaborator, Author):

Added a relevant comment.

Comment on lines +157 to +165
backoff := 5 * (1 << uint(retryCount)) // 5s, 10s, 20s, 40s, 80s...
maxBackoff := int(VaultDefaultLeaseDuration.Seconds())
if maxBackoff <= 0 {
	maxBackoff = 300 // Default max of 300 seconds if not configured
}
if backoff > maxBackoff {
	backoff = maxBackoff
}
backoffDur := time.Duration(backoff) * time.Second

tvoran (Member):

Have you considered using the cenkalti/backoff library here? I'm not sure it'll fit this exact use case, but it's pretty widely used in Vault and it's already an indirect import in this project.

aswanidutt (Collaborator, Author):

There are multiple libraries in use for this same use case; I have changed this to cenkalti/backoff v4.

} else if ttlData == 0 {
	// FIX: When TTL is 0, implement exponential backoff with jitter
	// Start at 5s, double each time, max is VaultDefaultLeaseDuration or 300s
	backoff := 5 * (1 << uint(retryCount)) // 5s, 10s, 20s, 40s, 80s...

tvoran (Member):

How was it decided to start at 5s for the first retry? Should it retry quicker than that?

aswanidutt (Collaborator, Author):

Changed the series to start at 2s (2s, 4s, 8s, ...) instead of 5s, matching other references in the repo.

Comment on lines +149 to +153
if ttlData > 0 {
	log.Printf("[DEBUG] Found rotation_period and set lease duration to %d seconds", ttlData)
	// Add a second for cushion
	base = int(ttlData) + 1
	rotatingSecret = true

tvoran (Member):

Feels like the retryCount should be reset when this condition is hit? Since we're treating ttl=0 as an error to retry, and ttl > 0 as a successful fetch?

aswanidutt (Collaborator, Author):

Good catch, I made that change.
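
A minimal sketch of the reset behaviour discussed in this thread, assuming a hypothetical `retryState` type that tracks consecutive ttl=0 responses; the actual change in the PR is not shown here. A ttl above zero is treated as a successful fetch and clears the counter, while ttl=0 advances the backoff.

```go
package main

import "fmt"

// retryState is a hypothetical holder for the number of consecutive
// ttl=0 responses seen for one secret.
type retryState struct {
	count int
}

// nextSleepSeconds returns how long to sleep before the next fetch.
// ttl > 0: successful fetch, so reset the counter and sleep ttl plus a
// one-second cushion. Otherwise: back off 2s, 4s, 8s, ... up to maxBackoff.
func (s *retryState) nextSleepSeconds(ttl, maxBackoff int) int {
	if ttl > 0 {
		s.count = 0
		return ttl + 1
	}
	backoff := 2
	for i := 0; i < s.count && backoff < maxBackoff; i++ {
		backoff *= 2
	}
	if backoff > maxBackoff {
		backoff = maxBackoff
	}
	s.count++
	return backoff
}

func main() {
	s := &retryState{}
	// Three failed rotations, one success, then another failure.
	for _, ttl := range []int{0, 0, 0, 30, 0} {
		fmt.Printf("ttl=%d sleep=%ds\n", ttl, s.nextSleepSeconds(ttl, 300))
	}
}
```

Without the reset, a single failure long after a previous incident would resume at the old, already-large delay instead of starting again at 2s.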

@hashicorp hashicorp deleted a comment from ibm-mend-app bot Feb 16, 2026
@aswanidutt aswanidutt requested a review from tvoran February 16, 2026 17:33
	// Add a second for cushion
	base = int(ttlData) + 1
	rotatingSecret = true
} else if ttlData == 0 {

tvoran (Member):

So in looking at consul-template a bit more it looks like there's already a notion of retries with backoff that is set in the RetryConfig:

Retry *RetryConfig `mapstructure:"retry"`

I'm wondering if we've tried just returning an error from VaultReadQuery.Fetch() and VaultWriteQuery.Fetch() in the ttl=0 case. Would that trigger the existing retry mechanism?

aswanidutt (Collaborator, Author) commented Feb 19, 2026:

@tvoran thanks for that input, that's a lot easier than I thought. Here is the PR: #2136.
The only caveat is that retry.go has a default maximum of 12 retries and a max backoff of 1 minute. So I had to use cenkalti/backoff v4 in view.go to customize the retry backoff, with a max backoff of 5m and unlimited retries.

@aswanidutt aswanidutt closed this Feb 25, 2026
2 participants