Bug Report: Rate limiter bypassed for Observe errors - Immediate requeue (0s) causes API throttling storms
Summary
When a managed resource fails during the Observe phase (e.g., AWS API throttling errors), the reconcile request is requeued with rateLimiterName="" and when=0s, completely bypassing exponential backoff. This causes resources to retry every 1-3 seconds indefinitely, creating a retry storm that worsens API rate limiting.
Affected Versions
- github.com/crossplane/upjet v1.5.0
- github.com/crossplane/crossplane-runtime v1.17.0
- sigs.k8s.io/controller-runtime v0.18.2
- Provider: xpkg.upbound.io/upbound/provider-aws-rds:v1.21.0
Expected Behavior
When a resource fails during Observe (e.g., "Throttling: Rate exceeded"), it should use exponential backoff before retrying:
- 1s → 2s → 4s → 8s → 16s → 32s → 60s (max)
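For reference, a minimal runnable sketch of the kind of per-item exponential backoff described above, using client-go's typed workqueue rate limiters. This is illustrative only: the 1s base and 60s cap are chosen to match the sequence above and are not upjet's actual configuration.

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    // Per-item exponential backoff: each failure of the same request doubles the delay.
    rl := workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](1*time.Second, 60*time.Second)
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "example-db-instance"}}
    for i := 0; i < 8; i++ {
        fmt.Println(rl.When(req)) // 1s, 2s, 4s, 8s, 16s, 32s, 1m0s, 1m0s
    }
    // A successful reconcile would call rl.Forget(req) to reset the backoff.
}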
Actual Behavior
Failed resources are requeued immediately with when=0s, causing:
- Same resource retries every 1-3 seconds
- Exponential growth in API calls
- Worsening of rate limiting conditions
- Eventually all resources in the provider hit throttling
Evidence from Logs
1. Error occurs during Observe
{
"level": "debug",
"ts": "2025-01-28T20:34:29Z",
"msg": "Cannot observe external resource",
"error": "cannot run refresh: refresh failed: Instance: i]Error: reading RDS DB Instance (edu-202601270991-stg-grm5p-9rt7g): operation error RDS: DescribeDBInstances, https response error StatusCode: 400, RequestID: xxx, api error Throttling: Rate exceeded"
}
2. Immediate requeue with NO rate limiter
{
"level": "debug",
"ts": "2025-01-28T20:34:29Z",
"msg": "Reconcile request has been requeued.",
"rateLimiterName": "",
"when": "0s"
}
3. Pattern repeats every 1-3 seconds for the SAME resource
2025-01-28T20:34:29Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:30Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:33Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:36Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:38Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
Note: Every single requeue shows:
"rateLimiterName": ""(empty - no rate limiter applied)"when": "0s"(zero delay)
Root Cause Analysis
Location: pkg/controller/handler/eventhandler.go lines 57-74
func (e *EventHandler) RequestReconcile(rateLimiterName string, name types.NamespacedName, failureLimit *int) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.queue == nil {
        return false
    }
    logger := e.logger.WithValues("name", name)
    item := reconcile.Request{
        NamespacedName: name,
    }
    var when time.Duration = 0 // <-- DEFAULT IS 0
    if rateLimiterName != NoRateLimiter { // <-- NoRateLimiter = ""
        rateLimiter := e.rateLimiterMap[rateLimiterName]
        if rateLimiter == nil {
            rateLimiter = workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
            e.rateLimiterMap[rateLimiterName] = rateLimiter
        }
        // ... exponential backoff logic here
        when = rateLimiter.When(item)
    }
    e.queue.AddAfter(item, when) // <-- ADDS WITH when=0 WHEN rateLimiterName=""
    logger.Debug("Reconcile request has been requeued.", "rateLimiterName", rateLimiterName, "when", when)
    return true
}
The Problem
- When rateLimiterName == "" (empty string), the condition rateLimiterName != NoRateLimiter is false
- The exponential backoff block is skipped entirely
- when remains at its default value of 0
- The request is added to the queue with zero delay (see the minimal sketch below)
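The effect of this branch can be shown in isolation. The following is a stripped-down sketch, not upjet's code (string items stand in for reconcile.Request): with an empty rate limiter name the computed delay is always zero, while any named limiter backs off per item.

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

const NoRateLimiter = ""

// delayFor mirrors the branch above: only a non-empty rateLimiterName
// ever consults a rate limiter, so "" always yields a zero delay.
func delayFor(rateLimiterName string, limiters map[string]workqueue.TypedRateLimiter[string], item string) time.Duration {
    var when time.Duration // zero value: 0s
    if rateLimiterName != NoRateLimiter {
        rl := limiters[rateLimiterName]
        if rl == nil {
            rl = workqueue.DefaultTypedControllerRateLimiter[string]()
            limiters[rateLimiterName] = rl
        }
        when = rl.When(item)
    }
    return when
}

func main() {
    limiters := map[string]workqueue.TypedRateLimiter[string]{}
    for i := 0; i < 3; i++ {
        fmt.Println(delayFor(NoRateLimiter, limiters, "db-1")) // 0s, 0s, 0s: immediate requeue every time
    }
    for i := 0; i < 3; i++ {
        fmt.Println(delayFor("named", limiters, "db-1")) // 5ms, 10ms, 20ms: exponential backoff
    }
}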
Where the empty rateLimiterName comes from
In pkg/controller/api.go, the rateLimiterCallback is only set for async callback errors:
rateLimiter := handler.NoRateLimiter // Default is ""
switch {
case err != nil:
    rateLimiter = rateLimiterCallback // Only for async callback errors
default:
    ac.eventHandler.Forget(rateLimiterCallback, nn)
}
Observe errors that happen in the main reconcile loop (not async callbacks) don't have a rate limiter name assigned.
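For context on the Forget call in the default branch above: a named per-item rate limiter only grows its backoff for the items it is asked about, and Forget resets that state after a success. A small illustrative sketch of this lifecycle (not provider code) is below; Observe errors never participate in it because they arrive with no rate limiter name.

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    rl := workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "db-1"}}

    fmt.Println(rl.When(req)) // 5ms  - first failure
    fmt.Println(rl.When(req)) // 10ms - backoff grows per failure
    fmt.Println(rl.When(req)) // 20ms

    rl.Forget(req)            // what the default (success) branch above does
    fmt.Println(rl.When(req)) // 5ms  - backoff reset
}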
Impact
With 100+ RDS instances in our environment:
- One resource hits AWS throttling
- It retries immediately (0s delay)
- More resources start hitting throttling
- All resources retry immediately
- Exponential explosion of API calls
- All RDS operations in the account are blocked
Suggested Fix
When rateLimiterName == "", use a default rate limiter with exponential backoff instead of zero delay:
func (e *EventHandler) RequestReconcile(rateLimiterName string, name types.NamespacedName, failureLimit *int) bool {
    // ...
    var when time.Duration = 0
    // Always use a rate limiter for requeues
    effectiveRateLimiterName := rateLimiterName
    if effectiveRateLimiterName == NoRateLimiter {
        effectiveRateLimiterName = "default" // Use a default rate limiter
    }
    rateLimiter := e.rateLimiterMap[effectiveRateLimiterName]
    if rateLimiter == nil {
        rateLimiter = workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
        e.rateLimiterMap[effectiveRateLimiterName] = rateLimiter
    }
    // ... rest of backoff logic
}
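For clarity, here is a fuller sketch of what the proposed method could look like, reconstructed only from the snippets quoted in this report. It is an illustration of the suggestion, not upjet's actual code: the "default" key is a hypothetical name for the fallback limiter, and the logic elided above is left as a placeholder comment.

// Sketch of the suggested RequestReconcile; EventHandler fields (mu, queue,
// logger, rateLimiterMap) are assumed to be as in eventhandler.go above.
func (e *EventHandler) RequestReconcile(rateLimiterName string, name types.NamespacedName, failureLimit *int) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.queue == nil {
        return false
    }
    logger := e.logger.WithValues("name", name)
    item := reconcile.Request{NamespacedName: name}

    // Always resolve to some per-item rate limiter so requeues back off,
    // even when the caller did not name one.
    effectiveRateLimiterName := rateLimiterName
    if effectiveRateLimiterName == NoRateLimiter {
        effectiveRateLimiterName = "default" // hypothetical fallback key
    }
    rateLimiter := e.rateLimiterMap[effectiveRateLimiterName]
    if rateLimiter == nil {
        rateLimiter = workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
        e.rateLimiterMap[effectiveRateLimiterName] = rateLimiter
    }

    // (any failureLimit handling elided above would remain here)

    when := rateLimiter.When(item)
    e.queue.AddAfter(item, when)
    logger.Debug("Reconcile request has been requeued.", "rateLimiterName", effectiveRateLimiterName, "when", when)
    return true
}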
Workaround Attempted (Ineffective)
We tried reducing --max-reconcile-rate from 10 → 5 → 3, but this doesn't help because:
- The rate limiter bypass means resources retry faster than any max-reconcile-rate can limit
- The problem is the zero delay between retries, not the concurrency (see the sketch below)
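To illustrate the last point: a global token-bucket limiter (roughly what a reconcile-rate cap provides; the 3/s rate and burst of 10 below are illustrative assumptions) caps aggregate throughput but applies no per-item backoff, so a failing resource requeued with when=0s keeps coming back almost immediately.

package main

import (
    "fmt"

    "golang.org/x/time/rate"
    "k8s.io/client-go/util/workqueue"
)

func main() {
    // A global bucket limiter: 3 reconciles/second with a burst of 10.
    global := &workqueue.TypedBucketRateLimiter[string]{Limiter: rate.NewLimiter(rate.Limit(3), 10)}
    for i := 0; i < 12; i++ {
        // ~0s while the burst budget lasts, then small growing delays;
        // never the per-item exponential backoff a throttled resource needs.
        fmt.Println(global.When("db-instance-1"))
    }
}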
Environment
- Kubernetes: EKS 1.29
- Crossplane: 1.17.0
- Provider: upbound/provider-aws-rds v1.21.0
- Resources affected: ~150 RDS instances across 3 AWS accounts
Reproduction Steps
- Deploy provider-aws-rds with multiple RDS instances
- Trigger AWS API throttling (or simulate by blocking API calls)
- Observe logs showing "rateLimiterName": "", "when": "0s" for every retry
- Watch as the retry frequency stays constant at 1-3 seconds instead of increasing exponentially
Related
This may also affect other providers using the same EventHandler pattern in upjet.