Bug: Rate limiter bypassed for Observe errors - Immediate requeue (0s) causes API throttling storms #592

@MarlonHeiber

Description

Summary

When a managed resource fails during the Observe phase (e.g., AWS API throttling errors), the reconcile request is requeued with rateLimiterName="" and when=0s, completely bypassing exponential backoff. This causes resources to retry every 1-3 seconds indefinitely, creating a retry storm that worsens API rate limiting.

Affected Versions

  • github.com/crossplane/upjet v1.5.0
  • github.com/crossplane/crossplane-runtime v1.17.0
  • sigs.k8s.io/controller-runtime v0.18.2
  • Provider: xpkg.upbound.io/upbound/provider-aws-rds:v1.21.0

Expected Behavior

When a resource fails during Observe (e.g., "Throttling: Rate exceeded"), it should use exponential backoff before retrying (see the sketch below the list):

  • 1s → 2s → 4s → 8s → 16s → 32s → 60s (max)
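
A minimal sketch of the schedule we'd expect, assuming a per-item exponential rate limiter with a 1s base and a 60s cap (the names and parameters here are illustrative, not upjet's actual configuration):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    // Assumed parameters: 1s base delay, doubling up to a 60s cap.
    rl := workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](1*time.Second, 60*time.Second)
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "my-db-instance"}}
    for i := 0; i < 8; i++ {
        // Consecutive failures of the same item: 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s.
        fmt.Println(rl.When(req))
    }
}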

Actual Behavior

Failed resources are requeued immediately with when=0s, causing:

  • Same resource retries every 1-3 seconds
  • Exponential growth in API calls
  • Worsening of rate limiting conditions
  • Eventually all resources in the provider hit throttling

Evidence from Logs

1. Error occurs during Observe

{
  "level": "debug",
  "ts": "2025-01-28T20:34:29Z",
  "msg": "Cannot observe external resource",
  "error": "cannot run refresh: refresh failed: Instance: i]Error: reading RDS DB Instance (edu-202601270991-stg-grm5p-9rt7g): operation error RDS: DescribeDBInstances, https response error StatusCode: 400, RequestID: xxx, api error Throttling: Rate exceeded"
}

2. Immediate requeue with NO rate limiter

{
  "level": "debug", 
  "ts": "2025-01-28T20:34:29Z",
  "msg": "Reconcile request has been requeued.",
  "rateLimiterName": "",
  "when": "0s"
}

3. Pattern repeats every 1-3 seconds for the SAME resource

2025-01-28T20:34:29Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:30Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:33Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:36Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:38Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"

Note: Every single requeue shows:

  • "rateLimiterName": "" (empty - no rate limiter applied)
  • "when": "0s" (zero delay)

Root Cause Analysis

Location: pkg/controller/handler/eventhandler.go lines 57-74

func (e *EventHandler) RequestReconcile(rateLimiterName string, name types.NamespacedName, failureLimit *int) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.queue == nil {
        return false
    }
    logger := e.logger.WithValues("name", name)
    item := reconcile.Request{
        NamespacedName: name,
    }
    var when time.Duration = 0  // <-- DEFAULT IS 0
    if rateLimiterName != NoRateLimiter {  // <-- NoRateLimiter = ""
        rateLimiter := e.rateLimiterMap[rateLimiterName]
        if rateLimiter == nil {
            rateLimiter = workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
            e.rateLimiterMap[rateLimiterName] = rateLimiter
        }
        // ... exponential backoff logic here
        when = rateLimiter.When(item)
    }
    e.queue.AddAfter(item, when)  // <-- ADDS WITH when=0 WHEN rateLimiterName=""
    logger.Debug("Reconcile request has been requeued.", "rateLimiterName", rateLimiterName, "when", when)
    return true
}

The Problem

  1. When rateLimiterName == "" (empty string), the condition rateLimiterName != NoRateLimiter is false
  2. The exponential backoff block is skipped entirely
  3. when remains at its default value of 0
  4. The request is added to the queue with zero delay (see the sketch below)
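
A standalone sketch of just that conditional (simplified, not the real EventHandler) makes the bypass easy to reproduce in isolation:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const NoRateLimiter = "" // mirrors handler.NoRateLimiter

// requeueDelay reproduces the conditional from RequestReconcile: the backoff
// branch only runs when a non-empty rate limiter name is passed in.
func requeueDelay(rateLimiterName string, rl workqueue.TypedRateLimiter[reconcile.Request], item reconcile.Request) time.Duration {
    var when time.Duration // zero value: 0s
    if rateLimiterName != NoRateLimiter {
        when = rl.When(item) // exponential backoff happens only on this branch
    }
    return when
}

func main() {
    rl := workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "my-db"}}
    fmt.Println(requeueDelay("some-limiter", rl, req)) // grows with each failure
    fmt.Println(requeueDelay(NoRateLimiter, rl, req))  // always 0s, no matter how often it has failed
}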

Where the empty rateLimiterName comes from

In pkg/controller/api.go, a named rate limiter (rateLimiterCallback) is only assigned when an asynchronous callback reports an error:

rateLimiter := handler.NoRateLimiter  // Default is ""
switch {
case err != nil:
    rateLimiter = rateLimiterCallback  // Only for async callback errors
default:
    ac.eventHandler.Forget(rateLimiterCallback, nn)
}

Observe errors that happen in the main reconcile loop (not async callbacks) don't have a rate limiter name assigned.

Impact

With 100+ RDS instances in our environment:

  1. One resource hits AWS throttling
  2. It retries immediately (0s delay)
  3. More resources start hitting throttling
  4. All resources retry immediately
  5. Exponential explosion of API calls
  6. All RDS operations in the account are blocked

Suggested Fix

When rateLimiterName == "", use a default rate limiter with exponential backoff instead of zero delay:

func (e *EventHandler) RequestReconcile(rateLimiterName string, name types.NamespacedName, failureLimit *int) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.queue == nil {
        return false
    }
    logger := e.logger.WithValues("name", name)
    item := reconcile.Request{
        NamespacedName: name,
    }
    // Always use a rate limiter for requeues: map the empty name to a shared
    // "default" limiter instead of skipping backoff entirely.
    effectiveRateLimiterName := rateLimiterName
    if effectiveRateLimiterName == NoRateLimiter {
        effectiveRateLimiterName = "default"
    }
    rateLimiter := e.rateLimiterMap[effectiveRateLimiterName]
    if rateLimiter == nil {
        rateLimiter = workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
        e.rateLimiterMap[effectiveRateLimiterName] = rateLimiter
    }
    // ... existing failureLimit / backoff handling, unchanged ...
    when := rateLimiter.When(item)
    e.queue.AddAfter(item, when)
    logger.Debug("Reconcile request has been requeued.", "rateLimiterName", effectiveRateLimiterName, "when", when)
    return true
}
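
As a quick sanity check of what the fix buys (a standalone sketch using only the workqueue API the handler already imports, not a test against the real EventHandler): repeated failures of the same request now get growing delays, and a successful Observe can reset them via Forget.

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    // The same default limiter the fix falls back to for rateLimiterName == "".
    rl := workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "example-db"}} // arbitrary name
    for i := 0; i < 5; i++ {
        // Per-item exponential backoff: delays roughly double on each consecutive failure.
        fmt.Println(rl.When(req))
    }
    rl.Forget(req) // after a successful Observe the per-item backoff resets
}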

Workaround Attempted (Ineffective)

We tried reducing --max-reconcile-rate from 10 → 5 → 3, but this doesn't help because:

  • The rate limiter bypass means resources retry faster than any max-reconcile-rate can limit
  • The problem is the zero delay between retries, not the concurrency

Environment

  • Kubernetes: EKS 1.29
  • Crossplane: 1.17.0
  • Provider: upbound/provider-aws-rds v1.21.0
  • Resources affected: ~150 RDS instances across 3 AWS accounts

Reproduction Steps

  1. Deploy provider-aws-rds with multiple RDS instances
  2. Trigger AWS API throttling (or simulate by blocking API calls)
  3. Observe logs showing "rateLimiterName": "", "when": "0s" for every retry
  4. Watch as retry frequency stays constant at 1-3 seconds instead of increasing exponentially

Related

This may also affect other providers using the same EventHandler pattern in upjet.
