Bug: Rate limiter bypassed for Observe errors - Immediate requeue (0s) causes API throttling storms #592

@MarlonHeiber

Description

Summary

When a managed resource fails during the Observe phase (e.g., AWS API throttling errors), the reconcile request is requeued with rateLimiterName="" and when=0s, completely bypassing exponential backoff. This causes resources to retry every 1-3 seconds indefinitely, creating a retry storm that worsens API rate limiting.

Affected Versions

  • github.com/crossplane/upjet v1.5.0
  • github.com/crossplane/crossplane-runtime v1.17.0
  • sigs.k8s.io/controller-runtime v0.18.2
  • Provider: xpkg.upbound.io/upbound/provider-aws-rds:v1.21.0

Expected Behavior

When a resource fails during Observe (e.g., "Throttling: Rate exceeded"), it should use exponential backoff before retrying (see the sketch below the list):

  • 1s → 2s → 4s → 8s → 16s → 32s → 60s (max)
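
A minimal sketch of the schedule we'd expect, assuming a per-item exponential rate limiter with a 1s base and a 60s cap (the names and parameters here are illustrative, not upjet's actual configuration):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    // Assumed parameters: 1s base delay, doubling up to a 60s cap.
    rl := workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](1*time.Second, 60*time.Second)
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "my-db-instance"}}
    for i := 0; i < 8; i++ {
        // Consecutive failures of the same item: 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s.
        fmt.Println(rl.When(req))
    }
}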

Actual Behavior

Failed resources are requeued immediately with when=0s, causing:

  • Same resource retries every 1-3 seconds
  • Exponential growth in API calls
  • Worsening of rate limiting conditions
  • Eventually all resources in the provider hit throttling

Evidence from Logs

1. Error occurs during Observe

{
  "level": "debug",
  "ts": "2025-01-28T20:34:29Z",
  "msg": "Cannot observe external resource",
  "error": "cannot run refresh: refresh failed: Instance: i]Error: reading RDS DB Instance (edu-202601270991-stg-grm5p-9rt7g): operation error RDS: DescribeDBInstances, https response error StatusCode: 400, RequestID: xxx, api error Throttling: Rate exceeded"
}

2. Immediate requeue with NO rate limiter

{
  "level": "debug", 
  "ts": "2025-01-28T20:34:29Z",
  "msg": "Reconcile request has been requeued.",
  "rateLimiterName": "",
  "when": "0s"
}

3. Pattern repeats every 1-3 seconds for the SAME resource

2025-01-28T20:34:29Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:30Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:33Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:36Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"
2025-01-28T20:34:38Z "Reconcile request has been requeued." "rateLimiterName": "", "when": "0s"

Note: Every single requeue shows:

  • "rateLimiterName": "" (empty - no rate limiter applied)
  • "when": "0s" (zero delay)

Root Cause Analysis

Location: pkg/controller/handler/eventhandler.go lines 57-74

func (e *EventHandler) RequestReconcile(rateLimiterName string, name types.NamespacedName, failureLimit *int) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.queue == nil {
        return false
    }
    logger := e.logger.WithValues("name", name)
    item := reconcile.Request{
        NamespacedName: name,
    }
    var when time.Duration = 0  // <-- DEFAULT IS 0
    if rateLimiterName != NoRateLimiter {  // <-- NoRateLimiter = ""
        rateLimiter := e.rateLimiterMap[rateLimiterName]
        if rateLimiter == nil {
            rateLimiter = workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
            e.rateLimiterMap[rateLimiterName] = rateLimiter
        }
        // ... exponential backoff logic here
        when = rateLimiter.When(item)
    }
    e.queue.AddAfter(item, when)  // <-- ADDS WITH when=0 WHEN rateLimiterName=""
    logger.Debug("Reconcile request has been requeued.", "rateLimiterName", rateLimiterName, "when", when)
    return true
}

The Problem

  1. When rateLimiterName == "" (empty string), the condition rateLimiterName != NoRateLimiter is false
  2. The exponential backoff block is skipped entirely
  3. when remains at its default value of 0
  4. The request is added to the queue with zero delay (see the sketch below)
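
A standalone sketch of just that conditional (simplified, not the real EventHandler) makes the bypass easy to reproduce in isolation:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const NoRateLimiter = "" // mirrors handler.NoRateLimiter

// requeueDelay reproduces the conditional from RequestReconcile: the backoff
// branch only runs when a non-empty rate limiter name is passed in.
func requeueDelay(rateLimiterName string, rl workqueue.TypedRateLimiter[reconcile.Request], item reconcile.Request) time.Duration {
    var when time.Duration // zero value: 0s
    if rateLimiterName != NoRateLimiter {
        when = rl.When(item) // exponential backoff happens only on this branch
    }
    return when
}

func main() {
    rl := workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "my-db"}}
    fmt.Println(requeueDelay("some-limiter", rl, req)) // grows with each failure
    fmt.Println(requeueDelay(NoRateLimiter, rl, req))  // always 0s, no matter how often it has failed
}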

Where the empty rateLimiterName comes from

In pkg/controller/api.go, a named rate limiter (rateLimiterCallback) is only assigned when an asynchronous callback reports an error:

rateLimiter := handler.NoRateLimiter  // Default is ""
switch {
case err != nil:
    rateLimiter = rateLimiterCallback  // Only for async callback errors
default:
    ac.eventHandler.Forget(rateLimiterCallback, nn)
}

Observe errors that happen in the main reconcile loop (not async callbacks) don't have a rate limiter name assigned.

Impact

With 100+ RDS instances in our environment:

  1. One resource hits AWS throttling
  2. It retries immediately (0s delay)
  3. More resources start hitting throttling
  4. All resources retry immediately
  5. Exponential explosion of API calls
  6. All RDS operations in the account are blocked

Suggested Fix

When rateLimiterName == "", use a default rate limiter with exponential backoff instead of zero delay:

func (e *EventHandler) RequestReconcile(rateLimiterName string, name types.NamespacedName, failureLimit *int) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.queue == nil {
        return false
    }
    logger := e.logger.WithValues("name", name)
    item := reconcile.Request{
        NamespacedName: name,
    }
    // Always use a rate limiter for requeues: map the empty name to a shared
    // "default" limiter instead of skipping backoff entirely.
    effectiveRateLimiterName := rateLimiterName
    if effectiveRateLimiterName == NoRateLimiter {
        effectiveRateLimiterName = "default"
    }
    rateLimiter := e.rateLimiterMap[effectiveRateLimiterName]
    if rateLimiter == nil {
        rateLimiter = workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
        e.rateLimiterMap[effectiveRateLimiterName] = rateLimiter
    }
    // ... existing failureLimit / backoff handling, unchanged ...
    when := rateLimiter.When(item)
    e.queue.AddAfter(item, when)
    logger.Debug("Reconcile request has been requeued.", "rateLimiterName", effectiveRateLimiterName, "when", when)
    return true
}
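
As a quick sanity check of what the fix buys (a standalone sketch using only the workqueue API the handler already imports, not a test against the real EventHandler): repeated failures of the same request now get growing delays, and a successful Observe can reset them via Forget.

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    // The same default limiter the fix falls back to for rateLimiterName == "".
    rl := workqueue.DefaultTypedControllerRateLimiter[reconcile.Request]()
    req := reconcile.Request{NamespacedName: types.NamespacedName{Name: "example-db"}} // arbitrary name
    for i := 0; i < 5; i++ {
        // Per-item exponential backoff: delays roughly double on each consecutive failure.
        fmt.Println(rl.When(req))
    }
    rl.Forget(req) // after a successful Observe the per-item backoff resets
}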

Workaround Attempted (Ineffective)

We tried reducing --max-reconcile-rate from 10 → 5 → 3, but this doesn't help because:

  • The rate limiter bypass means resources retry faster than any max-reconcile-rate can limit
  • The problem is the zero delay between retries, not the concurrency

Environment

  • Kubernetes: EKS 1.29
  • Crossplane: 1.17.0
  • Provider: upbound/provider-aws-rds v1.21.0
  • Resources affected: ~150 RDS instances across 3 AWS accounts

Reproduction Steps

  1. Deploy provider-aws-rds with multiple RDS instances
  2. Trigger AWS API throttling (or simulate by blocking API calls)
  3. Observe logs showing "rateLimiterName": "", "when": "0s" for every retry
  4. Watch as retry frequency stays constant at 1-3 seconds instead of increasing exponentially

Related

This may also affect other providers using the same EventHandler pattern in upjet.
