# Implementing a Correct Retry Mechanism in TypeScript (With Backoff, Jitter & AbortSignal)

## Outline

### 1. Why Naive Retry Is Dangerous

- The instinct to "just retry" masks real failure modes
- Unbounded retries amplify load on failing services
- Silent retry loops hide bugs and delay incident detection

### 2. The Thundering Herd Problem

- What happens when thousands of clients retry simultaneously
- Fixed-delay retries create synchronized spikes
- Real-world examples: AWS outages, API rate limit cascades

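
To make the synchronized-spike claim concrete, here is a toy simulation. It assumes 1,000 clients that all fail at the same instant and then retry either after a fixed 1-second delay or after a uniformly random delay below 1 second. The client count, delay, bucket width, and the `peakLoad` helper are all illustrative assumptions, not measurements from a real incident.

```typescript
// Group retry timestamps into fixed-width buckets and return the size of the busiest bucket.
function peakLoad(retryTimesMs: number[], bucketMs: number): number {
  const counts = new Map<number, number>();
  for (const t of retryTimesMs) {
    const bucket = Math.floor(t / bucketMs);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return Math.max(...Array.from(counts.values()));
}

const CLIENTS = 1000;  // illustrative: clients that failed at t = 0
const DELAY_MS = 1000; // illustrative: retry delay
const BUCKET_MS = 100; // width of the window used to measure load

// Fixed delay: every client retries at exactly the same moment.
const fixedDelay = Array.from({ length: CLIENTS }, () => DELAY_MS);

// Random delay: each client picks a uniform delay in [0, DELAY_MS).
const randomDelay = Array.from({ length: CLIENTS }, () => Math.random() * DELAY_MS);

console.log("peak requests per 100 ms, fixed delay: ", peakLoad(fixedDelay, BUCKET_MS));
console.log("peak requests per 100 ms, random delay:", peakLoad(randomDelay, BUCKET_MS));
```

With the fixed delay, all 1,000 retries land in a single bucket. With the random delay, the busiest bucket should hold roughly a tenth of the clients, though the exact figure varies from run to run.
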
### 3. Why Jitter Matters

- Exponential backoff alone still creates clusters
- Full jitter vs. equal jitter vs. decorrelated jitter
- Mathematical intuition: spreading retry attempts across time
- Reference: AWS Architecture Blog on exponential backoff

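
The three strategies named above can be sketched as small pure functions. The formulas follow the ones described in the AWS Architecture Blog post. The function names, the zero-based `attempt` convention, and the injectable `rng` parameter (handy for tests) are assumptions made for this sketch.

```typescript
type Rng = () => number; // returns a value in [0, 1), like Math.random

// Exponential ceiling for a zero-based attempt number, capped at capMs.
function backoffCeiling(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Full jitter: uniform in [0, ceiling).
function fullJitter(attempt: number, baseMs: number, capMs: number, rng: Rng = Math.random): number {
  return rng() * backoffCeiling(attempt, baseMs, capMs);
}

// Equal jitter: half of the ceiling is fixed, the other half is random.
function equalJitter(attempt: number, baseMs: number, capMs: number, rng: Rng = Math.random): number {
  const half = backoffCeiling(attempt, baseMs, capMs) / 2;
  return half + rng() * half;
}

// Decorrelated jitter: uniform in [baseMs, 3 * previous delay), capped at capMs.
// Pass baseMs as previousDelayMs for the first retry.
function decorrelatedJitter(previousDelayMs: number, baseMs: number, capMs: number, rng: Rng = Math.random): number {
  const upper = Math.max(baseMs, previousDelayMs * 3);
  return Math.min(capMs, baseMs + rng() * (upper - baseMs));
}

// Example: delays for the first five retries with full jitter (values differ on every run).
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(attempt, Math.round(fullJitter(attempt, 100, 10000)));
}
```

Full jitter spreads retries over the whole interval at the cost of occasionally retrying almost immediately. Equal jitter guarantees a minimum wait of half the ceiling. Decorrelated jitter depends on the previous delay rather than on the attempt count.
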
### 4. Timeout: Global vs. Per-Attempt

- Per-attempt timeout: each call gets N seconds, but this bounds only a single call; total latency can still grow to N times the number of attempts, plus backoff delays
- Global timeout: total wall-clock budget across all attempts
- Why global timeout is the correct default for production systems
- Edge case: what if the last retry starts just before the global deadline?

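
One way to combine the two timeouts is to derive each attempt's timeout from whatever is left of the global budget. The sketch below assumes a hypothetical `TimeoutPolicy` shape and a `minAttemptMs` threshold, which is one possible answer to the edge case above: refuse to start an attempt that cannot realistically finish.

```typescript
interface TimeoutPolicy {
  globalTimeoutMs: number;     // total wall-clock budget across all attempts
  perAttemptTimeoutMs: number; // upper bound for any single attempt
  minAttemptMs: number;        // below this, starting another attempt is pointless
}

// Time left before the global deadline, never negative.
function remainingMs(deadlineMs: number, nowMs: number): number {
  return Math.max(0, deadlineMs - nowMs);
}

// Timeout to apply to the next attempt, or null when the budget is too small to bother.
function nextAttemptTimeout(policy: TimeoutPolicy, deadlineMs: number, nowMs: number): number | null {
  const remaining = remainingMs(deadlineMs, nowMs);
  if (remaining < policy.minAttemptMs) return null;
  return Math.min(policy.perAttemptTimeoutMs, remaining);
}

const policy: TimeoutPolicy = { globalTimeoutMs: 10000, perAttemptTimeoutMs: 4000, minAttemptMs: 500 };
const startMs = 0; // a real caller would use Date.now()
const deadlineMs = startMs + policy.globalTimeoutMs;

console.log(nextAttemptTimeout(policy, deadlineMs, 0));    // 4000: the per-attempt limit applies
console.log(nextAttemptTimeout(policy, deadlineMs, 7000)); // 3000: clamped to the remaining budget
console.log(nextAttemptTimeout(policy, deadlineMs, 9800)); // null: only 200 ms left, skip the attempt
```
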
### 5. AbortSignal Correctness

- Why cancellation is not optional in production async code
- Common mistakes: not cleaning up listeners, ignoring abort during sleep
- Correct pattern: wiring AbortSignal through retry loop and delay
- Interaction between AbortSignal and global timeout

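
The two mistakes listed above, leaked listeners and sleeps that ignore abort, both live in the delay helper. A minimal sketch of an abort-aware delay follows. The names `abortableDelay` and `abortReason` are made up for this example, and the code assumes a runtime where `AbortSignal` exposes `reason` (recent Node.js versions and current browsers).

```typescript
// Prefer the reason supplied by the caller; fall back to a generic AbortError.
function abortReason(signal?: AbortSignal): unknown {
  if (signal?.reason !== undefined) return signal.reason;
  const err = new Error("The operation was aborted");
  err.name = "AbortError";
  return err;
}

function abortableDelay(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // Already aborted: reject without scheduling a timer at all.
    if (signal?.aborted) {
      reject(abortReason(signal));
      return;
    }

    // Abort during the sleep: cancel the timer so nothing keeps running.
    const onAbort = () => {
      clearTimeout(timer);
      reject(abortReason(signal));
    };

    // Normal completion: remove the listener so it does not pile up on a long-lived signal.
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);

    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Usage: the 10-second sleep ends as soon as the controller aborts.
const demoController = new AbortController();
abortableDelay(10000, demoController.signal).catch(() => console.log("sleep cancelled"));
demoController.abort();
```

Every exit path releases exactly one resource: the timer on abort, the listener on completion. That pairing is what keeps timers from leaking.
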
### 6. Common Bugs in Retry Loops

- Retrying non-retryable errors (400, 401, 403, 404)
- Wrapping all errors in a generic "retry failed" error
- Off-by-one: `maxRetries` vs. total attempts
- Not clamping delay to remaining timeout budget
- Leaking timers on abort

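
Several of these bugs come down to one-line helpers that are easy to get subtly wrong. The sketch below shows one possible shape for each. All names are hypothetical, and the status policy (retry 429 and 5xx only) is an assumption that a real service may need to adjust.

```typescript
// Retry 429 and 5xx; other 4xx responses will fail the same way again.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// maxRetries counts retries only, so the first call is not included.
function totalAttempts(maxRetries: number): number {
  return maxRetries + 1;
}

// Never sleep past the remaining global budget (and never for a negative duration).
function clampDelay(delayMs: number, remainingBudgetMs: number): number {
  return Math.max(0, Math.min(delayMs, remainingBudgetMs));
}

// Keep the original failure attached instead of replacing it with a generic message.
class RetryExhaustedError extends Error {
  attempts: number;
  lastError: unknown;

  constructor(attempts: number, lastError: unknown) {
    super(`Gave up after ${attempts} attempt(s)`);
    this.name = "RetryExhaustedError";
    this.attempts = attempts;
    this.lastError = lastError;
  }
}

console.log([400, 401, 403, 404, 429, 500, 503].filter(isRetryableStatus)); // [429, 500, 503]
console.log(totalAttempts(3));     // 4
console.log(clampDelay(800, 250)); // 250
```
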
### 7. Introducing SmartRetry

- Design goals: correctness, predictability, zero dependencies
- Default retry policy: retry network errors, 429, and 5xx; stop on all other 4xx
- Full jitter by default
- Global timeout with proper delay clamping
- AbortSignal support with clean teardown
- ESM + CJS dual build, fully typed

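
To show how these design goals might fit together in a single loop, here is a compact sketch. It is not SmartRetry's actual API or implementation: `retrySketch`, `RetryOptions`, and `sleep` are hypothetical names invented for this illustration, and the sketch does not enforce a per-attempt timeout.

```typescript
interface RetryOptions {
  maxRetries: number;      // retries after the first attempt
  baseDelayMs: number;     // backoff base
  maxDelayMs: number;      // backoff cap
  globalTimeoutMs: number; // wall-clock budget for all attempts and sleeps
  signal?: AbortSignal;
  isRetryable: (error: unknown) => boolean;
}

// Abort-aware sleep that clears its timer and listener on every exit path.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) return reject(signal.reason ?? new Error("Aborted"));
    const onAbort = () => {
      clearTimeout(timer);
      reject(signal?.reason ?? new Error("Aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function retrySketch<T>(fn: (attempt: number) => Promise<T>, opts: RetryOptions): Promise<T> {
  const deadline = Date.now() + opts.globalTimeoutMs;
  for (let attempt = 0; ; attempt++) {
    if (opts.signal?.aborted) throw opts.signal.reason ?? new Error("Aborted");
    try {
      return await fn(attempt);
    } catch (error) {
      const remaining = deadline - Date.now();
      const outOfRetries = attempt >= opts.maxRetries;
      // Rethrow the original error untouched so callers can still inspect it.
      if (outOfRetries || remaining <= 0 || !opts.isRetryable(error)) throw error;
      // Full jitter, clamped to the remaining global budget.
      const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
      const delay = Math.min(Math.random() * ceiling, remaining);
      await sleep(delay, opts.signal);
    }
  }
}

// Usage: an operation that fails twice with a transient error, then succeeds.
let demoCalls = 0;
retrySketch(async () => {
  demoCalls++;
  if (demoCalls < 3) throw new Error("transient failure");
  return "ok";
}, { maxRetries: 5, baseDelayMs: 10, maxDelayMs: 100, globalTimeoutMs: 2000, isRetryable: () => true })
  .then((value) => console.log(value, "after", demoCalls, "attempts"));
```

Rethrowing the last error unchanged is a deliberate choice in this sketch. Wrapping it would also be reasonable, provided the original stays attached.
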
### 8. Example Usage

- Basic: wrapping a fetch call
- Advanced: custom retry predicate with structured logging
- Cancellation: AbortController with timeout fallback

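
For the cancellation example, one possible shape is a helper that aborts when either the caller's signal fires or a timeout elapses, and that can be disposed once the work is done. The `linkTimeout` and `fetchWithBudget` names are hypothetical. Newer runtimes also ship `AbortSignal.timeout()` and `AbortSignal.any()`, which cover similar ground; the manual version below avoids depending on them.

```typescript
interface LinkedSignal {
  signal: AbortSignal;
  dispose: () => void; // call when the work is finished to release the timer and listener
}

// Returns a signal that aborts on parent abort OR after timeoutMs, whichever comes first.
function linkTimeout(timeoutMs: number, parent?: AbortSignal): LinkedSignal {
  const controller = new AbortController();
  const onParentAbort = () => controller.abort();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  if (parent?.aborted) {
    // The caller cancelled before the work started: no timer needed.
    clearTimeout(timer);
    controller.abort();
  } else {
    parent?.addEventListener("abort", onParentAbort, { once: true });
  }

  return {
    signal: controller.signal,
    dispose: () => {
      clearTimeout(timer);
      parent?.removeEventListener("abort", onParentAbort);
    },
  };
}

// Usage sketch: a fetch that stops on user cancellation or after timeoutMs.
async function fetchWithBudget(url: string, timeoutMs: number, parent?: AbortSignal): Promise<Response> {
  const linked = linkTimeout(timeoutMs, parent);
  try {
    return await fetch(url, { signal: linked.signal });
  } finally {
    linked.dispose(); // always release the timer and the listener
  }
}
```

The `finally` block is what keeps timers from leaking. Without it, every successful request would leave a pending timeout behind until it fired.
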
### 9. Benchmarks (Future)

- Retry storm simulation: SmartRetry vs. naive loops
- Jitter distribution visualization
- Memory and timer cleanup validation

---

**Target platforms:** Dev.to, Hashnode, Medium, personal blog

**Estimated length:** 2,500–3,500 words

**Goal:** Establish authority on retry correctness; drive organic traffic to the npm package.