[FEATURE] Per-Shard Resilience (Circuit Breakers + Retry) #643

@dlrivada

Summary

Add per-shard circuit breakers, retry policies, and bulkhead isolation using Polly. Health monitoring (IShardedDatabaseHealthMonitor) currently exists but doesn't influence routing decisions. This enhancement adds proactive shard isolation.

Motivation

If shard-3 becomes slow or unhealthy, requests to shard-1 and shard-2 should continue unaffected. Currently, scatter-gather handles partial failures reactively, but there's no proactive isolation. Enterprise systems need:

  • Independent circuit breakers per shard
  • Health-aware routing that excludes unhealthy shards
  • Independent backoff per shard
  • Bulkhead isolation (thread/semaphore pools per shard)

Industry comparison:

  • Netflix Hystrix: Per-dependency circuit breakers with bulkhead isolation — the pattern that inspired Polly
  • Vitess: Per-tablet (shard) health tracking with TabletHealthCheck — unhealthy tablets removed from routing
  • CockroachDB: Per-range (shard) leaseholder tracking — automatic rerouting on leaseholder failure
  • Citus: Worker node health monitoring — coordinator excludes unhealthy workers from distributed queries
  • Encina (current): IShardedDatabaseHealthMonitor exists but is read-only — doesn't influence routing decisions

Problem with current approach: Health monitoring is observational only. When a shard degrades, requests continue to be routed to it until it fully fails. There's no circuit breaker to fast-fail, no bulkhead to prevent thread pool exhaustion, and no health-aware routing to bypass degraded shards.

Proposed Solution

```csharp
// Per-shard resilience configuration; every shard gets its own instance of each strategy
services.AddEncinaSharding<Order>(options => { ... })
    .WithPerShardResilience(resilience => {
        resilience.CircuitBreaker(cb => {
            cb.FailureThreshold = 5;
            cb.DurationOfBreak = TimeSpan.FromSeconds(30);
            cb.SamplingDuration = TimeSpan.FromSeconds(60);
        });
        resilience.Retry(retry => {
            retry.MaxRetries = 3;
            retry.BackoffType = BackoffType.Exponential;
            retry.UseJitter = true;
        });
        resilience.Bulkhead(bulkhead => {
            bulkhead.MaxConcurrency = 10;   // concurrent requests allowed per shard
            bulkhead.MaxQueueSize = 20;     // requests allowed to wait for a slot
        });
        resilience.Timeout(timeout => {
            timeout.Timeout = TimeSpan.FromSeconds(5);
        });
    });
```
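For orientation, the options above could translate to a Polly pipeline roughly as follows. This is a sketch under assumptions, not the proposed implementation: it targets Polly 8 (`Polly.Core` plus the `Polly.RateLimiting` package), `ShardPipelineFactory` is a hypothetical name, and because Polly 8's circuit breaker is ratio-based rather than count-based, mapping `FailureThreshold = 5` to `MinimumThroughput = 5` with `FailureRatio = 0.5` is only one possible interpretation.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Hypothetical factory; not part of Encina or Polly.
public static class ShardPipelineFactory
{
    // Builds a fresh pipeline. Calling this once per shard gives each shard
    // its own circuit state, retry budget, and concurrency limit.
    public static ResiliencePipeline Create()
    {
        return new ResiliencePipelineBuilder()
            // Strategies run in the order added (first added is outermost).
            // "Bulkhead": 10 concurrent executions, 20 queued (Polly.RateLimiting).
            .AddConcurrencyLimiter(10, 20)
            .AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = 3,
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
            })
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                // Assumed mapping of FailureThreshold = 5 (see lead-in).
                FailureRatio = 0.5,
                MinimumThroughput = 5,
                SamplingDuration = TimeSpan.FromSeconds(60),
                BreakDuration = TimeSpan.FromSeconds(30),
            })
            // Innermost: timeout applied to each individual attempt.
            .AddTimeout(TimeSpan.FromSeconds(5))
            .Build();
    }
}
```

One design point worth settling early: with default settings, Polly's retry strategy also retries `BrokenCircuitException`, so a real implementation would probably narrow the retry's `ShouldHandle` predicate to avoid retrying against an open circuit.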

```csharp
// Per-shard resilience pipeline: one Polly pipeline per shard ID
public interface IShardResiliencePipelineProvider
{
    ResiliencePipeline GetPipeline(string shardId);
    ResiliencePipeline<TResult> GetPipeline<TResult>(string shardId);
    CircuitBreakerState GetCircuitState(string shardId);
}
```
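A provider along these lines could cache one pipeline per shard ID in a `ConcurrentDictionary`. The sketch below is illustrative only: `ShardResiliencePipelineProviderSketch` is a hypothetical name, it assumes Polly 8 and .NET 7 or later, the breaker settings are placeholders, and it returns Polly's own `CircuitState` enum because the proposed `CircuitBreakerState` type is not defined here.

```csharp
using System;
using System.Collections.Concurrent;
using Polly;
using Polly.CircuitBreaker;

// Hypothetical sketch; member names mirror the proposed interface.
public sealed class ShardResiliencePipelineProviderSketch
{
    // Pairs a pipeline with the state provider that observes its breaker.
    private sealed record Entry(ResiliencePipeline Pipeline, CircuitBreakerStateProvider State);

    // Lazy<T> ensures only one pipeline is ever built per shard,
    // even when GetOrAdd races on first access.
    private readonly ConcurrentDictionary<string, Lazy<Entry>> _entries =
        new(StringComparer.Ordinal);

    public ResiliencePipeline GetPipeline(string shardId)
    {
        ArgumentException.ThrowIfNullOrEmpty(shardId);
        return GetEntry(shardId).Pipeline;
    }

    public CircuitState GetCircuitState(string shardId)
    {
        ArgumentException.ThrowIfNullOrEmpty(shardId);
        return GetEntry(shardId).State.CircuitState;
    }

    private Entry GetEntry(string shardId) =>
        _entries.GetOrAdd(shardId, _ => new Lazy<Entry>(CreateEntry)).Value;

    private static Entry CreateEntry()
    {
        var state = new CircuitBreakerStateProvider();
        var pipeline = new ResiliencePipelineBuilder()
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                StateProvider = state, // lets callers read the circuit state
                FailureRatio = 0.5,    // placeholder values
                MinimumThroughput = 5,
                SamplingDuration = TimeSpan.FromSeconds(60),
                BreakDuration = TimeSpan.FromSeconds(30),
            })
            .Build();
        return new Entry(pipeline, state);
    }
}
```

Keeping the `CircuitBreakerStateProvider` next to the pipeline is what makes `GetCircuitState` possible without executing anything through the pipeline.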

```csharp
// Health-aware shard router: wraps IShardRouter and excludes unhealthy shards
public interface IHealthAwareShardRouter : IShardRouter
{
    Task<Either<EncinaError, string>> RouteToHealthyShardAsync<T>(
        T entity, CancellationToken ct) where T : IShardable;
    IReadOnlyList<ShardHealthStatus> GetShardStatuses();
}

public record ShardHealthStatus(
    string ShardId,
    CircuitBreakerState CircuitState,
    int ActiveRequests,     // Bulkhead in-use
    int QueuedRequests,     // Bulkhead queued
    TimeSpan AvgLatency,
    DateTime LastFailureUtc);
```
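The exclusion rule itself is small. The sketch below shows one plausible reading of it, using only the base class library: scatter-gather targets are filtered, while a single-shard route fails fast because the shard key, not health, decides where the data lives. `HealthAwareRoutingSketch` and the local `CircuitBreakerState` enum are hypothetical stand-ins, and returning `null` stands in for the `Either<EncinaError, string>` error case.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the proposed CircuitBreakerState type (assumed values).
public enum CircuitBreakerState { Closed, Open, HalfOpen }

public static class HealthAwareRoutingSketch
{
    // Scatter-gather: keep only shards whose circuit is not open.
    // Half-open shards stay routable so recovery probes can reach them.
    public static IReadOnlyList<string> FilterRoutable(
        IEnumerable<string> candidateShardIds,
        Func<string, CircuitBreakerState> getState) =>
        candidateShardIds
            .Where(id => getState(id) != CircuitBreakerState.Open)
            .ToList();

    // Single-shard route: the shard key fixes the target, so an open circuit
    // means failing fast (null here) instead of rerouting elsewhere.
    public static string? RouteOrNull(
        string keyedShardId,
        Func<string, CircuitBreakerState> getState) =>
        getState(keyedShardId) == CircuitBreakerState.Open ? null : keyedShardId;
}
```

Whether half-open shards should remain routable is a design choice; excluding them would prevent the breaker from ever seeing the probe traffic it needs to close again.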

```csharp
// Pipeline behavior that applies per-shard resilience
public class ShardResiliencePipelineBehavior<TRequest, TResponse>
    : IPipelineBehavior<TRequest, TResponse>
{
    // Extracts shard ID from request context, applies the per-shard pipeline
}
```
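The core of such a behavior is running the next handler through the shard's pipeline and translating Polly's rejection exceptions into an outcome. The sketch below shows that step in isolation, assuming Polly 8 with the `Polly.RateLimiting` package. `ShardExecution` and `ShardOutcome` are hypothetical names, and the tuple return stands in for whatever result type the real behavior would use; extracting the shard ID from the request is left out because that mechanism is not specified here.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.RateLimiting;
using Polly.Timeout;

// Hypothetical outcome type for illustration.
public enum ShardOutcome { Success, CircuitOpen, BulkheadRejected, Timeout }

public static class ShardExecution
{
    // Runs `next` through the given shard's pipeline and classifies
    // Polly's rejection exceptions. Other exceptions propagate unchanged.
    public static async ValueTask<(ShardOutcome Outcome, TResponse? Value)> ExecuteAsync<TResponse>(
        ResiliencePipeline shardPipeline,
        Func<CancellationToken, ValueTask<TResponse>> next,
        CancellationToken ct = default)
    {
        try
        {
            var value = await shardPipeline.ExecuteAsync(next, ct);
            return (ShardOutcome.Success, value);
        }
        catch (BrokenCircuitException)
        {
            // The shard's circuit is open; the call was never attempted.
            return (ShardOutcome.CircuitOpen, default(TResponse));
        }
        catch (RateLimiterRejectedException)
        {
            // Concurrency limit and queue were both full.
            return (ShardOutcome.BulkheadRejected, default(TResponse));
        }
        catch (TimeoutRejectedException)
        {
            return (ShardOutcome.Timeout, default(TResponse));
        }
    }
}
```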

Key components:

  • IShardResiliencePipelineProvider: manages one Polly ResiliencePipeline per shard ID
  • IHealthAwareShardRouter: decorates IShardRouter, excludes shards with open circuit breakers
  • ShardResiliencePipelineBehavior<,>: MediatR pipeline behavior applying per-shard resilience
  • Per-shard retry with independent exponential backoff + jitter
  • Bulkhead isolation: separate concurrency pools per shard
  • Integration with existing ShardHealthResult and IShardedDatabaseHealthMonitor

Alternatives Considered

  1. Global circuit breaker: One breaker for all shards. Defeats the purpose: one bad shard trips the breaker for every shard, causing a total outage.
  2. Database-level health checks only: Reactive, not proactive. They detect failure after it happens but don't prevent cascading failures during degradation.
  3. Custom resilience without Polly: Build from scratch. Unnecessary, since Polly is the de facto .NET resilience library, well-tested, and already an Encina dependency.

Affected Packages

  • Encina (core) — IHealthAwareShardRouter, ShardHealthStatus, shard routing integration
  • Encina.Polly — IShardResiliencePipelineProvider, ShardResiliencePipelineBehavior, per-shard pipeline management
  • Encina.OpenTelemetry — Circuit breaker state metrics, per-shard resilience traces

Provider Implementation Matrix

This feature is primarily in Encina.Polly and Encina (core) — it's a decorator/behavior pattern that wraps existing routing, not provider-specific database code.

No 13-provider matrix needed. However, integration tests must verify that per-shard resilience works correctly with connections from all provider types.

Observability

Metrics (OpenTelemetry)

| Metric Name | Type | Description |
| --- | --- | --- |
| encina.sharding.resilience.circuit_state_changes_total | Counter | Circuit breaker state transitions, tagged by shard.id, from_state, to_state |
| encina.sharding.resilience.circuit_open_duration_ms | Histogram | Duration each circuit breaker stays open, tagged by shard.id |
| encina.sharding.resilience.retry_total | Counter | Retry attempts per shard, tagged by shard.id, attempt_number |
| encina.sharding.resilience.bulkhead_active_count | Gauge | Active requests in bulkhead per shard, tagged by shard.id |
| encina.sharding.resilience.bulkhead_queued_count | Gauge | Queued requests in bulkhead per shard, tagged by shard.id |
| encina.sharding.resilience.bulkhead_rejected_total | Counter | Requests rejected by bulkhead (queue full), tagged by shard.id |
| encina.sharding.resilience.timeout_total | Counter | Timeout occurrences per shard, tagged by shard.id |
| encina.sharding.resilience.healthy_shard_count | Gauge | Number of shards with closed circuit breakers |
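A few of these instruments could be declared with `System.Diagnostics.Metrics` as sketched below. The class name `ShardResilienceMetrics` and the meter name `Encina.Sharding.Resilience` are assumptions for illustration; only the instrument names and tags come from the table.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical metrics holder covering a counter, a histogram, and a gauge.
public sealed class ShardResilienceMetrics : IDisposable
{
    private readonly Meter _meter = new("Encina.Sharding.Resilience"); // assumed meter name
    private readonly Counter<long> _stateChanges;
    private readonly Histogram<double> _openDuration;

    public ShardResilienceMetrics(Func<int> healthyShardCount)
    {
        _stateChanges = _meter.CreateCounter<long>(
            "encina.sharding.resilience.circuit_state_changes_total");
        _openDuration = _meter.CreateHistogram<double>(
            "encina.sharding.resilience.circuit_open_duration_ms", unit: "ms");
        // Observable gauge: the callback is polled whenever metrics are collected.
        _meter.CreateObservableGauge(
            "encina.sharding.resilience.healthy_shard_count", healthyShardCount);
    }

    public void RecordStateChange(string shardId, string fromState, string toState) =>
        _stateChanges.Add(1,
            new KeyValuePair<string, object?>("shard.id", shardId),
            new KeyValuePair<string, object?>("from_state", fromState),
            new KeyValuePair<string, object?>("to_state", toState));

    public void RecordOpenDuration(string shardId, double milliseconds) =>
        _openDuration.Record(milliseconds,
            new KeyValuePair<string, object?>("shard.id", shardId));

    public void Dispose() => _meter.Dispose();
}
```

The remaining counters and gauges would follow the same pattern, with the bulkhead gauges reading their values from the per-shard limiter.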

Traces / Spans

| Span Name / Attribute | Applied To | Description |
| --- | --- | --- |
| encina.sharding.resilience.execute | New span | Per-shard resilience pipeline execution |
| shard.id | encina.sharding.resilience.execute | Target shard |
| shard.circuit.state | encina.sharding.resilience.execute | Circuit state at execution time (Closed/Open/HalfOpen) |
| shard.resilience.outcome | encina.sharding.resilience.execute | Success, Retry, CircuitOpen, BulkheadRejected, Timeout |
| shard.resilience.retry_count | encina.sharding.resilience.execute | Number of retries before success/failure |
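The span and its attributes map directly onto `System.Diagnostics.ActivitySource`. In the sketch below, `ShardResilienceTracing` and the source name `Encina.Sharding.Resilience` are assumed names; the span name and tag keys are taken from the table.

```csharp
#nullable enable
using System.Diagnostics;

// Hypothetical tracing helper for the per-shard execution span.
public static class ShardResilienceTracing
{
    private static readonly ActivitySource Source = new("Encina.Sharding.Resilience"); // assumed name

    // Starts the span. StartActivity returns null when no listener is
    // subscribed, so every call on the result is null-conditional.
    public static Activity? StartExecute(string shardId, string circuitState)
    {
        var activity = Source.StartActivity("encina.sharding.resilience.execute");
        activity?.SetTag("shard.id", shardId);
        activity?.SetTag("shard.circuit.state", circuitState);
        return activity;
    }

    // Records the final outcome and ends the span.
    public static void Complete(Activity? activity, string outcome, int retryCount)
    {
        activity?.SetTag("shard.resilience.outcome", outcome);
        activity?.SetTag("shard.resilience.retry_count", retryCount);
        activity?.Dispose();
    }
}
```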

Health Checks

  • ShardCircuitBreakerHealthCheck: Reports unhealthy if any shard's circuit breaker has been open for longer than a configurable threshold. Reports degraded if any shard is in the half-open state.
  • ShardBulkheadHealthCheck: Reports degraded if any shard bulkhead queue is above 80% capacity.
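The circuit breaker check could be implemented against `Microsoft.Extensions.Diagnostics.HealthChecks` roughly as follows. Everything named `...Sketch`, the `ShardCircuitSnapshot` record, and the local `CircuitBreakerState` enum are hypothetical. The description above does not say what to report for a circuit that is open but still under the threshold; this sketch assumes degraded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Stand-ins for types the proposal does not fully define.
public enum CircuitBreakerState { Closed, Open, HalfOpen }
public sealed record ShardCircuitSnapshot(string ShardId, CircuitBreakerState State, DateTime? OpenedAtUtc);

public sealed class ShardCircuitBreakerHealthCheckSketch : IHealthCheck
{
    private readonly Func<IReadOnlyList<ShardCircuitSnapshot>> _getSnapshots;
    private readonly TimeSpan _maxOpenDuration;
    private readonly Func<DateTime> _utcNow;

    public ShardCircuitBreakerHealthCheckSketch(
        Func<IReadOnlyList<ShardCircuitSnapshot>> getSnapshots,
        TimeSpan maxOpenDuration,
        Func<DateTime> utcNow)
    {
        _getSnapshots = getSnapshots;
        _maxOpenDuration = maxOpenDuration;
        _utcNow = utcNow; // injected clock keeps the check testable
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var now = _utcNow();
        var shards = _getSnapshots();

        // Unhealthy: at least one circuit open for longer than the threshold.
        var stuckOpen = shards
            .Where(s => s.State == CircuitBreakerState.Open
                        && s.OpenedAtUtc is { } openedAt
                        && now - openedAt > _maxOpenDuration)
            .Select(s => s.ShardId)
            .ToList();
        if (stuckOpen.Count > 0)
            return Task.FromResult(HealthCheckResult.Unhealthy(
                $"Circuits open too long: {string.Join(", ", stuckOpen)}"));

        // Degraded: any half-open circuit, or (assumption) any recently opened one.
        if (shards.Any(s => s.State != CircuitBreakerState.Closed))
            return Task.FromResult(HealthCheckResult.Degraded("Some shard circuits are not closed"));

        return Task.FromResult(HealthCheckResult.Healthy());
    }
}
```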

Structured Logging

  • ShardCircuitOpened (Warning): Shard ID, failure count, break duration, last error
  • ShardCircuitClosed (Information): Shard ID, duration in open state, successful probe count
  • ShardCircuitHalfOpen (Information): Shard ID, probing for recovery
  • ShardBulkheadRejected (Warning): Shard ID, active count, queue count, max concurrency
  • ShardRetryAttempt (Debug): Shard ID, attempt number, delay, exception type
  • ShardExcludedFromRouting (Warning): Shard ID, reason (circuit open / health check failed)
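Three of these events are sketched below with the `LoggerMessage` source generator from `Microsoft.Extensions.Logging`. The class name `ShardResilienceLog`, the event IDs, and the message templates are assumptions; the event names, levels, and fields follow the list above.

```csharp
#nullable enable
using System;
using Microsoft.Extensions.Logging;

// Hypothetical log class; method bodies are produced by the source generator.
public static partial class ShardResilienceLog
{
    // The Exception parameter is attached to the log entry as the exception
    // and does not need a placeholder in the message template.
    [LoggerMessage(EventId = 1, Level = LogLevel.Warning,
        Message = "Circuit opened for shard {ShardId} after {FailureCount} failures; breaking for {BreakDuration}")]
    public static partial void ShardCircuitOpened(
        ILogger logger, string shardId, int failureCount, TimeSpan breakDuration, Exception? lastError);

    [LoggerMessage(EventId = 5, Level = LogLevel.Debug,
        Message = "Retry {AttemptNumber} for shard {ShardId} after {Delay} ({ExceptionType})")]
    public static partial void ShardRetryAttempt(
        ILogger logger, string shardId, int attemptNumber, TimeSpan delay, string exceptionType);

    [LoggerMessage(EventId = 6, Level = LogLevel.Warning,
        Message = "Shard {ShardId} excluded from routing: {Reason}")]
    public static partial void ShardExcludedFromRouting(
        ILogger logger, string shardId, string reason);
}
```

Source-generated methods skip formatting when the level is disabled, which matters for `ShardRetryAttempt` since it can fire on every retry.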

Test Matrix

Per CLAUDE.md Testing Standards

| Test Type | Required? | Scope | Notes |
| --- | --- | --- | --- |
| UnitTests | Required | Pipeline provider, health-aware router, circuit state management | Mock Polly pipelines, test routing exclusion logic |
| GuardTests | Required | ShardResiliencePipelineProvider, configuration builders | Null shard IDs, invalid thresholds, zero concurrency |
| ContractTests | Required | IShardResiliencePipelineProvider, IHealthAwareShardRouter | Verify decorator contract: wraps any IShardRouter |
| PropertyTests | Required | Circuit breaker determinism, bulkhead fairness | Same failure pattern → same circuit state always; no starvation |
| IntegrationTests | Required | Polly pipelines with real Polly ResiliencePipeline instances | Real circuit breakers, verify state transitions |
| LoadTests | Required | Concurrent shard access with failing shards | Verify isolation: healthy shards unaffected by degraded ones |
| BenchmarkTests | Justify | Pipeline overhead is Polly's concern, not ours | Benchmark per-shard pipeline lookup overhead only |
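As an illustration of the isolation property the integration tests should verify, the sketch below trips one shard's breaker with real Polly 8 circuit breakers and checks that a second shard's pipeline still executes. `ShardIsolationCheck` is a hypothetical helper written without a test framework, and the low thresholds are chosen only so the breaker trips quickly.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

// Hypothetical isolation check using real Polly circuit breakers.
public static class ShardIsolationCheck
{
    private static ResiliencePipeline NewBreaker() =>
        new ResiliencePipelineBuilder()
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                FailureRatio = 0.5,
                MinimumThroughput = 2, // smallest value Polly accepts
                SamplingDuration = TimeSpan.FromSeconds(10),
                BreakDuration = TimeSpan.FromSeconds(30),
            })
            .Build();

    public static (bool Shard3Rejected, bool Shard1Serves) Run()
    {
        var shard1 = NewBreaker();
        var shard3 = NewBreaker();
        Action fail = () => throw new InvalidOperationException("shard-3 down");

        // Two consecutive failures reach MinimumThroughput with a 100% failure ratio.
        for (var i = 0; i < 2; i++)
        {
            try { shard3.Execute(fail); }
            catch (InvalidOperationException) { /* expected failure */ }
        }

        // The next shard-3 call should be rejected without running the callback.
        var shard3Rejected = false;
        try { shard3.Execute(() => 1); }
        catch (BrokenCircuitException) { shard3Rejected = true; }

        // shard-1 has its own breaker, so it should be unaffected.
        var shard1Serves = shard1.Execute(() => 1) == 1;

        return (shard3Rejected, shard1Serves);
    }
}
```

A framework-based version would assert that both tuple members are true; the load tests would extend the same idea to concurrent callers.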

Implementation Tasks

Core Abstractions

  • IShardResiliencePipelineProvider interface — pipeline per shard
  • IHealthAwareShardRouter interface — extends IShardRouter with health awareness
  • ShardHealthStatus record — shard health snapshot
  • ShardResilienceOptions — configuration for circuit breaker, retry, bulkhead, timeout
  • HealthAwareShardRouter — decorator implementation, excludes shards with open circuits
  • Integration with existing IShardedDatabaseHealthMonitor and ShardHealthResult

Polly Integration (Encina.Polly)

  • ShardResiliencePipelineProvider — manages ConcurrentDictionary<string, ResiliencePipeline>
  • ShardResiliencePipelineBehavior<TRequest, TResponse> — MediatR pipeline behavior
  • Per-shard Polly pipeline builder: circuit breaker + retry + bulkhead + timeout composition
  • WithPerShardResilience() fluent extension method on sharding builder
  • DI registration: AddPerShardResilience() in ServiceCollectionExtensions

Observability

  • Counters: circuit state changes, retries, bulkhead rejections, timeouts
  • Gauges: bulkhead active/queued count, healthy shard count
  • Histogram: circuit open duration
  • Trace attributes on resilience execution span
  • Health checks: ShardCircuitBreakerHealthCheck, ShardBulkheadHealthCheck
  • Structured logging events (6 events as specified)

Testing

  • UnitTests: Pipeline provider, health-aware router, circuit state management
  • GuardTests: Null shard IDs, invalid configuration values
  • ContractTests: IShardResiliencePipelineProvider and IHealthAwareShardRouter contracts
  • PropertyTests: Circuit breaker determinism, bulkhead fairness under random failure patterns
  • IntegrationTests: Real Polly pipelines, state transitions, health-aware routing
  • LoadTests: Concurrent shard access with selective shard failures

Documentation

  • XML doc comments on all public resilience APIs
  • Usage guide: per-shard resilience configuration, circuit breaker tuning, bulkhead sizing
  • Architecture guide: resilience pipeline composition diagram
  • CHANGELOG.md update

Acceptance Criteria

  • IShardResiliencePipelineProvider provides independent pipelines per shard
  • Circuit breakers operate independently per shard (shard-3 open doesn't affect shard-1)
  • IHealthAwareShardRouter excludes shards with open circuit breakers
  • Retry policies apply per-shard with independent backoff
  • Bulkhead isolation prevents one shard from exhausting thread pools
  • Circuit state transitions are observable via metrics and logs
  • Integration with existing IShardedDatabaseHealthMonitor works
  • OpenTelemetry metrics, traces, and health checks added
  • All test types per Test Matrix implemented (or justified)
  • Zero build warnings
  • Code coverage >= 85%

Documentation

  • XML doc comments with <example> tags
  • Usage guide: per-shard resilience patterns, tuning guidance
  • Architecture: resilience pipeline composition and routing flow
  • CHANGELOG.md update

Labels

  • area-polly: Polly resilience library integration
  • area-resilience: Resilience and fault tolerance patterns
  • area-sharding: Database sharding and horizontal partitioning
  • complexity-medium: Complexity: Medium
  • enhancement: New feature or request
  • pattern-circuit-breaker: Circuit Breaker resilience pattern
  • priority-high: Priority: High (⭐⭐⭐⭐)
