Summary
Add per-shard circuit breakers, retry policies, and bulkhead isolation using Polly. Health monitoring (IShardedDatabaseHealthMonitor) currently exists but doesn't influence routing decisions. This enhancement adds proactive shard isolation.
Motivation
If shard-3 becomes slow or unhealthy, requests to shard-1 and shard-2 should continue unaffected. Currently, scatter-gather handles partial failures reactively, but there's no proactive isolation. Enterprise systems need:
- Independent circuit breakers per shard
- Health-aware routing that excludes unhealthy shards
- Independent backoff per shard
- Bulkhead isolation (thread/semaphore pools per shard)
Industry comparison:
- Netflix Hystrix: Per-dependency circuit breakers with bulkhead isolation — the pattern that inspired Polly
- Vitess: Per-tablet (shard) health tracking with `TabletHealthCheck` — unhealthy tablets removed from routing
- CockroachDB: Per-range (shard) leaseholder tracking — automatic rerouting on leaseholder failure
- Citus: Worker node health monitoring — coordinator excludes unhealthy workers from distributed queries
- Encina (current): `IShardedDatabaseHealthMonitor` exists but is read-only — doesn't influence routing decisions
Problem with current approach: Health monitoring is observational only. When a shard degrades, requests continue to be routed to it until it fully fails. There's no circuit breaker to fast-fail, no bulkhead to prevent thread pool exhaustion, and no health-aware routing to bypass degraded shards.
Proposed Solution
// Per-shard circuit breaker pipeline behavior
services.AddEncinaSharding<Order>(options => { ... })
    .WithPerShardResilience(resilience => {
        resilience.CircuitBreaker(cb => {
            cb.FailureThreshold = 5;
            cb.DurationOfBreak = TimeSpan.FromSeconds(30);
            cb.SamplingDuration = TimeSpan.FromSeconds(60);
        });
        resilience.Retry(retry => {
            retry.MaxRetries = 3;
            retry.BackoffType = BackoffType.Exponential;
            retry.UseJitter = true;
        });
        resilience.Bulkhead(bulkhead => {
            bulkhead.MaxConcurrency = 10;
            bulkhead.MaxQueueSize = 20;
        });
        resilience.Timeout(timeout => {
            timeout.Timeout = TimeSpan.FromSeconds(5);
        });
    });
// Per-shard resilience pipeline — one Polly pipeline per (EntityType, ShardId)
public interface IShardResiliencePipelineProvider
{
    ResiliencePipeline GetPipeline(string shardId);
    ResiliencePipeline<TResult> GetPipeline<TResult>(string shardId);
    CircuitBreakerState GetCircuitState(string shardId);
}

// Health-aware shard router — wraps IShardRouter, excludes unhealthy shards
public interface IHealthAwareShardRouter : IShardRouter
{
    Task<Either<EncinaError, string>> RouteToHealthyShardAsync<T>(
        T entity, CancellationToken ct) where T : IShardable;

    IReadOnlyList<ShardHealthStatus> GetShardStatuses();
}

public record ShardHealthStatus(
    string ShardId,
    CircuitBreakerState CircuitState,
    int ActiveRequests,   // Bulkhead in-use
    int QueuedRequests,   // Bulkhead queued
    TimeSpan AvgLatency,
    DateTime LastFailureUtc);

// Pipeline behavior that applies per-shard resilience
public class ShardResiliencePipelineBehavior<TRequest, TResponse>
    : IPipelineBehavior<TRequest, TResponse>
{
    // Extracts shard ID from request context, applies the per-shard pipeline
}

Key components:
- `IShardResiliencePipelineProvider`: manages one Polly `ResiliencePipeline` per shard ID
- `IHealthAwareShardRouter`: decorates `IShardRouter`, excludes shards with open circuit breakers
- `ShardResiliencePipelineBehavior<,>`: MediatR pipeline behavior applying per-shard resilience (see the sketch below)
- Per-shard retry with independent exponential backoff + jitter
- Bulkhead isolation: separate concurrency pools per shard
- Integration with existing `ShardHealthResult` and `IShardedDatabaseHealthMonitor`
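To make the `ShardResiliencePipelineBehavior<,>` piece concrete, here is a minimal sketch of how the MediatR behavior could wrap handler execution in the shard's pipeline. The `IShardedRequest` marker interface used to surface the shard ID is a hypothetical illustration, not part of the proposal:

```csharp
using System.Threading;
using System.Threading.Tasks;
using MediatR;
using Polly;

// Hypothetical marker: exposes the target shard ID on sharded requests.
public interface IShardedRequest
{
    string ShardId { get; }
}

public sealed class ShardResiliencePipelineBehavior<TRequest, TResponse>
    : IPipelineBehavior<TRequest, TResponse>
    where TRequest : notnull, IShardedRequest
{
    private readonly IShardResiliencePipelineProvider _pipelines;

    public ShardResiliencePipelineBehavior(IShardResiliencePipelineProvider pipelines)
        => _pipelines = pipelines;

    public async Task<TResponse> Handle(
        TRequest request, RequestHandlerDelegate<TResponse> next, CancellationToken cancellationToken)
    {
        // Resolve this shard's pipeline and wrap the rest of the MediatR pipeline in it.
        ResiliencePipeline<TResponse> pipeline = _pipelines.GetPipeline<TResponse>(request.ShardId);

        return await pipeline.ExecuteAsync(
            async _ => await next().ConfigureAwait(false),
            cancellationToken).ConfigureAwait(false);
    }
}
```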
Alternatives Considered
- Global circuit breaker: One breaker for all shards. Defeats the purpose — one bad shard trips the breaker for all shards, causing total outage.
- Database-level health checks only: Reactive, not proactive. Detects failure after it happens but doesn't prevent cascading failures during degradation.
- Custom resilience without Polly: Build from scratch. Polly is the .NET standard for resilience, well-tested, and already an Encina dependency.
Affected Packages
- Encina (core) — `IHealthAwareShardRouter`, `ShardHealthStatus`, shard routing integration
- Encina.Polly — `IShardResiliencePipelineProvider`, `ShardResiliencePipelineBehavior`, per-shard pipeline management
- Encina.OpenTelemetry — circuit breaker state metrics, per-shard resilience traces
Provider Implementation Matrix
This feature is primarily in Encina.Polly and Encina (core) — it's a decorator/behavior pattern that wraps existing routing, not provider-specific database code. No 13-provider matrix is needed. However, integration tests must verify that per-shard resilience works correctly with connections from all provider types.
Observability
Metrics (OpenTelemetry)
| Metric Name | Type | Description |
|---|---|---|
| `encina.sharding.resilience.circuit_state_changes_total` | Counter | Circuit breaker state transitions, tagged by `shard.id`, `from_state`, `to_state` |
| `encina.sharding.resilience.circuit_open_duration_ms` | Histogram | Duration each circuit breaker stays open, tagged by `shard.id` |
| `encina.sharding.resilience.retry_total` | Counter | Retry attempts per shard, tagged by `shard.id`, `attempt_number` |
| `encina.sharding.resilience.bulkhead_active_count` | Gauge | Active requests in bulkhead per shard, tagged by `shard.id` |
| `encina.sharding.resilience.bulkhead_queued_count` | Gauge | Queued requests in bulkhead per shard, tagged by `shard.id` |
| `encina.sharding.resilience.bulkhead_rejected_total` | Counter | Requests rejected by bulkhead (queue full), tagged by `shard.id` |
| `encina.sharding.resilience.timeout_total` | Counter | Timeout occurrences per shard, tagged by `shard.id` |
| `encina.sharding.resilience.healthy_shard_count` | Gauge | Number of shards with closed circuit breakers |
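As an implementation hint, a minimal sketch of how the counters above could be emitted with `System.Diagnostics.Metrics` so the OpenTelemetry exporter picks them up; the meter name and class shape are assumptions, and the gauge instruments (which would typically be `ObservableGauge` callbacks) are omitted:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class ShardResilienceMetrics
{
    private readonly Counter<long> _retries;
    private readonly Counter<long> _bulkheadRejected;
    private readonly Counter<long> _timeouts;

    public ShardResilienceMetrics(IMeterFactory meterFactory)
    {
        // Meter name is an assumption; instrument names follow the table above.
        var meter = meterFactory.Create("Encina.Sharding.Resilience");
        _retries = meter.CreateCounter<long>("encina.sharding.resilience.retry_total");
        _bulkheadRejected = meter.CreateCounter<long>("encina.sharding.resilience.bulkhead_rejected_total");
        _timeouts = meter.CreateCounter<long>("encina.sharding.resilience.timeout_total");
    }

    public void RecordRetry(string shardId, int attemptNumber) =>
        _retries.Add(1,
            new KeyValuePair<string, object?>("shard.id", shardId),
            new KeyValuePair<string, object?>("attempt_number", attemptNumber));

    public void RecordBulkheadRejection(string shardId) =>
        _bulkheadRejected.Add(1, new KeyValuePair<string, object?>("shard.id", shardId));

    public void RecordTimeout(string shardId) =>
        _timeouts.Add(1, new KeyValuePair<string, object?>("shard.id", shardId));
}
```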
Traces / Spans
| Span Name / Attribute | Applied To | Description |
|---|---|---|
| `encina.sharding.resilience.execute` | New span | Per-shard resilience pipeline execution |
| `shard.id` | `encina.sharding.resilience.execute` | Target shard |
| `shard.circuit.state` | `encina.sharding.resilience.execute` | Circuit state at execution time (Closed/Open/HalfOpen) |
| `shard.resilience.outcome` | `encina.sharding.resilience.execute` | Success, Retry, CircuitOpen, BulkheadRejected, Timeout |
| `shard.resilience.retry_count` | `encina.sharding.resilience.execute` | Number of retries before success/failure |
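A minimal sketch of how the span and its attributes could be emitted via `ActivitySource`; the source name is an assumption:

```csharp
using System.Diagnostics;

internal static class ShardResilienceTracing
{
    // ActivitySource name is an assumption; register it with the OpenTelemetry tracer provider.
    private static readonly ActivitySource Source = new("Encina.Sharding.Resilience");

    public static Activity? StartExecuteSpan(string shardId, string circuitState)
    {
        Activity? activity = Source.StartActivity("encina.sharding.resilience.execute");
        activity?.SetTag("shard.id", shardId);
        activity?.SetTag("shard.circuit.state", circuitState);
        return activity;
    }

    public static void RecordOutcome(Activity? activity, string outcome, int retryCount)
    {
        activity?.SetTag("shard.resilience.outcome", outcome);
        activity?.SetTag("shard.resilience.retry_count", retryCount);
    }
}
```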
Health Checks
- `ShardCircuitBreakerHealthCheck`: reports unhealthy if any shard's circuit breaker has been open for longer than a configurable threshold; degraded if any shard is in the half-open state (sketched below).
- `ShardBulkheadHealthCheck`: reports degraded if any shard's bulkhead queue is above 80% capacity.
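A minimal sketch of `ShardCircuitBreakerHealthCheck`, assuming the pipeline provider exposes per-shard circuit state and that the set of configured shard IDs can be injected; the open-duration threshold is omitted for brevity:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class ShardCircuitBreakerHealthCheck : IHealthCheck
{
    private readonly IShardResiliencePipelineProvider _pipelines;
    private readonly IReadOnlyList<string> _shardIds; // hypothetical: supplied by sharding configuration

    public ShardCircuitBreakerHealthCheck(
        IShardResiliencePipelineProvider pipelines, IReadOnlyList<string> shardIds)
    {
        _pipelines = pipelines;
        _shardIds = shardIds;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var open = _shardIds.Where(id => _pipelines.GetCircuitState(id) == CircuitBreakerState.Open).ToList();
        var halfOpen = _shardIds.Where(id => _pipelines.GetCircuitState(id) == CircuitBreakerState.HalfOpen).ToList();

        if (open.Count > 0)
            return Task.FromResult(HealthCheckResult.Unhealthy($"Circuit open for: {string.Join(", ", open)}"));

        if (halfOpen.Count > 0)
            return Task.FromResult(HealthCheckResult.Degraded($"Circuit half-open for: {string.Join(", ", halfOpen)}"));

        return Task.FromResult(HealthCheckResult.Healthy("All shard circuits closed."));
    }
}
```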
Structured Logging
- `ShardCircuitOpened` (Warning): shard ID, failure count, break duration, last error
- `ShardCircuitClosed` (Information): shard ID, duration in open state, successful probe count
- `ShardCircuitHalfOpen` (Information): shard ID, probing for recovery
- `ShardBulkheadRejected` (Warning): shard ID, active count, queue count, max concurrency
- `ShardRetryAttempt` (Debug): shard ID, attempt number, delay, exception type
- `ShardExcludedFromRouting` (Warning): shard ID, reason (circuit open / health check failed)
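A minimal sketch of how these events could be defined with the `LoggerMessage` source generator (three of the six shown); event IDs and message templates are assumptions:

```csharp
using System;
using Microsoft.Extensions.Logging;

public static partial class ShardResilienceLog
{
    [LoggerMessage(EventId = 5201, Level = LogLevel.Warning,
        Message = "Shard {ShardId} circuit opened after {FailureCount} failures; breaking for {BreakDuration}")]
    public static partial void ShardCircuitOpened(
        ILogger logger, string shardId, int failureCount, TimeSpan breakDuration);

    [LoggerMessage(EventId = 5202, Level = LogLevel.Information,
        Message = "Shard {ShardId} circuit closed after {OpenDuration} in the open state")]
    public static partial void ShardCircuitClosed(ILogger logger, string shardId, TimeSpan openDuration);

    [LoggerMessage(EventId = 5204, Level = LogLevel.Warning,
        Message = "Shard {ShardId} bulkhead rejected a request (active {ActiveCount}, queued {QueuedCount})")]
    public static partial void ShardBulkheadRejected(
        ILogger logger, string shardId, int activeCount, int queuedCount);
}
```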
Test Matrix
Per CLAUDE.md Testing Standards
| Test Type | Required? | Scope | Notes |
|---|---|---|---|
| UnitTests | Required | Pipeline provider, health-aware router, circuit state management | Mock Polly pipelines, test routing exclusion logic |
| GuardTests | Required | `ShardResiliencePipelineProvider`, configuration builders | Null shard IDs, invalid thresholds, zero concurrency |
| ContractTests | Required | `IShardResiliencePipelineProvider`, `IHealthAwareShardRouter` | Verify decorator contract: wraps any `IShardRouter` |
| PropertyTests | Required | Circuit breaker determinism, bulkhead fairness | Same failure pattern → same circuit state always; no starvation |
| IntegrationTests | Required | Polly pipelines with real Polly `ResiliencePipeline` instances | Real circuit breakers, verify state transitions |
| LoadTests | Required | Concurrent shard access with failing shards | Verify isolation: healthy shards unaffected by degraded ones |
| BenchmarkTests | Justify | Pipeline overhead is Polly's concern, not ours | Benchmark per-shard pipeline lookup overhead only |
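As a concrete illustration of the isolation requirement behind the UnitTests and LoadTests rows, a minimal xUnit sketch asserting that pipelines are cached per shard and independent across shards; it assumes the `ShardResiliencePipelineProvider` sketched under Polly Integration below, constructed with defaults:

```csharp
using Polly;
using Xunit;

public class ShardResiliencePipelineProviderTests
{
    [Fact]
    public void Pipelines_are_cached_per_shard_and_independent_across_shards()
    {
        var provider = new ShardResiliencePipelineProvider();

        ResiliencePipeline shard1a = provider.GetPipeline("shard-1");
        ResiliencePipeline shard1b = provider.GetPipeline("shard-1");
        ResiliencePipeline shard3 = provider.GetPipeline("shard-3");

        Assert.Same(shard1a, shard1b);   // one pipeline (and therefore one circuit breaker) per shard
        Assert.NotSame(shard1a, shard3); // shard-3's breaker cannot trip shard-1's
    }
}
```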
Implementation Tasks
Core Abstractions
- `IShardResiliencePipelineProvider` interface — pipeline per shard
- `IHealthAwareShardRouter` interface — extends `IShardRouter` with health awareness
- `ShardHealthStatus` record — shard health snapshot
- `ShardResilienceOptions` — configuration for circuit breaker, retry, bulkhead, timeout
- `HealthAwareShardRouter` — decorator implementation, excludes shards with open circuits (see the sketch after this list)
- Integration with existing `IShardedDatabaseHealthMonitor` and `ShardHealthResult`
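The exclusion rule at the heart of `HealthAwareShardRouter` reduces to filtering candidate shards by circuit state. The helper below is purely illustrative, since the full `IShardRouter` surface is defined elsewhere:

```csharp
using System.Collections.Generic;
using System.Linq;

internal static class ShardHealthFilter
{
    // Illustrative only: a shard is routable unless its circuit breaker is currently open.
    public static IReadOnlyList<string> SelectRoutableShards(
        IEnumerable<string> candidateShardIds,
        IShardResiliencePipelineProvider pipelines) =>
        candidateShardIds
            .Where(id => pipelines.GetCircuitState(id) != CircuitBreakerState.Open)
            .ToList();
}
```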
Polly Integration (Encina.Polly)
- `ShardResiliencePipelineProvider` — manages `ConcurrentDictionary<string, ResiliencePipeline>` (see the sketch after this list)
- `ShardResiliencePipelineBehavior<TRequest, TResponse>` — MediatR pipeline behavior
- Per-shard Polly pipeline builder: circuit breaker + retry + bulkhead + timeout composition
- `WithPerShardResilience()` fluent extension method on sharding builder
- DI registration: `AddPerShardResilience()` in `ServiceCollectionExtensions`
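A minimal sketch of that provider (non-generic pipelines only; `GetPipeline<TResult>` and `GetCircuitState` are omitted), mapping the proposed options onto Polly v8 strategies. Note that Polly's circuit breaker is ratio-based (`FailureRatio` + `MinimumThroughput`), so a count-style `FailureThreshold` would need translating; the values below mirror the configuration example in the Proposed Solution:

```csharp
using System;
using System.Collections.Concurrent;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

public sealed class ShardResiliencePipelineProvider // : IShardResiliencePipelineProvider (remaining members omitted)
{
    private readonly ConcurrentDictionary<string, ResiliencePipeline> _pipelines = new();

    public ResiliencePipeline GetPipeline(string shardId) =>
        _pipelines.GetOrAdd(shardId, static _ => BuildPipeline());

    private static ResiliencePipeline BuildPipeline() =>
        new ResiliencePipelineBuilder()
            // Outermost: retry with exponential backoff + jitter, independent per shard.
            .AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = 3,
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true
            })
            // Circuit breaker: fast-fail once this shard keeps failing.
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                FailureRatio = 0.5,
                MinimumThroughput = 5,
                SamplingDuration = TimeSpan.FromSeconds(60),
                BreakDuration = TimeSpan.FromSeconds(30)
            })
            // Bulkhead: cap concurrency (10) and queue length (20) for this shard only.
            .AddConcurrencyLimiter(10, 20)
            // Innermost: per-attempt timeout.
            .AddTimeout(TimeSpan.FromSeconds(5))
            .Build();
}
```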
Observability
- Counters: circuit state changes, retries, bulkhead rejections, timeouts
- Gauges: bulkhead active/queued count, healthy shard count
- Histogram: circuit open duration
- Trace attributes on resilience execution span
- Health checks: `ShardCircuitBreakerHealthCheck`, `ShardBulkheadHealthCheck`
- Structured logging events (6 events as specified)
Testing
- UnitTests: Pipeline provider, health-aware router, circuit state management
- GuardTests: Null shard IDs, invalid configuration values
- ContractTests: `IShardResiliencePipelineProvider` and `IHealthAwareShardRouter` contracts
- PropertyTests: Circuit breaker determinism, bulkhead fairness under random failure patterns
- IntegrationTests: Real Polly pipelines, state transitions, health-aware routing
- LoadTests: Concurrent shard access with selective shard failures
Documentation
- XML doc comments on all public resilience APIs
- Usage guide: per-shard resilience configuration, circuit breaker tuning, bulkhead sizing
- Architecture guide: resilience pipeline composition diagram
- CHANGELOG.md update
Acceptance Criteria
- `IShardResiliencePipelineProvider` provides independent pipelines per shard
- Circuit breakers operate independently per shard (shard-3 open doesn't affect shard-1)
- `IHealthAwareShardRouter` excludes shards with open circuit breakers
- Retry policies apply per shard with independent backoff
- Bulkhead isolation prevents one shard from exhausting thread pools
- Circuit state transitions are observable via metrics and logs
- Integration with existing `IShardedDatabaseHealthMonitor` works
- OpenTelemetry metrics, traces, and health checks added
- All test types per Test Matrix implemented (or justified)
- Zero build warnings
- Code coverage >= 85%
Documentation
- XML doc comments with `<example>` tags
- Usage guide: per-shard resilience patterns, tuning guidance
- Architecture: resilience pipeline composition and routing flow
- CHANGELOG.md update
Related Issues
- [FEATURE] Database Sharding Abstractions #289 - Database Sharding Abstractions (parent feature — completed)
- [FEATURE] CDC (Change Data Capture) Pattern #308 - Polly Integration (dependency — provides base Polly infrastructure)
- [FEATURE] Read/Write Separation + Sharding #644 - Read/Write Separation (per-replica resilience extends this pattern)
- [FEATURE] Online Resharding Workflow #648 - Online Resharding (resharding benefits from circuit breaker protection during migration)
- Sharding Enhancement Study — Enhancement #6, Tier 1