[FEATURE] Per-Shard Resilience (Circuit Breakers + Retry)

## Summary

Add per-shard circuit breakers, retry policies, and bulkhead isolation using Polly. Health monitoring (`IShardedDatabaseHealthMonitor`) currently exists but doesn't influence routing decisions. This enhancement adds proactive shard isolation.

## Motivation

If shard-3 becomes slow or unhealthy, requests to shard-1 and shard-2 should continue unaffected. Currently, scatter-gather handles partial failures reactively, but there's no proactive isolation. Enterprise systems need:
- Independent circuit breakers per shard
- Health-aware routing that excludes unhealthy shards
- Independent backoff per shard
- Bulkhead isolation (thread/semaphore pools per shard)

**Industry comparison**:
- **Netflix Hystrix**: Per-dependency circuit breakers with bulkhead isolation, the same patterns Polly provides for .NET
- **Vitess**: Per-tablet (shard) health tracking with `TabletHealthCheck` — unhealthy tablets removed from routing
- **CockroachDB**: Per-range (shard) leaseholder tracking — automatic rerouting on leaseholder failure
- **Citus**: Worker node health monitoring — coordinator excludes unhealthy workers from distributed queries
- **Encina (current)**: `IShardedDatabaseHealthMonitor` exists but is read-only — doesn't influence routing decisions

**Problem with current approach**: Health monitoring is observational only. When a shard degrades, requests continue to be routed to it until it fully fails. There's no circuit breaker to fast-fail, no bulkhead to prevent thread pool exhaustion, and no health-aware routing to bypass degraded shards.

## Proposed Solution

```csharp
// Per-shard circuit breaker pipeline behavior
services.AddEncinaSharding<Order>(options => { ... })
    .WithPerShardResilience(resilience => {
        resilience.CircuitBreaker(cb => {
            cb.FailureThreshold = 5;
            cb.DurationOfBreak = TimeSpan.FromSeconds(30);
            cb.SamplingDuration = TimeSpan.FromSeconds(60);
        });
        resilience.Retry(retry => {
            retry.MaxRetries = 3;
            retry.BackoffType = BackoffType.Exponential;
            retry.UseJitter = true;
        });
        resilience.Bulkhead(bulkhead => {
            bulkhead.MaxConcurrency = 10;
            bulkhead.MaxQueueSize = 20;
        });
        resilience.Timeout(timeout => {
            timeout.Timeout = TimeSpan.FromSeconds(5);
        });
    });

// Per-shard resilience pipeline — one Polly pipeline per (EntityType, ShardId)
public interface IShardResiliencePipelineProvider
{
    ResiliencePipeline GetPipeline(string shardId);
    ResiliencePipeline<TResult> GetPipeline<TResult>(string shardId);
    CircuitBreakerState GetCircuitState(string shardId);
}

// Health-aware shard router — wraps IShardRouter, excludes unhealthy shards
public interface IHealthAwareShardRouter : IShardRouter
{
    Task<Either<EncinaError, string>> RouteToHealthyShardAsync<T>(
        T entity, CancellationToken ct) where T : IShardable;
    IReadOnlyList<ShardHealthStatus> GetShardStatuses();
}

public record ShardHealthStatus(
    string ShardId,
    CircuitBreakerState CircuitState,
    int ActiveRequests,     // Bulkhead in-use
    int QueuedRequests,     // Bulkhead queued
    TimeSpan AvgLatency,
    DateTime LastFailureUtc);

// Pipeline behavior that applies per-shard resilience
public class ShardResiliencePipelineBehavior<TRequest, TResponse>
    : IPipelineBehavior<TRequest, TResponse>
{
    // Extracts shard ID from request context, applies the per-shard pipeline
}
```

**Key components**:
- `IShardResiliencePipelineProvider`: manages one Polly `ResiliencePipeline` per shard ID
- `IHealthAwareShardRouter`: decorates `IShardRouter`, excludes shards with open circuit breakers
- `ShardResiliencePipelineBehavior<,>`: MediatR pipeline behavior applying per-shard resilience
- Per-shard retry with independent exponential backoff + jitter
- Bulkhead isolation: separate concurrency pools per shard
- Integration with existing `ShardHealthResult` and `IShardedDatabaseHealthMonitor`
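
As a rough illustration of the provider component, the sketch below keeps one lazily built pipeline per shard ID using Polly v8 types. It is not the Encina implementation: the class name, the strategy ordering (limiter, retry, breaker, timeout), and the mapping of `FailureThreshold = 5` onto Polly's `FailureRatio` plus `MinimumThroughput` are all assumptions, and it returns Polly's own `CircuitState` rather than the `CircuitBreakerState` type named above. The concurrency limiter comes from the separate `Polly.RateLimiting` package.

```csharp
using System;
using System.Collections.Concurrent;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Hypothetical sketch, not the Encina implementation: one lazily built
// Polly v8 pipeline per shard ID, so breaker and limiter state is never shared.
public sealed class ShardResiliencePipelineProviderSketch
{
    private sealed record Entry(ResiliencePipeline Pipeline, CircuitBreakerStateProvider State);

    // Lazy<T> ensures each shard's pipeline is built once, even under contention.
    private readonly ConcurrentDictionary<string, Lazy<Entry>> _entries = new();

    public ResiliencePipeline GetPipeline(string shardId)
    {
        ArgumentException.ThrowIfNullOrEmpty(shardId);
        return _entries.GetOrAdd(shardId, _ => new Lazy<Entry>(Build)).Value.Pipeline;
    }

    // Shards that have never been used have no breaker yet and report Closed.
    public CircuitState GetCircuitState(string shardId) =>
        _entries.TryGetValue(shardId, out var entry)
            ? entry.Value.State.CircuitState
            : CircuitState.Closed;

    private static Entry Build()
    {
        var state = new CircuitBreakerStateProvider();
        var pipeline = new ResiliencePipelineBuilder()
            // Bulkhead (outermost): 10 concurrent calls, 20 queued.
            .AddConcurrencyLimiter(10, 20)
            .AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = 3,
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
                // Do not retry calls the breaker rejected; let them fail fast.
                ShouldHandle = new PredicateBuilder().Handle<Exception>(
                    ex => ex is not (BrokenCircuitException or OperationCanceledException)),
            })
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                // Assumed mapping of FailureThreshold = 5 onto Polly's ratio model.
                FailureRatio = 0.5,
                MinimumThroughput = 5,
                SamplingDuration = TimeSpan.FromSeconds(60),
                BreakDuration = TimeSpan.FromSeconds(30),
                StateProvider = state,
            })
            // Innermost: a timeout applied to each individual attempt.
            .AddTimeout(TimeSpan.FromSeconds(5))
            .Build();
        return new Entry(pipeline, state);
    }
}
```

Because every shard ID gets its own `Entry`, failures recorded against one shard's breaker cannot open another shard's circuit. A real provider would read these values from `ShardResilienceOptions` instead of hard-coding them.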

## Alternatives Considered

1. **Global circuit breaker**: One breaker for all shards. Defeats the purpose — one bad shard trips the breaker for all shards, causing total outage.
2. **Database-level health checks only**: Reactive, not proactive. Detects failure after it happens but doesn't prevent cascading failures during degradation.
3. **Custom resilience without Polly**: Build from scratch. Rejected: Polly is the de facto standard for resilience in .NET, is well tested, and is already an Encina dependency.

## Affected Packages

- [x] Encina (core) — `IHealthAwareShardRouter`, `ShardHealthStatus`, shard routing integration
- [x] Encina.Polly — `IShardResiliencePipelineProvider`, `ShardResiliencePipelineBehavior`, per-shard pipeline management
- [x] Encina.OpenTelemetry — Circuit breaker state metrics, per-shard resilience traces

## Provider Implementation Matrix

> This feature is primarily in `Encina.Polly` and `Encina` (core) — it's a decorator/behavior pattern that wraps existing routing, not provider-specific database code.
>
> No 13-provider matrix needed. However, integration tests must verify that per-shard resilience works correctly with connections from all provider types.

## Observability

### Metrics (OpenTelemetry)

| Metric Name | Type | Description |
|-------------|------|-------------|
| `encina.sharding.resilience.circuit_state_changes_total` | Counter | Circuit breaker state transitions, tagged by `shard.id`, `from_state`, `to_state` |
| `encina.sharding.resilience.circuit_open_duration_ms` | Histogram | Duration each circuit breaker stays open, tagged by `shard.id` |
| `encina.sharding.resilience.retry_total` | Counter | Retry attempts per shard, tagged by `shard.id`, `attempt_number` |
| `encina.sharding.resilience.bulkhead_active_count` | Gauge | Active requests in bulkhead per shard, tagged by `shard.id` |
| `encina.sharding.resilience.bulkhead_queued_count` | Gauge | Queued requests in bulkhead per shard, tagged by `shard.id` |
| `encina.sharding.resilience.bulkhead_rejected_total` | Counter | Requests rejected by bulkhead (queue full), tagged by `shard.id` |
| `encina.sharding.resilience.timeout_total` | Counter | Timeout occurrences per shard, tagged by `shard.id` |
| `encina.sharding.resilience.healthy_shard_count` | Gauge | Number of shards with closed circuit breakers |
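
The instruments in the table above could be registered with `System.Diagnostics.Metrics` roughly as follows. This is a sketch under assumptions: the meter name `Encina.Sharding.Resilience`, the class name, and the method names are hypothetical, and only three of the eight instruments are shown.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical sketch: a counter, a histogram, and an observable gauge
// using metric names from the table. The meter name is an assumption.
public sealed class ShardResilienceMetricsSketch : IDisposable
{
    public const string MeterName = "Encina.Sharding.Resilience";

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _stateChanges;
    private readonly Histogram<double> _openDuration;

    public ShardResilienceMetricsSketch(Func<int> healthyShardCount)
    {
        _stateChanges = _meter.CreateCounter<long>(
            "encina.sharding.resilience.circuit_state_changes_total");
        _openDuration = _meter.CreateHistogram<double>(
            "encina.sharding.resilience.circuit_open_duration_ms", unit: "ms");
        // The gauge is polled by the exporter; the callback supplies the current value.
        _meter.CreateObservableGauge(
            "encina.sharding.resilience.healthy_shard_count", healthyShardCount);
    }

    public void CircuitStateChanged(string shardId, string fromState, string toState) =>
        _stateChanges.Add(1,
            new KeyValuePair<string, object?>("shard.id", shardId),
            new KeyValuePair<string, object?>("from_state", fromState),
            new KeyValuePair<string, object?>("to_state", toState));

    public void CircuitWasOpenFor(string shardId, TimeSpan duration) =>
        _openDuration.Record(duration.TotalMilliseconds,
            new KeyValuePair<string, object?>("shard.id", shardId));

    public void Dispose() => _meter.Dispose();
}
```

An OpenTelemetry setup would then subscribe to the meter by name; the remaining counters and gauges follow the same shape.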

### Traces / Spans

| Span Name / Attribute | Applied To | Description |
|----------------------|------------|-------------|
| `encina.sharding.resilience.execute` | New span | Per-shard resilience pipeline execution |
| `shard.id` | `encina.sharding.resilience.execute` | Target shard |
| `shard.circuit.state` | `encina.sharding.resilience.execute` | Circuit state at execution time (Closed/Open/HalfOpen) |
| `shard.resilience.outcome` | `encina.sharding.resilience.execute` | Success, Retry, CircuitOpen, BulkheadRejected, Timeout |
| `shard.resilience.retry_count` | `encina.sharding.resilience.execute` | Number of retries before success/failure |
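
One way the span and its attributes might be emitted is with `System.Diagnostics.ActivitySource`, as sketched below. The source name, the helper name, and the `classifyOutcome` callback (which maps an exception to one of the outcome values in the table) are assumptions made for illustration; `shard.resilience.retry_count` is omitted because it depends on how the retry strategy reports attempts.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical sketch: wraps one per-shard execution in the span described above.
public static class ShardResilienceTracingSketch
{
    public const string SourceName = "Encina.Sharding.Resilience"; // assumed name

    private static readonly ActivitySource Source = new(SourceName);

    public static async Task<T> ExecuteAsync<T>(
        string shardId,
        string circuitState,
        Func<Task<T>> operation,
        Func<Exception, string> classifyOutcome)
    {
        // StartActivity returns null when no listener is sampling this source.
        using var activity = Source.StartActivity("encina.sharding.resilience.execute");
        activity?.SetTag("shard.id", shardId);
        activity?.SetTag("shard.circuit.state", circuitState);
        try
        {
            var result = await operation().ConfigureAwait(false);
            activity?.SetTag("shard.resilience.outcome", "Success");
            return result;
        }
        catch (Exception ex)
        {
            // The caller decides which outcome value a given exception maps to.
            activity?.SetTag("shard.resilience.outcome", classifyOutcome(ex));
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

With Polly in place, the classifier would typically map the breaker, limiter, and timeout rejection exceptions to `CircuitOpen`, `BulkheadRejected`, and `Timeout` respectively.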

### Health Checks

- `ShardCircuitBreakerHealthCheck`: Reports unhealthy if any shard's circuit breaker has been open for longer than a configurable threshold. Reports degraded if any shard is in the half-open state.
- `ShardBulkheadHealthCheck`: Reports degraded if any shard bulkhead queue is above 80% capacity.
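
The circuit breaker check could look roughly like the sketch below, built on `IHealthCheck` from `Microsoft.Extensions.Diagnostics.HealthChecks`. The snapshot record, the local state enum, and the injected clock are hypothetical stand-ins. The rules above do not say what to report for a circuit that is open but still under the threshold; the sketch assumes degraded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-ins for whatever state the provider exposes.
public enum SketchCircuitState { Closed, Open, HalfOpen }

public sealed record ShardCircuitSnapshot(
    string ShardId, SketchCircuitState State, DateTimeOffset? OpenedAtUtc);

public sealed class ShardCircuitBreakerHealthCheckSketch : IHealthCheck
{
    private readonly Func<IReadOnlyList<ShardCircuitSnapshot>> _snapshots;
    private readonly TimeSpan _unhealthyAfter;
    private readonly Func<DateTimeOffset> _utcNow;

    public ShardCircuitBreakerHealthCheckSketch(
        Func<IReadOnlyList<ShardCircuitSnapshot>> snapshots,
        TimeSpan unhealthyAfter,
        Func<DateTimeOffset> utcNow)
    {
        _snapshots = snapshots;
        _unhealthyAfter = unhealthyAfter;
        _utcNow = utcNow;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var now = _utcNow();
        var shards = _snapshots();

        // Unhealthy: at least one circuit has been open longer than the threshold.
        var stuckOpen = shards
            .Where(s => s.State == SketchCircuitState.Open
                        && s.OpenedAtUtc is { } openedAt
                        && now - openedAt > _unhealthyAfter)
            .Select(s => s.ShardId)
            .ToList();
        if (stuckOpen.Count > 0)
            return Task.FromResult(HealthCheckResult.Unhealthy(
                $"Circuits open too long: {string.Join(", ", stuckOpen)}"));

        // Degraded: half-open, or (assumption) open but still under the threshold.
        var recovering = shards
            .Where(s => s.State != SketchCircuitState.Closed)
            .Select(s => s.ShardId)
            .ToList();
        if (recovering.Count > 0)
            return Task.FromResult(HealthCheckResult.Degraded(
                $"Circuits not closed: {string.Join(", ", recovering)}"));

        return Task.FromResult(HealthCheckResult.Healthy("All shard circuits closed"));
    }
}
```

Injecting the clock keeps the threshold logic testable without waiting for real time to pass.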

### Structured Logging

- `ShardCircuitOpened` (Warning): Shard ID, failure count, break duration, last error
- `ShardCircuitClosed` (Information): Shard ID, duration in open state, successful probe count
- `ShardCircuitHalfOpen` (Information): Shard ID, probing for recovery
- `ShardBulkheadRejected` (Warning): Shard ID, active count, queue count, max concurrency
- `ShardRetryAttempt` (Debug): Shard ID, attempt number, delay, exception type
- `ShardExcludedFromRouting` (Warning): Shard ID, reason (circuit open / health check failed)
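
Two of these events are sketched below with `LoggerMessage.Define`, which caches the message template and skips formatting when the level is disabled. The event IDs, message templates, and class name are assumptions; the other four events would follow the same pattern.

```csharp
using System;
using Microsoft.Extensions.Logging;

// Hypothetical sketch: strongly typed log methods for two of the six events.
public static class ShardResilienceLogSketch
{
    private static readonly Action<ILogger, string, int, TimeSpan, Exception?> CircuitOpened =
        LoggerMessage.Define<string, int, TimeSpan>(
            LogLevel.Warning,
            new EventId(1, "ShardCircuitOpened"), // event ID is an assumption
            "Circuit opened for shard {ShardId} after {FailureCount} failures; breaking for {BreakDuration}");

    private static readonly Action<ILogger, string, string, Exception?> ExcludedFromRouting =
        LoggerMessage.Define<string, string>(
            LogLevel.Warning,
            new EventId(6, "ShardExcludedFromRouting"), // event ID is an assumption
            "Shard {ShardId} excluded from routing: {Reason}");

    // The last error travels as the log entry's exception, not as a template value.
    public static void ShardCircuitOpened(
        this ILogger logger, string shardId, int failureCount,
        TimeSpan breakDuration, Exception? lastError) =>
        CircuitOpened(logger, shardId, failureCount, breakDuration, lastError);

    public static void ShardExcludedFromRouting(
        this ILogger logger, string shardId, string reason) =>
        ExcludedFromRouting(logger, shardId, reason, null);
}
```

Call sites then read as `logger.ShardCircuitOpened("shard-3", 5, TimeSpan.FromSeconds(30), ex)`, with `shard-3` and the values here being illustrative only.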

## Test Matrix

> Per CLAUDE.md Testing Standards

| Test Type | Required? | Scope | Notes |
|-----------|:---------:|-------|-------|
| **UnitTests** | Required | Pipeline provider, health-aware router, circuit state management | Mock Polly pipelines, test routing exclusion logic |
| **GuardTests** | Required | `ShardResiliencePipelineProvider`, configuration builders | Null shard IDs, invalid thresholds, zero concurrency |
| **ContractTests** | Required | `IShardResiliencePipelineProvider`, `IHealthAwareShardRouter` | Verify decorator contract: wraps any `IShardRouter` |
| **PropertyTests** | Required | Circuit breaker determinism, bulkhead fairness | Same failure pattern → same circuit state always; no starvation |
| **IntegrationTests** | Required | Real Polly `ResiliencePipeline` instances (no mocks) | Real circuit breakers, verify state transitions |
| **LoadTests** | Required | Concurrent shard access with failing shards | Verify isolation: healthy shards unaffected by degraded ones |
| **BenchmarkTests** | Justify | Per-shard pipeline lookup overhead only | Pipeline execution overhead is Polly's concern, not ours |

## Implementation Tasks

### Core Abstractions
- [ ] `IShardResiliencePipelineProvider` interface — pipeline per shard
- [ ] `IHealthAwareShardRouter` interface — extends `IShardRouter` with health awareness
- [ ] `ShardHealthStatus` record — shard health snapshot
- [ ] `ShardResilienceOptions` — configuration for circuit breaker, retry, bulkhead, timeout
- [ ] `HealthAwareShardRouter` — decorator implementation, excludes shards with open circuits
- [ ] Integration with existing `IShardedDatabaseHealthMonitor` and `ShardHealthResult`
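
To make the decorator task more concrete, here is a minimal sketch of routing that excludes shards with open circuits. The real `IShardRouter` contract and the `Either<EncinaError, string>` return type are not reproduced; `ISketchShardRouter`, `RouteResult`, and `DictionaryShardRouter` are simplified, hypothetical stand-ins. The sketch also assumes that a key-routed request whose shard is open fails fast rather than being redirected, since its data lives on that shard.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in; the real IShardRouter contract is not shown in this issue.
public interface ISketchShardRouter
{
    string GetShardId(string shardKey);
    IReadOnlyList<string> GetAllShardIds();
}

// Simplified replacement for Either<EncinaError, string>.
public readonly record struct RouteResult(string? ShardId, string? Error)
{
    public bool IsSuccess => Error is null;
}

// Trivial inner router used only to exercise the decorator.
public sealed class DictionaryShardRouter : ISketchShardRouter
{
    private readonly IReadOnlyDictionary<string, string> _keyToShard;

    public DictionaryShardRouter(IReadOnlyDictionary<string, string> keyToShard) =>
        _keyToShard = keyToShard;

    public string GetShardId(string shardKey) => _keyToShard[shardKey];

    public IReadOnlyList<string> GetAllShardIds() =>
        _keyToShard.Values.Distinct().OrderBy(id => id, StringComparer.Ordinal).ToList();
}

// Decorator: delegates routing to the inner router, then applies circuit state.
public sealed class HealthAwareShardRouterSketch : ISketchShardRouter
{
    private readonly ISketchShardRouter _inner;
    private readonly Func<string, bool> _isCircuitOpen;

    public HealthAwareShardRouterSketch(ISketchShardRouter inner, Func<string, bool> isCircuitOpen)
    {
        _inner = inner;
        _isCircuitOpen = isCircuitOpen;
    }

    public string GetShardId(string shardKey) => _inner.GetShardId(shardKey);

    public IReadOnlyList<string> GetAllShardIds() => _inner.GetAllShardIds();

    // Single-shard routing: fail fast when the owning shard's circuit is open.
    public RouteResult RouteToHealthyShard(string shardKey)
    {
        var shardId = _inner.GetShardId(shardKey);
        return _isCircuitOpen(shardId)
            ? new RouteResult(null, $"Shard '{shardId}' excluded from routing: circuit open")
            : new RouteResult(shardId, null);
    }

    // Scatter-gather routing: query only shards whose circuits are not open.
    public IReadOnlyList<string> GetHealthyShardIds() =>
        _inner.GetAllShardIds().Where(id => !_isCircuitOpen(id)).ToList();
}
```

In a full implementation the `isCircuitOpen` callback would be backed by `IShardResiliencePipelineProvider.GetCircuitState` and, where configured, by `IShardedDatabaseHealthMonitor`.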

### Polly Integration (Encina.Polly)
- [ ] `ShardResiliencePipelineProvider` — manages `ConcurrentDictionary<string, ResiliencePipeline>`
- [ ] `ShardResiliencePipelineBehavior<TRequest, TResponse>` — MediatR pipeline behavior
- [ ] Per-shard Polly pipeline builder: circuit breaker + retry + bulkhead + timeout composition
- [ ] `WithPerShardResilience()` fluent extension method on sharding builder
- [ ] DI registration: `AddPerShardResilience()` in `ServiceCollectionExtensions`

### Observability
- [ ] Counters: circuit state changes, retries, bulkhead rejections, timeouts
- [ ] Gauges: bulkhead active/queued count, healthy shard count
- [ ] Histogram: circuit open duration
- [ ] Trace attributes on resilience execution span
- [ ] Health checks: `ShardCircuitBreakerHealthCheck`, `ShardBulkheadHealthCheck`
- [ ] Structured logging events (6 events as specified)

### Testing
- [ ] UnitTests: Pipeline provider, health-aware router, circuit state management
- [ ] GuardTests: Null shard IDs, invalid configuration values
- [ ] ContractTests: `IShardResiliencePipelineProvider` and `IHealthAwareShardRouter` contracts
- [ ] PropertyTests: Circuit breaker determinism, bulkhead fairness under random failure patterns
- [ ] IntegrationTests: Real Polly pipelines, state transitions, health-aware routing
- [ ] LoadTests: Concurrent shard access with selective shard failures

### Documentation
- [ ] XML doc comments on all public resilience APIs
- [ ] Usage guide: per-shard resilience configuration, circuit breaker tuning, bulkhead sizing
- [ ] Architecture guide: resilience pipeline composition diagram
- [ ] CHANGELOG.md update

## Acceptance Criteria

- [ ] `IShardResiliencePipelineProvider` provides independent pipelines per shard
- [ ] Circuit breakers operate independently per shard (shard-3 open doesn't affect shard-1)
- [ ] `IHealthAwareShardRouter` excludes shards with open circuit breakers
- [ ] Retry policies apply per-shard with independent backoff
- [ ] Bulkhead isolation prevents one shard from exhausting thread pools
- [ ] Circuit state transitions are observable via metrics and logs
- [ ] Integration with existing `IShardedDatabaseHealthMonitor` works
- [ ] OpenTelemetry metrics, traces, and health checks added
- [ ] All test types per Test Matrix implemented (or justified)
- [ ] Zero build warnings
- [ ] Code coverage >= 85%

## Documentation

- [ ] XML doc comments with `<example>` tags
- [ ] Usage guide: per-shard resilience patterns, tuning guidance
- [ ] Architecture: resilience pipeline composition and routing flow
- [ ] CHANGELOG.md update

## Related Issues

- #289 - Database Sharding Abstractions (parent feature — completed)
- #308 - Polly Integration (dependency — provides base Polly infrastructure)
- #644 - Read/Write Separation (per-replica resilience extends this pattern)
- #648 - Online Resharding (resharding benefits from circuit breaker protection during migration)
- [Sharding Enhancement Study](docs/plans/sharding-enhancement-study.md) — Enhancement #6, Tier 1

