Commit d6ef4f7

Merge pull request #798 from Chris0Jeky/ops/cost-guardrails-budget-observability
OPS-12: Cloud cost observability and budget-guardrail automation
2 parents 6f47d6c + 4d3af45 commit d6ef4f7

7 files changed: +3199 -2452 lines changed

docs/IMPLEMENTATION_MASTERPLAN.md

Lines changed: 1285 additions & 1290 deletions
Large diffs are not rendered by default.

docs/STATUS.md

Lines changed: 1141 additions & 1145 deletions
Large diffs are not rendered by default.
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+# ADR-0026: Cloud Cost Observability and Budget Guardrails
+
+- **Status**: Accepted
+- **Date**: 2026-04-09
+- **Deciders**: Project maintainers
+
+## Context
+
+Taskdeck is transitioning from a purely local-first SQLite tool to a cloud-hosted deployment model (see ADR-0014, platform expansion strategy). Cloud hosting introduces ongoing variable costs that do not exist in local-first operation: compute instances, LLM API calls, storage growth, logging/telemetry volume, network egress, and DNS/domain hosting.
+
+Three characteristics make proactive cost observability essential:
+
+1. **LLM API calls are high-variance**: A single user session with tool-calling can generate 5+ provider round-trips. OpenAI GPT-4o-mini and Gemini 2.5 Flash have different pricing structures, so they must be tracked separately rather than treated as equivalent. The GPT-4o-mini reference model in SPIKE_618 cost roughly $0.00088 per 3-round conversation, but that estimate is only a baseline.
+
+2. **Local-first heritage means no existing cloud cost discipline**: The team has never operated cloud infrastructure at scale. Without explicit budget guardrails, cost surprises are likely during the v0.2.0 cloud launch.
+
+3. **Several features have high-variance cost scaling**: LLM token consumption grows faster than request count when tool-calling multiplies per-message cost, logging volume scales with request count and verbosity configuration, and database storage grows continuously with audit trail accumulation. Even linearly-scaling features like SignalR connections become cost-relevant at scale.
+
+Issue #104 (OPS-12) requires establishing cost visibility, budget alerting, and mitigation playbooks before cloud deployment begins.
+
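The baseline figure in the Context section can be turned into a rough monthly projection. The sketch below is illustrative only: it assumes every provider round-trip costs the same as the average round in the SPIKE_618 baseline, which the ADR itself warns is optimistic, and the function name and traffic volumes are hypothetical.

```python
from decimal import Decimal

# Baseline from the ADR: roughly $0.00088 per 3-round conversation
# (GPT-4o-mini reference model, SPIKE_618).
BASELINE_COST_PER_CONVERSATION = Decimal("0.00088")
BASELINE_ROUNDS = 3


def projected_monthly_cost(
    conversations_per_day: int,
    rounds_per_conversation: int,
    days: int = 30,
) -> Decimal:
    """Linear projection in USD.

    ASSUMPTION: every round costs the baseline average. Real token usage
    can grow faster than this when tool-calling inflates context size.
    """
    total_rounds = conversations_per_day * rounds_per_conversation * days
    return BASELINE_COST_PER_CONVERSATION * total_rounds / BASELINE_ROUNDS


# Hypothetical traffic: 1,000 conversations per day.
print(projected_monthly_cost(1000, 3))  # baseline-shaped sessions
print(projected_monthly_cost(1000, 5))  # sessions with 5 round-trips
```

Under these assumptions, 1,000 conversations per day comes to about 26.40 USD per 30 days at 3 rounds each and about 44 USD at 5 rounds each. The projection is best read as a floor to calibrate against real billing data, not as a forecast.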
+## Decision
+
+Establish a proactive cloud cost observability framework with three layers:
+
+1. **Cost telemetry and dashboards**: Define cost dimensions (compute, storage, LLM API, logging, network, CI/CD), track them through cloud provider billing APIs and application-level metrics, and maintain a monthly cost review workflow.
+
+2. **Budget alert thresholds**: Implement tiered alerting at 70% (warning), 90% (critical), and 100% (hard cap) of monthly budget. Alerts route to documented owners with escalation paths.
+
+3. **Feature-level cost hotspot registry**: Maintain a living document mapping high-variance features to their cost drivers, scaling behavior, mitigation levers, and action owners. This registry is reviewed monthly alongside the cost dashboard.
+
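The tiered thresholds in layer 2 reduce to a small classification step. The following is a minimal sketch, assuming month-to-date spend and the monthly budget are already available as numbers; `AlertTier` and `classify_spend` are hypothetical names, not part of the Taskdeck codebase.

```python
from enum import Enum


class AlertTier(Enum):
    OK = "ok"
    WARNING = "warning"    # 70% of monthly budget
    CRITICAL = "critical"  # 90% of monthly budget
    HARD_CAP = "hard_cap"  # 100% of monthly budget


# Thresholds from the ADR as whole percentages, checked highest first.
THRESHOLDS = (
    (100, AlertTier.HARD_CAP),
    (90, AlertTier.CRITICAL),
    (70, AlertTier.WARNING),
)


def classify_spend(month_to_date: float, monthly_budget: float) -> AlertTier:
    """Return the highest alert tier that the current spend has reached."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    for percent, tier in THRESHOLDS:
        # Cross-multiply instead of dividing to avoid a fractional ratio.
        if month_to_date * 100 >= percent * monthly_budget:
            return tier
    return AlertTier.OK


print(classify_spend(45.0, 100.0).value)
print(classify_spend(92.5, 100.0).value)
```

Routing each tier to an owner and escalation path is left out here, since the ADR defers those details to the runbook.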
+Supporting artifacts:
+- `docs/ops/CLOUD_COST_OBSERVABILITY.md` - framework, dimensions, review workflow
+- `docs/ops/COST_HOTSPOT_REGISTRY.md` - feature-level cost risk tracking
+- `docs/ops/BUDGET_BREACH_RUNBOOK.md` - detection-to-resolution playbook
+
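Layer 1 pairs provider billing data with application-level metrics. One way to picture the application-level half is a per-feature, per-provider cost rollup, sketched below. Everything named here is an assumption: the `LlmCall` record, the feature labels, and especially the prices, which are placeholders and not real provider pricing.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LlmCall:
    feature: str   # hypothetical feature label, e.g. "chat"
    provider: str  # kept separate per provider, as the ADR requires
    input_tokens: int
    output_tokens: int


# PLACEHOLDER prices in USD per 1M tokens as (input, output).
# These are NOT real provider prices; substitute current published rates.
PRICE_PER_MILLION = {
    "provider-a": (Decimal("1.00"), Decimal("2.00")),
    "provider-b": (Decimal("0.50"), Decimal("4.00")),
}


def call_cost(call: LlmCall) -> Decimal:
    """Cost of one call in USD under the placeholder price table."""
    input_price, output_price = PRICE_PER_MILLION[call.provider]
    raw = call.input_tokens * input_price + call.output_tokens * output_price
    return raw / 1_000_000


def cost_by_feature(calls):
    """Sum call costs keyed by (feature, provider)."""
    totals = defaultdict(Decimal)
    for call in calls:
        totals[(call.feature, call.provider)] += call_cost(call)
    return dict(totals)


calls = [
    LlmCall("chat", "provider-a", 1_000_000, 500_000),
    LlmCall("chat", "provider-a", 0, 500_000),
    LlmCall("capture", "provider-b", 2_000_000, 0),
]
for key, total in sorted(cost_by_feature(calls).items()):
    print(key, total)
```

Keying by provider as well as feature keeps differently priced models from being averaged together, which matches the ADR's point about tracking providers separately.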
+## Alternatives Considered
+
+- **Reactive-only cost management**: Wait for cost surprises and address them as incidents. Rejected because LLM API costs can spike rapidly (a bug enabling unbounded tool-calling loops could exhaust a monthly budget in hours), and cloud provider billing is typically delayed 4-24 hours.
+
+- **Third-party cost management platform (e.g., Kubecost, Vantage, CloudHealth)**: Adds operational complexity and cost. The current single-node deployment (see `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md`) does not justify a dedicated cost management tool. Revisit when multi-node or multi-cloud deployment is in scope.
+
+- **Cloud provider native budgets only (AWS Budgets)**: Necessary but insufficient. AWS Budgets alone cannot correlate application-level behavior (e.g., which feature or user is driving LLM cost) with billing data. The framework uses provider budgets as the alerting backbone while adding application-level cost attribution.
+
+- **Hard spending caps with automatic shutdown**: Too aggressive for a product with active users. The framework uses graduated mitigation (rate-limit, degrade, scale-down) rather than hard shutdown, preserving non-LLM functionality during cost incidents.
+
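The graduated mitigation named in the last alternative can be pictured as a cumulative ladder. The sketch below takes the three steps from the ADR (rate-limit, degrade, scale-down), but the utilization level at which each step engages is an assumption made for illustration; the ADR does not specify that mapping.

```python
# Mitigation steps named in the ADR, in escalation order.
# ASSUMPTION: pairing each step with a budget-utilization percentage is
# hypothetical; the runbook, not this sketch, defines the real triggers.
MITIGATION_LADDER = (
    (70, "rate-limit"),
    (90, "degrade"),
    (100, "scale-down"),
)


def mitigations_for(utilization_pct: float) -> list[str]:
    """Return every mitigation step engaged at this utilization.

    Steps accumulate instead of replacing one another, and none of them
    is a full shutdown, so non-LLM functionality stays available.
    """
    return [
        action
        for threshold, action in MITIGATION_LADDER
        if utilization_pct >= threshold
    ]


for pct in (50, 75, 95, 130):
    print(pct, mitigations_for(pct))
```

Because the ladder never returns a shutdown step, spend above 100% still resolves to the three graduated actions, which is the behavior the ADR prefers over an automatic hard stop.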
+## Consequences
+
+**Positive**:
+- Cost surprises during v0.2.0 cloud launch are caught early through tiered alerts.
+- Monthly review cadence creates institutional knowledge about cost trends before they become emergencies.
+- Feature owners have explicit accountability for cost-impacting decisions.
+- Budget breach runbook reduces mean-time-to-mitigate for cost incidents.
+
+**Negative**:
+- Monthly review workflow adds operational overhead (estimated 30-60 minutes per review).
+- Cost estimates in the hotspot registry are approximations that require calibration against real production data.
+- Alert thresholds may need tuning during initial cloud operation - too sensitive causes alert fatigue, too loose defeats the purpose.
+
+**Neutral**:
+- Cost observability artifacts become part of the ops documentation surface that must be maintained alongside infrastructure changes.
+- The framework is cloud-provider-aware (AWS-focused given the Terraform baseline) but the principles are portable.
+
+## References
+
+- Issue: #104 (OPS-12: Cloud cost observability and budget-guardrail automation)
+- Terraform baseline: `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md` (#102)
+- Observability baseline: `docs/ops/OBSERVABILITY_BASELINE.md` (#68)
+- LLM cost context: `docs/spikes/SPIKE_618_COMPLETED.md` (tool-calling cost model)
+- Managed-key quota policy: `docs/security/MANAGED_KEY_USAGE_POLICY.md` (#240)
+- Platform expansion strategy: ADR-0014
+- Disaster recovery runbook: `docs/ops/DISASTER_RECOVERY_RUNBOOK.md` (#86)

docs/decisions/INDEX.md

Lines changed: 17 additions & 17 deletions
@@ -5,29 +5,29 @@
 | [0001](ADR-0001-clean-architecture-layering.md) | Clean Architecture Layering | Accepted | 2025 |
 | [0002](ADR-0002-claims-first-identity.md) | Claims-First Identity Model | Accepted | 2026-01 |
 | [0003](ADR-0003-proposal-first-automation.md) | Proposal-First Automation (Review-First Safety) | Accepted | 2026-02-23 |
-| [0004](ADR-0004-multi-tenancy-shared-schema.md) | Multi-Tenancy Shared Schema + TenantId | Accepted | 2026-02-22 |
-| [0005](ADR-0005-capture-model-queue-wrapper.md) | Capture Model Queue-Wrapper MVP | Accepted | 2026-02-23 |
-| [0006](ADR-0006-llm-provider-mock-default.md) | LLM Provider Mock-Default with Config-Gated Live Providers | Accepted | 2026-02 |
+| [0004](ADR-0004-multi-tenancy-shared-schema.md) | Multi-Tenancy - Shared Schema + TenantId | Accepted | 2026-02-22 |
+| [0005](ADR-0005-capture-model-queue-wrapper.md) | Capture Model - Queue-Wrapper MVP | Accepted | 2026-02-23 |
+| [0006](ADR-0006-llm-provider-mock-default.md) | LLM Provider - Mock-Default with Config-Gated Live Providers | Accepted | 2026-02 |
 | [0007](ADR-0007-stable-error-contracts.md) | Stable Error Contracts (ApiErrorResponse) | Accepted | 2026-01 |
 | [0008](ADR-0008-novice-first-product-legibility.md) | Novice-First Product Legibility Before Breadth | Accepted | 2026-03-07 |
-| [0009](ADR-0009-session-token-storage.md) | Session Token Storage localStorage with Mitigations | Accepted | 2026-03-28 |
-| [0010](ADR-0010-frontend-primitive-stack-shadcn-vue.md) | Frontend Primitive Stack shadcn-vue | Accepted | 2026-03-28 |
-| [0011](ADR-0011-design-tokens-obsidian-ember.md) | Design Token System Obsidian & Ember Theme | Accepted | 2026-02-23 |
+| [0009](ADR-0009-session-token-storage.md) | Session Token Storage - localStorage with Mitigations | Accepted | 2026-03-28 |
+| [0010](ADR-0010-frontend-primitive-stack-shadcn-vue.md) | Frontend Primitive Stack - shadcn-vue | Accepted | 2026-03-28 |
+| [0011](ADR-0011-design-tokens-obsidian-ember.md) | Design Token System - Obsidian & Ember Theme | Accepted | 2026-02-23 |
 | [0012](ADR-0012-signalr-realtime-with-polling-fallback.md) | SignalR Realtime with Polling Fallback | Accepted | 2026-02 |
-| [0013](ADR-0013-ci-topology-reusable-workflows.md) | CI Topology Reusable Workflow Decomposition | Accepted | 2026-03 |
-| [0014](ADR-0014-platform-expansion-four-pillars.md) | Platform Expansion Four Pillars | Proposed | 2026-03-29 |
-| [0015](ADR-0015-starter-pack-idempotent-apply.md) | Starter Pack Idempotent Apply with Conflict Detection | Accepted | 2026-02 |
+| [0013](ADR-0013-ci-topology-reusable-workflows.md) | CI Topology - Reusable Workflow Decomposition | Accepted | 2026-03 |
+| [0014](ADR-0014-platform-expansion-four-pillars.md) | Platform Expansion - Four Pillars | Proposed | 2026-03-29 |
+| [0015](ADR-0015-starter-pack-idempotent-apply.md) | Starter Pack - Idempotent Apply with Conflict Detection | Accepted | 2026-02 |
 | [0016](ADR-0016-security-logging-redaction.md) | Security Logging Redaction for Sensitive Flows | Accepted | 2026-02-23 |
-| [0017](ADR-0017-agent-tool-registry-review-first.md) | Agent Tool Registry Review-First by Default | Accepted | 2026-03 |
-| [0018](ADR-0018-llm-tool-calling-custom-over-semantic-kernel.md) | LLM Tool-Calling Custom Implementation over Semantic Kernel | Accepted | 2026-04-01 |
-| [0019](ADR-0019-mcp-server-official-sdk-embedded-hosting.md) | MCP Server Official SDK with Embedded Hosting | Accepted | 2026-04-01 |
+| [0017](ADR-0017-agent-tool-registry-review-first.md) | Agent Tool Registry - Review-First by Default | Accepted | 2026-03 |
+| [0018](ADR-0018-llm-tool-calling-custom-over-semantic-kernel.md) | LLM Tool-Calling - Custom Implementation over Semantic Kernel | Accepted | 2026-04-01 |
+| [0019](ADR-0019-mcp-server-official-sdk-embedded-hosting.md) | MCP Server - Official SDK with Embedded Hosting | Accepted | 2026-04-01 |
 | [0020](ADR-0020-plugin-extension-architecture.md) | Plugin/Extension Architecture RFC and Sandboxing Constraints | Proposed | 2026-04-01 |
-| [0021](ADR-0021-jwt-invalidation-user-active-middleware.md) | JWT Invalidation User-Active Middleware over Token Blocklist | Accepted | 2026-04-03 |
-| [0022](ADR-0022-analytics-export-csv-first-pdf-deferred.md) | Analytics Export CSV First, PDF Deferred | Accepted | 2026-04-08 |
+| [0021](ADR-0021-jwt-invalidation-user-active-middleware.md) | JWT Invalidation - User-Active Middleware over Token Blocklist | Accepted | 2026-04-03 |
+| [0022](ADR-0022-analytics-export-csv-first-pdf-deferred.md) | Analytics Export - CSV First, PDF Deferred | Accepted | 2026-04-08 |
 | [0023](ADR-0023-sqlite-to-postgresql-migration-strategy.md) | SQLite-to-PostgreSQL Migration Strategy | Accepted | 2026-04-09 |
-| [0024](ADR-0024-distributed-caching-cache-aside.md) | Distributed Caching Cache-Aside with Redis/InMemory Fallback | Accepted | 2026-04-09 |
-| [0025](ADR-0025-signalr-scaleout-redis-backplane.md) | SignalR Scale-Out Redis Backplane | Accepted | 2026-04-09 |
+| [0024](ADR-0024-distributed-caching-cache-aside.md) | Distributed Caching - Cache-Aside with Redis/InMemory Fallback | Accepted | 2026-04-09 |
+| [0025](ADR-0025-signalr-scaleout-redis-backplane.md) | SignalR Scale-Out - Redis Backplane | Accepted | 2026-04-09 |
 | [0026](ADR-0026-cloud-cost-observability.md) | Cloud Cost Observability and Budget Guardrails | Accepted | 2026-04-09 |
 | [0027](ADR-0027-cloud-target-topology-autoscaling.md) | Cloud Target Topology and Autoscaling Reference Architecture | Accepted | 2026-04-09 |
-| [0028](ADR-0028-staged-deployment-bluegreen-canary.md) | Staged Deployment Blue/Green with Canary Verification | Accepted | 2026-04-09 |
+| [0028](ADR-0028-staged-deployment-bluegreen-canary.md) | Staged Deployment - Blue/Green with Canary Verification | Accepted | 2026-04-09 |
 | [0029](ADR-0029-oidc-mfa-pluggable-identity.md) | OIDC/SSO Integration with Optional TOTP MFA | Accepted | 2026-04-09 |
