|
| 1 | +# ADR-0026: Cloud Cost Observability and Budget Guardrails |
| 2 | + |
| 3 | +- **Status**: Accepted |
| 4 | +- **Date**: 2026-04-09 |
| 5 | +- **Deciders**: Project maintainers |
| 6 | + |
| 7 | +## Context |
| 8 | + |
| 9 | +Taskdeck is transitioning from a purely local-first SQLite tool to a cloud-hosted deployment model (see ADR-0014, platform expansion strategy). Cloud hosting introduces ongoing variable costs that do not exist in local-first operation: compute instances, LLM API calls, storage growth, logging/telemetry volume, network egress, and DNS/domain hosting. |
| 10 | + |
| 11 | +Three characteristics make proactive cost observability essential: |
| 12 | + |
| 13 | +1. **LLM API calls are high-variance**: A single user session with tool-calling can generate 5+ provider round-trips. OpenAI GPT-4o-mini and Gemini 2.5 Flash have different pricing structures, so they must be tracked separately rather than treated as equivalent. The GPT-4o-mini reference model in SPIKE_618 cost roughly $0.00088 per 3-round conversation, but that estimate is only a baseline. |
| 14 | + |
| 15 | +2. **Local-first heritage means no existing cloud cost discipline**: The team has never operated cloud infrastructure at scale. Without explicit budget guardrails, cost surprises are likely during the v0.2.0 cloud launch. |
| 16 | + |
| 17 | +3. **Several features have high-variance cost scaling**: LLM token consumption grows faster than request count when tool-calling multiplies per-message cost, logging volume scales with request count and verbosity configuration, and database storage grows continuously with audit trail accumulation. Even linearly-scaling features like SignalR connections become cost-relevant at scale. |
| 18 | + |
| 19 | +Issue #104 (OPS-12) requires establishing cost visibility, budget alerting, and mitigation playbooks before cloud deployment begins. |
| 20 | + |
| 21 | +## Decision |
| 22 | + |
| 23 | +Establish a proactive cloud cost observability framework with three layers: |
| 24 | + |
| 25 | +1. **Cost telemetry and dashboards**: Define cost dimensions (compute, storage, LLM API, logging, network, CI/CD), track them through cloud provider billing APIs and application-level metrics, and maintain a monthly cost review workflow. |
| 26 | + |
| 27 | +2. **Budget alert thresholds**: Implement tiered alerting at 70% (warning), 90% (critical), and 100% (hard cap) of monthly budget. Alerts route to documented owners with escalation paths. |
| 28 | + |
| 29 | +3. **Feature-level cost hotspot registry**: Maintain a living document mapping high-variance features to their cost drivers, scaling behavior, mitigation levers, and action owners. This registry is reviewed monthly alongside the cost dashboard. |
| 30 | + |
| 31 | +Supporting artifacts: |
| 32 | +- `docs/ops/CLOUD_COST_OBSERVABILITY.md` - framework, dimensions, review workflow |
| 33 | +- `docs/ops/COST_HOTSPOT_REGISTRY.md` - feature-level cost risk tracking |
| 34 | +- `docs/ops/BUDGET_BREACH_RUNBOOK.md` - detection-to-resolution playbook |
| 35 | + |
| 36 | +## Alternatives Considered |
| 37 | + |
| 38 | +- **Reactive-only cost management**: Wait for cost surprises and address them as incidents. Rejected because LLM API costs can spike rapidly (a bug enabling unbounded tool-calling loops could exhaust a monthly budget in hours), and cloud provider billing is typically delayed 4-24 hours. |
| 39 | + |
| 40 | +- **Third-party cost management platform (e.g., Kubecost, Vantage, CloudHealth)**: Adds operational complexity and cost. The current single-node deployment (see `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md`) does not justify a dedicated cost management tool. Revisit when multi-node or multi-cloud deployment is in scope. |
| 41 | + |
| 42 | +- **Cloud provider native budgets only (AWS Budgets)**: Necessary but insufficient. AWS Budgets alone cannot correlate application-level behavior (e.g., which feature or user is driving LLM cost) with billing data. The framework uses provider budgets as the alerting backbone while adding application-level cost attribution. |
| 43 | + |
| 44 | +- **Hard spending caps with automatic shutdown**: Too aggressive for a product with active users. The framework uses graduated mitigation (rate-limit, degrade, scale-down) rather than hard shutdown, preserving non-LLM functionality during cost incidents. |
| 45 | + |
| 46 | +## Consequences |
| 47 | + |
| 48 | +**Positive**: |
| 49 | +- Cost surprises during v0.2.0 cloud launch are caught early through tiered alerts. |
| 50 | +- Monthly review cadence creates institutional knowledge about cost trends before they become emergencies. |
| 51 | +- Feature owners have explicit accountability for cost-impacting decisions. |
| 52 | +- Budget breach runbook reduces mean-time-to-mitigate for cost incidents. |
| 53 | + |
| 54 | +**Negative**: |
| 55 | +- Monthly review workflow adds operational overhead (estimated 30-60 minutes per review). |
| 56 | +- Cost estimates in the hotspot registry are approximations that require calibration against real production data. |
| 57 | +- Alert thresholds may need tuning during initial cloud operation - too sensitive causes alert fatigue, too loose defeats the purpose. |
| 58 | + |
| 59 | +**Neutral**: |
| 60 | +- Cost observability artifacts become part of the ops documentation surface that must be maintained alongside infrastructure changes. |
| 61 | +- The framework is cloud-provider-aware (AWS-focused given the Terraform baseline) but the principles are portable. |
| 62 | + |
| 63 | +## References |
| 64 | + |
| 65 | +- Issue: #104 (OPS-12: Cloud cost observability and budget-guardrail automation) |
| 66 | +- Terraform baseline: `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md` (#102) |
| 67 | +- Observability baseline: `docs/ops/OBSERVABILITY_BASELINE.md` (#68) |
| 68 | +- LLM cost context: `docs/spikes/SPIKE_618_COMPLETED.md` (tool-calling cost model) |
| 69 | +- Managed-key quota policy: `docs/security/MANAGED_KEY_USAGE_POLICY.md` (#240) |
| 70 | +- Platform expansion strategy: ADR-0014 |
| 71 | +- Disaster recovery runbook: `docs/ops/DISASTER_RECOVERY_RUNBOOK.md` (#86) |
0 commit comments