Merged
2,575 changes: 1,285 additions & 1,290 deletions docs/IMPLEMENTATION_MASTERPLAN.md

Large diffs are not rendered by default.

2,286 changes: 1,141 additions & 1,145 deletions docs/STATUS.md

Large diffs are not rendered by default.

71 changes: 71 additions & 0 deletions docs/decisions/ADR-0026-cloud-cost-observability.md
@@ -0,0 +1,71 @@
# ADR-0026: Cloud Cost Observability and Budget Guardrails

- **Status**: Accepted
- **Date**: 2026-04-09
- **Deciders**: Project maintainers

## Context

Taskdeck is transitioning from a purely local-first SQLite tool to a cloud-hosted deployment model (see ADR-0014, platform expansion strategy). Cloud hosting introduces ongoing variable costs that do not exist in local-first operation: compute instances, LLM API calls, storage growth, logging/telemetry volume, network egress, and DNS/domain hosting.

Three characteristics make proactive cost observability essential:

1. **LLM API calls are high-variance**: A single user session with tool-calling can generate 5+ provider round-trips. OpenAI GPT-4o-mini and Gemini 2.5 Flash have different pricing structures, so they must be tracked separately rather than treated as equivalent. The GPT-4o-mini reference model in SPIKE_618 cost roughly $0.00088 per 3-round conversation, but that estimate is only a baseline.

2. **Local-first heritage means no existing cloud cost discipline**: The team has never operated cloud infrastructure at scale. Without explicit budget guardrails, cost surprises are likely during the v0.2.0 cloud launch.

3. **Several features have high-variance cost scaling**: LLM token consumption grows faster than request count when tool-calling multiplies per-message cost; logging volume scales with request count and verbosity configuration; and database storage grows continuously as the audit trail accumulates. Even linearly scaling features such as SignalR connections become cost-relevant at scale.
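
As a rough illustration of how the SPIKE_618 baseline translates into a monthly figure, the sketch below scales the quoted $0.00088 per 3-round conversation by traffic and round count. The function name, the traffic numbers, and the linear per-round scaling are assumptions made for illustration only; because tool-calling can make token consumption grow faster than request count, a linear projection likely understates real cost and should be treated as a floor.

```python
# Baseline quoted in this ADR from SPIKE_618 (GPT-4o-mini reference model).
BASELINE_COST_PER_CONVERSATION_USD = 0.00088
BASELINE_ROUNDS = 3


def estimate_monthly_llm_cost(
    conversations_per_day: int,
    avg_rounds: float,
    days: int = 30,
) -> float:
    """Project monthly LLM spend in USD from the baseline.

    Assumption: cost scales linearly with the number of provider
    round-trips. Real cost probably grows faster, so read the result
    as a lower bound rather than a forecast.
    """
    if conversations_per_day < 0 or avg_rounds < 0 or days < 0:
        raise ValueError("inputs must be non-negative")
    cost_per_round = BASELINE_COST_PER_CONVERSATION_USD / BASELINE_ROUNDS
    return conversations_per_day * avg_rounds * cost_per_round * days


if __name__ == "__main__":
    # Hypothetical traffic: 1,000 conversations per day.
    for rounds in (3, 5):
        cost = estimate_monthly_llm_cost(1000, rounds)
        print(f"{rounds} rounds per conversation: ${cost:.2f} per month")
```

Under these hypothetical inputs the projection is about $26 per month at 3 rounds and about $44 at 5 rounds, which shows how round-trip count alone moves the estimate before any non-linear token growth is considered.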

Issue #104 (OPS-12) requires establishing cost visibility, budget alerting, and mitigation playbooks before cloud deployment begins.

## Decision

Establish a proactive cloud cost observability framework with three layers:

1. **Cost telemetry and dashboards**: Define cost dimensions (compute, storage, LLM API, logging, network, CI/CD), track them through cloud provider billing APIs and application-level metrics, and maintain a monthly cost review workflow.

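2. **Budget alert thresholds**: Implement tiered alerting at 70% (warning), 90% (critical), and 100% (hard cap) of the monthly budget. Reaching the hard cap triggers graduated mitigation rather than automatic shutdown (see Alternatives Considered). Alerts route to documented owners with escalation paths.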

3. **Feature-level cost hotspot registry**: Maintain a living document mapping high-variance features to their cost drivers, scaling behavior, mitigation levers, and action owners. This registry is reviewed monthly alongside the cost dashboard.
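
The tiered thresholds in layer 2 can be sketched as a small classifier. This is a minimal illustration, not the implemented alerting path: `AlertLevel` and `classify_spend` are hypothetical names, and in practice the comparison would more likely live in provider-side budget configuration than in application code.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    WARNING = "warning"    # 70% of monthly budget
    CRITICAL = "critical"  # 90% of monthly budget
    HARD_CAP = "hard_cap"  # 100% of monthly budget


# Ordered from highest to lowest so the first match is the most severe tier.
THRESHOLDS = (
    (1.00, AlertLevel.HARD_CAP),
    (0.90, AlertLevel.CRITICAL),
    (0.70, AlertLevel.WARNING),
)


def classify_spend(month_to_date_usd: float, monthly_budget_usd: float) -> AlertLevel:
    """Return the alert tier for the current month-to-date spend."""
    if monthly_budget_usd <= 0:
        raise ValueError("monthly budget must be positive")
    ratio = month_to_date_usd / monthly_budget_usd
    for threshold, level in THRESHOLDS:
        if ratio >= threshold:
            return level
    return AlertLevel.OK


if __name__ == "__main__":
    # Hypothetical budget of $100 per month.
    for spend in (50, 75, 95, 120):
        print(spend, classify_spend(spend, 100).value)
```

Keeping the tiers in one ordered table makes later threshold tuning a data change rather than a logic change.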

Supporting artifacts:
- `docs/ops/CLOUD_COST_OBSERVABILITY.md` - framework, dimensions, review workflow
- `docs/ops/COST_HOTSPOT_REGISTRY.md` - feature-level cost risk tracking
- `docs/ops/BUDGET_BREACH_RUNBOOK.md` - detection-to-resolution playbook

## Alternatives Considered

- **Reactive-only cost management**: Wait for cost surprises and address them as incidents. Rejected because LLM API costs can spike rapidly (a bug enabling unbounded tool-calling loops could exhaust a monthly budget in hours), while cloud provider billing data typically lags by 4-24 hours, so a purely reactive response could arrive after the budget is already spent.

- **Third-party cost management platform (e.g., Kubecost, Vantage, CloudHealth)**: Adds operational complexity and cost. The current single-node deployment (see `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md`) does not justify a dedicated cost management tool. Revisit when multi-node or multi-cloud deployment is in scope.

- **Cloud provider native budgets only (AWS Budgets)**: Necessary but insufficient. AWS Budgets alone cannot correlate application-level behavior (e.g., which feature or user is driving LLM cost) with billing data. The framework uses provider budgets as the alerting backbone while adding application-level cost attribution.

- **Hard spending caps with automatic shutdown**: Too aggressive for a product with active users. The framework uses graduated mitigation (rate-limit, degrade, scale-down) rather than hard shutdown, preserving non-LLM functionality during cost incidents.
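
The application-level cost attribution mentioned under the provider-native budgets alternative could look roughly like the sketch below, which groups LLM usage by feature and provider so that the two providers are tracked separately. Everything here is an assumption for illustration: the `LlmUsageEvent` shape, the feature names, and especially the per-token prices, which are placeholders and not real provider pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmUsageEvent:
    """Hypothetical record emitted once per provider round-trip."""
    feature: str
    provider: str
    input_tokens: int
    output_tokens: int


# PLACEHOLDER prices in USD per one million tokens. These are NOT real
# provider prices; substitute the current published rates before use.
PLACEHOLDER_PRICES = {
    "gpt-4o-mini": {"input": 1.00, "output": 2.00},
    "gemini-2.5-flash": {"input": 3.00, "output": 4.00},
}


def attribute_costs(events, prices):
    """Sum estimated cost in USD per (feature, provider) pair."""
    totals = defaultdict(float)
    for event in events:
        price = prices[event.provider]  # KeyError on an unknown provider
        cost = (
            event.input_tokens * price["input"]
            + event.output_tokens * price["output"]
        ) / 1_000_000
        totals[(event.feature, event.provider)] += cost
    return dict(totals)


if __name__ == "__main__":
    sample = [
        LlmUsageEvent("chat", "gpt-4o-mini", 1_000_000, 500_000),
        LlmUsageEvent("capture", "gemini-2.5-flash", 2_000_000, 0),
    ]
    for key, usd in attribute_costs(sample, PLACEHOLDER_PRICES).items():
        print(key, f"${usd:.2f}")
```

Keying totals by feature and provider is what lets a monthly review answer which feature is driving LLM cost, something provider billing data alone cannot show.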

## Consequences

**Positive**:
- Cost surprises during v0.2.0 cloud launch are caught early through tiered alerts.
- Monthly review cadence creates institutional knowledge about cost trends before they become emergencies.
- Feature owners have explicit accountability for cost-impacting decisions.
- Budget breach runbook reduces mean-time-to-mitigate for cost incidents.

**Negative**:
- Monthly review workflow adds operational overhead (estimated 30-60 minutes per review).
- Cost estimates in the hotspot registry are approximations that require calibration against real production data.
- Alert thresholds may need tuning during initial cloud operation: thresholds that are too sensitive cause alert fatigue, while thresholds that are too loose let real overruns go unnoticed.

**Neutral**:
- Cost observability artifacts become part of the ops documentation surface that must be maintained alongside infrastructure changes.
- The framework is cloud-provider-aware (AWS-focused, given the Terraform baseline), but its principles are portable to other providers.

## References

- Issue: #104 (OPS-12: Cloud cost observability and budget-guardrail automation)
- Terraform baseline: `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md` (#102)
- Observability baseline: `docs/ops/OBSERVABILITY_BASELINE.md` (#68)
- LLM cost context: `docs/spikes/SPIKE_618_COMPLETED.md` (tool-calling cost model)
- Managed-key quota policy: `docs/security/MANAGED_KEY_USAGE_POLICY.md` (#240)
- Platform expansion strategy: ADR-0014
- Disaster recovery runbook: `docs/ops/DISASTER_RECOVERY_RUNBOOK.md` (#86)
34 changes: 17 additions & 17 deletions docs/decisions/INDEX.md
@@ -5,29 +5,29 @@
 | [0001](ADR-0001-clean-architecture-layering.md) | Clean Architecture Layering | Accepted | 2025 |
 | [0002](ADR-0002-claims-first-identity.md) | Claims-First Identity Model | Accepted | 2026-01 |
 | [0003](ADR-0003-proposal-first-automation.md) | Proposal-First Automation (Review-First Safety) | Accepted | 2026-02-23 |
-| [0004](ADR-0004-multi-tenancy-shared-schema.md) | Multi-Tenancy Shared Schema + TenantId | Accepted | 2026-02-22 |
-| [0005](ADR-0005-capture-model-queue-wrapper.md) | Capture Model Queue-Wrapper MVP | Accepted | 2026-02-23 |
-| [0006](ADR-0006-llm-provider-mock-default.md) | LLM Provider Mock-Default with Config-Gated Live Providers | Accepted | 2026-02 |
+| [0004](ADR-0004-multi-tenancy-shared-schema.md) | Multi-Tenancy - Shared Schema + TenantId | Accepted | 2026-02-22 |
+| [0005](ADR-0005-capture-model-queue-wrapper.md) | Capture Model - Queue-Wrapper MVP | Accepted | 2026-02-23 |
+| [0006](ADR-0006-llm-provider-mock-default.md) | LLM Provider - Mock-Default with Config-Gated Live Providers | Accepted | 2026-02 |
 | [0007](ADR-0007-stable-error-contracts.md) | Stable Error Contracts (ApiErrorResponse) | Accepted | 2026-01 |
 | [0008](ADR-0008-novice-first-product-legibility.md) | Novice-First Product Legibility Before Breadth | Accepted | 2026-03-07 |
-| [0009](ADR-0009-session-token-storage.md) | Session Token Storage localStorage with Mitigations | Accepted | 2026-03-28 |
-| [0010](ADR-0010-frontend-primitive-stack-shadcn-vue.md) | Frontend Primitive Stack shadcn-vue | Accepted | 2026-03-28 |
-| [0011](ADR-0011-design-tokens-obsidian-ember.md) | Design Token System Obsidian & Ember Theme | Accepted | 2026-02-23 |
+| [0009](ADR-0009-session-token-storage.md) | Session Token Storage - localStorage with Mitigations | Accepted | 2026-03-28 |
+| [0010](ADR-0010-frontend-primitive-stack-shadcn-vue.md) | Frontend Primitive Stack - shadcn-vue | Accepted | 2026-03-28 |
+| [0011](ADR-0011-design-tokens-obsidian-ember.md) | Design Token System - Obsidian & Ember Theme | Accepted | 2026-02-23 |
 | [0012](ADR-0012-signalr-realtime-with-polling-fallback.md) | SignalR Realtime with Polling Fallback | Accepted | 2026-02 |
-| [0013](ADR-0013-ci-topology-reusable-workflows.md) | CI Topology Reusable Workflow Decomposition | Accepted | 2026-03 |
-| [0014](ADR-0014-platform-expansion-four-pillars.md) | Platform Expansion Four Pillars | Proposed | 2026-03-29 |
-| [0015](ADR-0015-starter-pack-idempotent-apply.md) | Starter Pack Idempotent Apply with Conflict Detection | Accepted | 2026-02 |
+| [0013](ADR-0013-ci-topology-reusable-workflows.md) | CI Topology - Reusable Workflow Decomposition | Accepted | 2026-03 |
+| [0014](ADR-0014-platform-expansion-four-pillars.md) | Platform Expansion - Four Pillars | Proposed | 2026-03-29 |
+| [0015](ADR-0015-starter-pack-idempotent-apply.md) | Starter Pack - Idempotent Apply with Conflict Detection | Accepted | 2026-02 |
 | [0016](ADR-0016-security-logging-redaction.md) | Security Logging Redaction for Sensitive Flows | Accepted | 2026-02-23 |
-| [0017](ADR-0017-agent-tool-registry-review-first.md) | Agent Tool Registry Review-First by Default | Accepted | 2026-03 |
-| [0018](ADR-0018-llm-tool-calling-custom-over-semantic-kernel.md) | LLM Tool-Calling Custom Implementation over Semantic Kernel | Accepted | 2026-04-01 |
-| [0019](ADR-0019-mcp-server-official-sdk-embedded-hosting.md) | MCP Server Official SDK with Embedded Hosting | Accepted | 2026-04-01 |
+| [0017](ADR-0017-agent-tool-registry-review-first.md) | Agent Tool Registry - Review-First by Default | Accepted | 2026-03 |
+| [0018](ADR-0018-llm-tool-calling-custom-over-semantic-kernel.md) | LLM Tool-Calling - Custom Implementation over Semantic Kernel | Accepted | 2026-04-01 |
+| [0019](ADR-0019-mcp-server-official-sdk-embedded-hosting.md) | MCP Server - Official SDK with Embedded Hosting | Accepted | 2026-04-01 |
 | [0020](ADR-0020-plugin-extension-architecture.md) | Plugin/Extension Architecture RFC and Sandboxing Constraints | Proposed | 2026-04-01 |
-| [0021](ADR-0021-jwt-invalidation-user-active-middleware.md) | JWT Invalidation User-Active Middleware over Token Blocklist | Accepted | 2026-04-03 |
-| [0022](ADR-0022-analytics-export-csv-first-pdf-deferred.md) | Analytics Export CSV First, PDF Deferred | Accepted | 2026-04-08 |
+| [0021](ADR-0021-jwt-invalidation-user-active-middleware.md) | JWT Invalidation - User-Active Middleware over Token Blocklist | Accepted | 2026-04-03 |
+| [0022](ADR-0022-analytics-export-csv-first-pdf-deferred.md) | Analytics Export - CSV First, PDF Deferred | Accepted | 2026-04-08 |
 | [0023](ADR-0023-sqlite-to-postgresql-migration-strategy.md) | SQLite-to-PostgreSQL Migration Strategy | Accepted | 2026-04-09 |
-| [0024](ADR-0024-distributed-caching-cache-aside.md) | Distributed Caching Cache-Aside with Redis/InMemory Fallback | Accepted | 2026-04-09 |
-| [0025](ADR-0025-signalr-scaleout-redis-backplane.md) | SignalR Scale-Out Redis Backplane | Accepted | 2026-04-09 |
+| [0024](ADR-0024-distributed-caching-cache-aside.md) | Distributed Caching - Cache-Aside with Redis/InMemory Fallback | Accepted | 2026-04-09 |
+| [0025](ADR-0025-signalr-scaleout-redis-backplane.md) | SignalR Scale-Out - Redis Backplane | Accepted | 2026-04-09 |
 | [0026](ADR-0026-cloud-cost-observability.md) | Cloud Cost Observability and Budget Guardrails | Accepted | 2026-04-09 |
 | [0027](ADR-0027-cloud-target-topology-autoscaling.md) | Cloud Target Topology and Autoscaling Reference Architecture | Accepted | 2026-04-09 |
-| [0028](ADR-0028-staged-deployment-bluegreen-canary.md) | Staged Deployment Blue/Green with Canary Verification | Accepted | 2026-04-09 |
+| [0028](ADR-0028-staged-deployment-bluegreen-canary.md) | Staged Deployment - Blue/Green with Canary Verification | Accepted | 2026-04-09 |
 | [0029](ADR-0029-oidc-mfa-pluggable-identity.md) | OIDC/SSO Integration with Optional TOTP MFA | Accepted | 2026-04-09 |