From 51de500b219e3ee372884c6483b491fbd3c722d2 Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:22:13 +0100 Subject: [PATCH 01/10] Add ADR-0023: Cloud cost observability and budget-guardrail automation Document the decision to establish proactive cost observability for Taskdeck's cloud transition. Covers three-layer approach (telemetry, budget alerts, feature-level hotspot tracking), alternatives considered, and consequences. --- .../ADR-0023-cloud-cost-observability.md | 71 +++++++++++++++++++ 1 file changed, 71 insertions(+) create mode 100644 docs/decisions/ADR-0023-cloud-cost-observability.md diff --git a/docs/decisions/ADR-0023-cloud-cost-observability.md b/docs/decisions/ADR-0023-cloud-cost-observability.md new file mode 100644 index 000000000..f63f3ea27 --- /dev/null +++ b/docs/decisions/ADR-0023-cloud-cost-observability.md @@ -0,0 +1,71 @@ +# ADR-0023: Cloud Cost Observability and Budget-Guardrail Automation + +- **Status**: Accepted +- **Date**: 2026-04-09 +- **Deciders**: Project maintainers + +## Context + +Taskdeck is transitioning from a purely local-first SQLite tool to a cloud-hosted deployment model (see ADR-0014, platform expansion strategy). Cloud hosting introduces ongoing variable costs that do not exist in local-first operation: compute instances, LLM API calls, storage growth, logging/telemetry volume, and network egress. + +Three characteristics make proactive cost observability essential: + +1. **LLM API calls are high-variance**: A single user session with tool-calling can generate 5+ provider round-trips. With OpenAI GPT-4o-mini at ~$0.00088 per 3-round conversation (documented in SPIKE_618), costs scale unpredictably with user adoption and chat complexity. + +2. **Local-first heritage means no existing cloud cost discipline**: The team has never operated cloud infrastructure at scale. Without explicit budget guardrails, cost surprises are likely during the v0.2.0 cloud launch. + +3. 
**Several features have superlinear cost scaling**: Logging volume, LLM token consumption, database storage, and SignalR connection counts all grow faster than user count under realistic usage patterns. + +Issue #104 (OPS-12) requires establishing cost visibility, budget alerting, and mitigation playbooks before cloud deployment begins. + +## Decision + +Establish a proactive cloud cost observability framework with three layers: + +1. **Cost telemetry and dashboards**: Define cost dimensions (compute, storage, LLM API, logging, network), track them through cloud provider billing APIs and application-level metrics, and maintain a monthly cost review workflow. + +2. **Budget alert thresholds**: Implement tiered alerting at 70% (warning), 90% (critical), and 100% (hard cap) of monthly budget. Alerts route to documented owners with escalation paths. + +3. **Feature-level cost hotspot registry**: Maintain a living document mapping high-variance features to their cost drivers, scaling behavior, mitigation levers, and action owners. This registry is reviewed monthly alongside the cost dashboard. + +Supporting artifacts: +- `docs/ops/CLOUD_COST_OBSERVABILITY.md` — framework, dimensions, review workflow +- `docs/ops/COST_HOTSPOT_REGISTRY.md` — feature-level cost risk tracking +- `docs/ops/BUDGET_BREACH_RUNBOOK.md` — detection-to-resolution playbook + +## Alternatives Considered + +- **Reactive-only cost management**: Wait for cost surprises and address them as incidents. Rejected because LLM API costs can spike rapidly (a bug enabling unbounded tool-calling loops could exhaust a monthly budget in hours), and cloud provider billing is typically delayed 4-24 hours. + +- **Third-party cost management platform (e.g., Kubecost, Vantage, CloudHealth)**: Adds operational complexity and cost. The current single-node deployment (see `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md`) does not justify a dedicated cost management tool. 
Revisit when multi-node or multi-cloud deployment is in scope. + +- **Cloud provider native budgets only (AWS Budgets)**: Necessary but insufficient. AWS Budgets alone cannot correlate application-level behavior (e.g., which feature or user is driving LLM cost) with billing data. The framework uses provider budgets as the alerting backbone while adding application-level cost attribution. + +- **Hard spending caps with automatic shutdown**: Too aggressive for a product with active users. The framework uses graduated mitigation (rate-limit, degrade, scale-down) rather than hard shutdown, preserving non-LLM functionality during cost incidents. + +## Consequences + +**Positive**: +- Cost surprises during v0.2.0 cloud launch are caught early through tiered alerts. +- Monthly review cadence creates institutional knowledge about cost trends before they become emergencies. +- Feature owners have explicit accountability for cost-impacting decisions. +- Budget breach runbook reduces mean-time-to-mitigate for cost incidents. + +**Negative**: +- Monthly review workflow adds operational overhead (estimated 30-60 minutes per review). +- Cost estimates in the hotspot registry are approximations that require calibration against real production data. +- Alert thresholds may need tuning during initial cloud operation — too sensitive causes alert fatigue, too loose defeats the purpose. + +**Neutral**: +- Cost observability artifacts become part of the ops documentation surface that must be maintained alongside infrastructure changes. +- The framework is cloud-provider-aware (AWS-focused given the Terraform baseline) but the principles are portable. 
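The tiered thresholds in the decision can be sketched as a small evaluation routine. This is an illustrative Python sketch only — the Taskdeck codebase is .NET, and `classify_spend` is a hypothetical name, not an existing function:

```python
def classify_spend(spend_usd: float, monthly_budget_usd: float) -> str:
    """Map month-to-date spend onto the ADR's alert tiers (70% / 90% / 100%)."""
    ratio = spend_usd / monthly_budget_usd
    if ratio >= 1.00:
        return "hard-cap"   # execute the budget breach runbook
    if ratio >= 0.90:
        return "critical"   # escalate to on-call
    if ratio >= 0.70:
        return "warning"    # notify the cost-dimension owner
    return "ok"

# $285 month-to-date against the suggested $300 prod budget
print(classify_spend(285, 300))  # -> critical
```

The graduated tiers (rather than a single hard cap) are what allow mitigation to begin before spend actually exhausts the budget.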
+ +## References + +- Issue: #104 (OPS-12: Cloud cost observability and budget-guardrail automation) +- Terraform baseline: `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md` (#102) +- Observability baseline: `docs/ops/OBSERVABILITY_BASELINE.md` (#68) +- LLM cost context: `docs/spikes/SPIKE_618_COMPLETED.md` (tool-calling cost model) +- Managed-key quota policy: `docs/security/MANAGED_KEY_USAGE_POLICY.md` (#240) +- Platform expansion strategy: ADR-0014 +- Disaster recovery runbook: `docs/ops/DISASTER_RECOVERY_RUNBOOK.md` (#86) From 522d946eae1e13a68d11907f59261c7ad95ff1f6 Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:22:32 +0100 Subject: [PATCH 02/10] Add ADR-0023 to ADR index --- docs/decisions/INDEX.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/decisions/INDEX.md b/docs/decisions/INDEX.md index 3af5d87d4..7121df1eb 100644 --- a/docs/decisions/INDEX.md +++ b/docs/decisions/INDEX.md @@ -24,3 +24,4 @@ | [0020](ADR-0020-plugin-extension-architecture.md) | Plugin/Extension Architecture RFC and Sandboxing Constraints | Proposed | 2026-04-01 | | [0021](ADR-0021-jwt-invalidation-user-active-middleware.md) | JWT Invalidation — User-Active Middleware over Token Blocklist | Accepted | 2026-04-03 | | [0022](ADR-0022-analytics-export-csv-first-pdf-deferred.md) | Analytics Export — CSV First, PDF Deferred | Accepted | 2026-04-08 | +| [0023](ADR-0023-cloud-cost-observability.md) | Cloud Cost Observability and Budget-Guardrail Automation | Accepted | 2026-04-09 | From 6a0c68bc364fcfeb6b73f67136dd4fb1b528ee12 Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:23:56 +0100 Subject: [PATCH 03/10] Add cloud cost observability framework Define cost telemetry dimensions (compute, storage, LLM API, logging, network, CI/CD), three-tier budget alert thresholds (70%/90%/100%), monthly review workflow with checklist, anomaly triage process, dashboard recommendations, and Terraform budget alert template. 
--- docs/ops/CLOUD_COST_OBSERVABILITY.md | 273 +++++++++++++++++++++++++++ 1 file changed, 273 insertions(+) create mode 100644 docs/ops/CLOUD_COST_OBSERVABILITY.md diff --git a/docs/ops/CLOUD_COST_OBSERVABILITY.md b/docs/ops/CLOUD_COST_OBSERVABILITY.md new file mode 100644 index 000000000..26d21fa41 --- /dev/null +++ b/docs/ops/CLOUD_COST_OBSERVABILITY.md @@ -0,0 +1,273 @@ +# Cloud Cost Observability Framework + +Last Updated: 2026-04-09 +Issue: `#104` OPS-12 Cloud cost observability and budget-guardrail automation +ADR: ADR-0023 + +--- + +## Purpose + +Define the cost telemetry dimensions, budget alert thresholds, monthly review workflow, and anomaly triage process for Taskdeck cloud deployments. This framework applies once Taskdeck moves beyond local-first operation into hosted environments (v0.2.0+). + +--- + +## Cost Telemetry Dimensions + +Cloud costs are tracked across six dimensions. Each dimension maps to a billing line item, an application-level metric (where applicable), and a dashboard panel. + +### 1. Compute (EC2 / Container Hosting) + +| Attribute | Value | +|---|---| +| Billing source | AWS EC2 on-demand or reserved instance hours | +| Current baseline | Single `t3.medium` (dev), `t3.large` (staging/prod) per `DEPLOYMENT_TERRAFORM_BASELINE.md` | +| Application metric | None (infrastructure-level only) | +| Estimated monthly cost | $30-70 (single-node, on-demand) | +| Scaling driver | User concurrency, background worker load | + +### 2. Storage (EBS + S3) + +| Attribute | Value | +|---|---| +| Billing source | EBS volume (gp3) + S3 backup bucket | +| Current baseline | 20-50 GB EBS for SQLite, S3 with 90-day noncurrent version expiry | +| Application metric | Database file size (via health endpoint), S3 object count | +| Estimated monthly cost | $5-15 (EBS) + $1-5 (S3) | +| Scaling driver | Board/card/audit data volume, backup frequency, export artifact retention | + +### 3. 
LLM API Calls (OpenAI / Gemini) + +| Attribute | Value | +|---|---| +| Billing source | Provider API usage (OpenAI, Google Gemini) | +| Application metric | `ILlmQuotaService` token usage records, `taskdeck.llm.tokens.used` | +| Current baseline | GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens; Gemini 2.5 Flash: ~$0.15/1M input tokens, ~$0.60/1M output tokens | +| Estimated monthly cost | $5-50 (light usage, 10-50 active users) to $200-500 (heavy usage, 100+ users with tool-calling) | +| Scaling driver | Chat messages per user, tool-calling rounds per message (max 5), capture triage volume | + +LLM costs are the highest-variance dimension. See `docs/ops/COST_HOTSPOT_REGISTRY.md` for detailed breakdown. + +### 4. Logging and Telemetry + +| Attribute | Value | +|---|---| +| Billing source | CloudWatch Logs ingestion/storage, or OTLP-compatible backend (Grafana Cloud, Datadog) | +| Application metric | Log bytes per request (estimated from `Observability:*` config) | +| Current baseline | OpenTelemetry traces + metrics via OTLP or console exporter | +| Estimated monthly cost | $5-30 (low-volume, structured logging) to $100-300 (verbose logging, high request volume) | +| Scaling driver | Request volume, log verbosity level, trace sampling rate, metric cardinality | + +### 5. Network (Data Transfer) + +| Attribute | Value | +|---|---| +| Billing source | AWS data transfer out, inter-AZ traffic (if multi-AZ) | +| Application metric | Response payload sizes (approximated from API metrics) | +| Estimated monthly cost | $1-10 (single-AZ, moderate traffic) | +| Scaling driver | API response volume, SignalR WebSocket traffic, export downloads | + +### 6. 
CI/CD and Artifact Storage + +| Attribute | Value | +|---|---| +| Billing source | GitHub Actions minutes, container registry storage | +| Application metric | None (CI platform-level) | +| Estimated monthly cost | $0 (free tier) to $20-50 (heavy CI, private runners) | +| Scaling driver | PR volume, test suite duration, Docker image size and retention | + +--- + +## Budget Alert Thresholds + +Budget alerts use a three-tier model. The monthly budget target is set per environment and reviewed quarterly. + +| Tier | Threshold | Severity | Action | +|---|---|---|---| +| Warning | 70% of monthly budget | Low | Notification to cost-owner; review current spend trajectory | +| Critical | 90% of monthly budget | High | Escalation to on-call; begin mitigation assessment | +| Hard cap | 100% of monthly budget | Critical | Execute mitigation actions from `BUDGET_BREACH_RUNBOOK.md` | + +### Suggested Initial Monthly Budgets + +These are starting points for a small-team deployment. Adjust after the first 2-3 months of production data. + +| Environment | Monthly budget | Rationale | +|---|---|---| +| Dev | $50 | Disposable, minimal usage | +| Staging | $100 | Test workloads, occasional load testing | +| Prod | $300 | 10-50 active users, moderate LLM usage | + +### Alert Configuration + +**AWS Budgets** (primary alerting mechanism for infrastructure costs): + +- Create one AWS Budget per environment with the monthly target above. +- Configure SNS notifications at 70%, 90%, and 100% thresholds. +- Route SNS to email (initially) or PagerDuty/Slack (when available). + +**Application-level LLM cost alerts** (supplementary): + +- The existing `ILlmQuotaService` tracks per-user token consumption. +- Add a daily aggregate check: if total LLM token spend across all users exceeds `(monthly_budget * 0.70) / 30` on any single day, emit a warning log and optional webhook notification. 
+- The `LlmQuota:GlobalBudgetCeilingTokens` config key provides a hard daily ceiling (see `docs/security/MANAGED_KEY_USAGE_POLICY.md`). + +### Alert Owners + +| Cost dimension | Primary owner | Escalation | +|---|---|---| +| Compute | Infrastructure lead | Project maintainers | +| Storage | Infrastructure lead | Project maintainers | +| LLM API | Product/backend lead | Project maintainers | +| Logging/telemetry | Infrastructure lead | Project maintainers | +| Network | Infrastructure lead | Project maintainers | +| CI/CD | DevOps lead | Project maintainers | + +For a solo-operator deployment, all ownership defaults to the operator. + +--- + +## Monthly Cost Review Workflow + +Cadence: First working day of each month (or within 3 business days). + +### Pre-Review Checklist + +- [ ] Pull current-month billing summary from cloud provider console +- [ ] Pull LLM token usage summary from `ILlmQuotaService` / application logs +- [ ] Compare actual spend against budget for each dimension +- [ ] Note any dimensions exceeding 70% of their allocation +- [ ] Pull previous month's review notes for trend comparison + +### Review Agenda + +1. **Budget vs. actual**: Review each dimension. Flag any >10% month-over-month increase. +2. **LLM cost deep-dive**: Review per-user and per-feature token consumption. Identify top-5 token consumers. Check tool-calling round counts for anomalies. +3. **Storage growth**: Check SQLite database size trend. Review S3 backup object count and total size. Verify noncurrent version expiry is working. +4. **Logging volume**: Review CloudWatch / OTLP ingestion volume. Check for noisy log sources (e.g., verbose middleware, high-cardinality trace attributes). +5. **Anomaly review**: Investigate any alerts fired during the month. Were they true anomalies or expected spikes? +6. **Hotspot registry update**: Review `docs/ops/COST_HOTSPOT_REGISTRY.md`. Update estimates with actual data. Add new hotspots if discovered. +7. 
**Action items**: Document mitigation actions, budget adjustments, or configuration changes needed. + +### Post-Review Outputs + +- Updated cost trend notes (inline in this document or in a linked tracking issue) +- Updated hotspot registry if estimates changed +- Budget adjustment proposals for next quarter (if needed) +- Action items assigned to specific owners with deadlines + +--- + +## Anomaly Triage Process + +An anomaly is any cost spike that exceeds 150% of the expected daily spend for a dimension, or any alert at the Critical (90%) tier or above. + +### Triage Steps + +1. **Identify the dimension**: Which cost category spiked? (Compute, LLM, Storage, Logging, Network, CI/CD) +2. **Correlate with application events**: Check deployment logs, feature flag changes, traffic patterns, and user activity for the same time window. +3. **Check for known causes**: + - Was there a load test or demo? + - Was a new feature deployed that increases LLM usage? + - Did log verbosity change? + - Is there a runaway background worker? +4. **Assess impact**: Is the spike ongoing or a one-time event? What is the projected monthly impact if it continues? +5. **Decide on action**: + - **Expected and acceptable**: Document in monthly review, adjust budget if needed. + - **Expected but excessive**: Apply mitigation (see `BUDGET_BREACH_RUNBOOK.md`). + - **Unexpected**: Investigate root cause, apply immediate mitigation, file an incident. + +### Escalation Path + +| Severity | Response time | Escalation | +|---|---|---| +| Warning (70%) | Next business day | Cost owner reviews spend trajectory | +| Critical (90%) | Within 4 hours | On-call begins mitigation assessment | +| Hard cap (100%) | Within 1 hour | Execute runbook, notify all stakeholders | + +--- + +## Cost Dashboard + +### Recommended Dashboard Panels + +Deploy alongside the existing observability dashboard (see `docs/ops/OBSERVABILITY_BASELINE.md`). + +1. 
**Monthly spend by dimension** — stacked bar chart, one bar per dimension per month. +2. **Daily spend trend** — line chart showing daily total spend with 70%/90% budget threshold lines. +3. **LLM token consumption** — line chart of daily token usage (input + output), broken down by provider (OpenAI, Gemini, Mock). +4. **LLM cost per user (top 10)** — horizontal bar chart of top token consumers. +5. **Storage growth** — line chart of database file size and S3 total object size over time. +6. **Logging ingestion volume** — line chart of daily log bytes ingested. + +### Implementation Path + +Phase 1 (v0.2.0 launch): AWS Budgets + manual monthly review using AWS Cost Explorer. +Phase 2 (post-launch): Grafana dashboard pulling from CloudWatch Metrics and application-level metrics via OTLP. +Phase 3 (scale-out): Integrate cost attribution tags into Terraform resources for per-feature cost allocation. + +--- + +## Terraform Budget Alert Template + +A sample AWS Budget resource for use in the Terraform baseline: + +```hcl +resource "aws_budgets_budget" "taskdeck_monthly" { + name = "taskdeck-${var.environment}-monthly" + budget_type = "COST" + limit_amount = var.monthly_budget_limit + limit_unit = "USD" + time_unit = "MONTHLY" + + notification { + comparison_operator = "GREATER_THAN" + threshold = 70 + threshold_type = "PERCENTAGE" + notification_type = "ACTUAL" + subscriber_email_addresses = var.budget_alert_emails + } + + notification { + comparison_operator = "GREATER_THAN" + threshold = 90 + threshold_type = "PERCENTAGE" + notification_type = "ACTUAL" + subscriber_email_addresses = var.budget_alert_emails + } + + notification { + comparison_operator = "GREATER_THAN" + threshold = 100 + threshold_type = "PERCENTAGE" + notification_type = "ACTUAL" + subscriber_email_addresses = var.budget_alert_emails + } +} + +variable "monthly_budget_limit" { + description = "Monthly budget limit in USD" + type = string + default = "300" +} + +variable "budget_alert_emails" { + 
description = "Email addresses for budget alert notifications" + type = list(string) +} +``` + +This template can be added to the existing Terraform module at `deploy/terraform/aws/modules/single_node/` when budget alerting is wired into the infrastructure baseline. + +--- + +## References + +- ADR-0023: Cloud Cost Observability and Budget-Guardrail Automation +- Feature cost hotspot registry: `docs/ops/COST_HOTSPOT_REGISTRY.md` +- Budget breach runbook: `docs/ops/BUDGET_BREACH_RUNBOOK.md` +- Observability baseline: `docs/ops/OBSERVABILITY_BASELINE.md` +- Terraform deployment baseline: `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md` +- Managed-key usage policy: `docs/security/MANAGED_KEY_USAGE_POLICY.md` +- LLM provider setup guide: `docs/platform/LLM_PROVIDER_SETUP_GUIDE.md` +- LLM tool-calling cost model: `docs/spikes/SPIKE_618_COMPLETED.md` From c1e16fc6e05abf0baef95d14e855dc0ea74733f3 Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:25:18 +0100 Subject: [PATCH 04/10] Add feature cost hotspot registry Document six high-variance cost features with estimated cost ranges, scaling behavior, current guardrails, mitigation levers, and action owners: LLM API usage, logging/telemetry, database storage, SignalR connections, CI/CD pipelines, and MCP transport. --- docs/ops/COST_HOTSPOT_REGISTRY.md | 184 ++++++++++++++++++++++++++++++ 1 file changed, 184 insertions(+) create mode 100644 docs/ops/COST_HOTSPOT_REGISTRY.md diff --git a/docs/ops/COST_HOTSPOT_REGISTRY.md b/docs/ops/COST_HOTSPOT_REGISTRY.md new file mode 100644 index 000000000..c59247e7d --- /dev/null +++ b/docs/ops/COST_HOTSPOT_REGISTRY.md @@ -0,0 +1,184 @@ +# Feature Cost Hotspot Registry + +Last Updated: 2026-04-09 +Issue: `#104` OPS-12 Cloud cost observability and budget-guardrail automation +Parent: `docs/ops/CLOUD_COST_OBSERVABILITY.md` + +--- + +## Purpose + +Track features with high-variance or superlinear cost scaling. 
Each entry documents the cost driver, estimated cost range, scaling behavior, mitigation levers, and action owner. This registry is reviewed during the monthly cost review (see `CLOUD_COST_OBSERVABILITY.md`). + +--- + +## Hotspot Entry Format + +Each hotspot follows this structure: + +- **Feature**: Name and brief description +- **Cost dimension**: Which billing category is affected +- **Estimated cost range**: Low/high monthly estimate for the expected user base +- **Scaling behavior**: How cost grows relative to users/usage +- **Current guardrails**: What controls already exist +- **Mitigation levers**: Actions available to reduce cost +- **Action owner**: Who is responsible for monitoring and mitigation +- **Risk level**: Low / Medium / High / Critical + +--- + +## Hotspot 1: LLM API Usage (Chat and Capture Triage) + +| Attribute | Detail | +|---|---| +| Feature | Automation Chat (`ChatService`), capture triage (`LlmQueueToProposalWorker`), tool-calling orchestrator | +| Cost dimension | LLM API (OpenAI / Gemini) | +| Estimated cost range | $5-50/month (10-50 users, light chat) to $200-500/month (100+ users, heavy tool-calling) | +| Scaling behavior | **Superlinear** — each chat message may trigger 1-5 tool-calling rounds, each round is a full API call with growing context window. A single complex conversation can cost 5-10x a simple one. Capture triage adds per-item LLM cost. | +| Current guardrails | Per-user rate limit: 60 req/hr. Per-user token limit: 100K tokens/day. Global budget ceiling config (`LlmQuota:GlobalBudgetCeilingTokens`). Tool-calling loop cap: 5 rounds, 60s timeout. Tool result truncation: 8KB max. Kill-switch (global/surface/per-user). Mock provider default (zero cost). | +| Mitigation levers | 1. Reduce `LlmToolCalling:MaxRounds` (default 5 → 3). 2. Lower per-user token daily limit. 3. Switch high-volume users to Mock provider. 4. Activate surface-level kill-switch for Chat or CaptureTriage. 5. 
Reduce context window size (`BoardContextBuilder` budget). 6. Switch from GPT-4o-mini to a cheaper model. 7. Enable clarification detection to reduce wasted rounds (`ClarificationDetector`). | +| Action owner | Product/backend lead | +| Risk level | **High** — highest variance cost component with no natural ceiling per conversation | + +### Per-Request Cost Estimates (as of 2026-04) + +| Scenario | Input tokens | Output tokens | Estimated cost (GPT-4o-mini) | +|---|---|---|---| +| Simple chat (no tools) | ~500 | ~200 | ~$0.00020 | +| Chat with 1 read tool | ~1,200 | ~400 | ~$0.00042 | +| Chat with 3 tool rounds | ~3,000 | ~800 | ~$0.00093 | +| Chat with 5 tool rounds (max) | ~5,500 | ~1,200 | ~$0.00155 | +| Capture triage (per item) | ~300 | ~150 | ~$0.00014 | + +These estimates assume GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output). Gemini 2.5 Flash has similar pricing. Actual costs depend on conversation length, board context size, and tool result sizes. + +### Monthly Projections + +| Usage level | Users | Messages/user/day | Tool rounds/msg | Monthly LLM cost | +|---|---|---|---|---| +| Light | 10 | 5 | 1.5 avg | ~$8 | +| Moderate | 50 | 10 | 2.0 avg | ~$85 | +| Heavy | 100 | 15 | 2.5 avg | ~$350 | +| Peak (with triage) | 100 | 15 + 20 triage | 2.5 avg | ~$430 | + +--- + +## Hotspot 2: Logging and Telemetry Volume + +| Attribute | Detail | +|---|---| +| Feature | OpenTelemetry traces/metrics, application logs, request correlation | +| Cost dimension | Logging / telemetry (CloudWatch, Grafana Cloud, or OTLP backend) | +| Estimated cost range | $5-30/month (structured, sampled) to $100-300/month (verbose, unsampled) | +| Scaling behavior | **Linear to superlinear** — log volume scales with request count. Verbose logging (DEBUG level) or high-cardinality trace attributes can cause 10-50x volume increase. Tool-calling conversations generate multiple log entries per round. | +| Current guardrails | Configurable log level. 
Security logging redaction baseline (sanitized exceptions, generic error messages). Configurable OTLP exporter. Metric export interval configurable. |
+| Mitigation levers | 1. Set log level to `Warning` or `Error` in production. 2. Enable trace sampling (e.g., 10% of requests). 3. Reduce metric export frequency by raising `MetricExportIntervalSeconds`. 4. Set CloudWatch log retention to 14-30 days (not indefinite). 5. Exclude health-check endpoints from trace collection. 6. Cap log line length for tool-call results. |
+| Action owner | Infrastructure lead |
+| Risk level | **Medium** — predictable at low volume but can spike with verbose config or traffic surges |
+
+### Retention Policy Recommendations
+
+| Log type | Retention | Rationale |
+|---|---|---|
+| Application logs (INFO+) | 30 days | Sufficient for operational debugging |
+| Application logs (DEBUG) | 7 days | Only enabled during active investigation |
+| Trace data | 14 days | Covers typical incident investigation window |
+| Metrics | 90 days | Supports monthly trend analysis |
+| Audit trail (application-level) | Indefinite (in SQLite) | Compliance and provenance requirements |
+
+---
+
+## Hotspot 3: Database Storage Growth (SQLite / EBS)
+
+| Attribute | Detail |
+|---|---|
+| Feature | SQLite database (boards, cards, audit trail, chat history, proposals, notifications) |
+| Cost dimension | Storage (EBS volume) |
+| Estimated cost range | $5-15/month (20-50 GB gp3 EBS) |
+| Scaling behavior | **Sublinear initially, linear long-term** — audit trail and chat history grow with every operation. Without archival, database size grows indefinitely. SQLite VACUUM can reclaim space from deletions. |
+| Current guardrails | S3 backup with 90-day noncurrent version expiry. EBS destroy protection on staging/prod. Account deletion anonymizes PII but does not reclaim space. |
+| Mitigation levers | 1. Implement periodic SQLite VACUUM (reclaim deleted space). 2. 
Archive old audit trail entries to cold storage (S3 Glacier). 3. Set chat history retention limit (e.g., 90 days). 4. Compress old export artifacts. 5. Monitor EBS usage and resize proactively. 6. Enable WAL checkpointing to control WAL file growth. | +| Action owner | Infrastructure lead | +| Risk level | **Low** — predictable growth, but uncapped audit trail could become significant over years | + +### Growth Estimates + +| Data type | Estimated size per record | Records/user/month | 100 users, 12 months | +|---|---|---|---| +| Cards | ~2 KB | 50 | ~120 MB | +| Audit entries | ~500 bytes | 200 | ~120 MB | +| Chat messages | ~1 KB | 150 | ~180 MB | +| Proposals | ~1 KB | 30 | ~36 MB | +| Notifications | ~500 bytes | 100 | ~60 MB | +| **Total estimate** | | | **~516 MB** | + +SQLite overhead and indexes add approximately 30-50%, bringing the estimated 12-month database size for 100 users to approximately 700 MB - 1 GB. + +--- + +## Hotspot 4: SignalR Connection Overhead + +| Attribute | Detail | +|---|---| +| Feature | SignalR WebSocket connections for realtime board collaboration | +| Cost dimension | Compute (memory per connection), network (WebSocket frames) | +| Estimated cost range | Negligible at current scale ($0-5/month additional compute) | +| Scaling behavior | **Linear** — each connected user maintains one persistent WebSocket. Memory: ~50-100 KB per connection. Network: minimal for idle connections, increases with board mutation frequency. | +| Current guardrails | Single-node in-process SignalR (no external backplane). Board-scoped subscription authorization. Polling fallback when WebSocket unavailable. | +| Mitigation levers | 1. Implement idle connection timeout (disconnect after N minutes of inactivity). 2. Batch board mutation events (debounce rapid-fire updates). 3. Move to Azure SignalR Service or Redis backplane for scale-out (cost shifts from compute to managed service). 4. Rate-limit SignalR event frequency per board. 
| +| Action owner | Backend lead | +| Risk level | **Low** — negligible cost at single-node scale; becomes relevant at 500+ concurrent connections | + +--- + +## Hotspot 5: CI/CD Pipeline and Artifact Storage + +| Attribute | Detail | +|---|---| +| Feature | GitHub Actions CI (`ci-required.yml`, `ci-nightly.yml`, `ci-extended.yml`), Docker image builds | +| Cost dimension | CI/CD (GitHub Actions minutes, container registry) | +| Estimated cost range | $0/month (free tier, public repo) to $20-50/month (private repo, heavy CI) | +| Scaling behavior | **Step function** — cost jumps when exceeding free-tier minutes (2,000 min/month for free, 3,000 for Pro). Docker image storage grows with image count and tag retention. | +| Current guardrails | CI-required is the PR gate (lightweight). CI-extended auto-triggers on infrastructure changes. CI-nightly runs extended checks. | +| Mitigation levers | 1. Prune old Docker images (keep last N tags). 2. Use GitHub Actions caching for dependency restore. 3. Reduce nightly CI frequency if cost is a concern. 4. Use smaller runners for doc-only PRs. 5. Set container registry retention policies. | +| Action owner | DevOps lead | +| Risk level | **Low** — predictable and within free-tier for most open-source projects | + +--- + +## Hotspot 6: MCP HTTP Transport and API Key Usage + +| Attribute | Detail | +|---|---| +| Feature | MCP HTTP endpoint (`/mcp`), API key authentication, external tool integrations | +| Cost dimension | Compute (request processing), LLM API (if MCP tools trigger LLM calls) | +| Estimated cost range | $0-10/month (direct compute cost negligible); LLM cost depends on tool usage patterns | +| Scaling behavior | **Linear with external integration frequency** — each MCP tool call is an HTTP request. Write tools that produce proposals may trigger LLM downstream. Rate limited at 60 req/60s per API key. | +| Current guardrails | API key rate limiting (60 req/60s). 
Write tools produce proposals (no direct board mutation). `approve_proposal` intentionally excluded from MCP. | +| Mitigation levers | 1. Reduce per-key rate limit. 2. Revoke unused API keys. 3. Disable MCP HTTP transport when not needed. 4. Audit API key usage patterns monthly. | +| Action owner | Product/backend lead | +| Risk level | **Low** — rate-limited and proposal-gated; indirect LLM cost is covered by Hotspot 1 | + +--- + +## Review Schedule + +This registry is reviewed during the monthly cost review (first working day of each month). + +Updates required when: +- A new feature with potential cost impact is shipped +- Actual costs significantly deviate from estimates (>50% delta) +- Mitigation levers are exercised (document what was changed and the effect) +- New cost dimensions are identified (e.g., DNS, CDN, managed database) + +--- + +## References + +- Cloud cost observability framework: `docs/ops/CLOUD_COST_OBSERVABILITY.md` +- Budget breach runbook: `docs/ops/BUDGET_BREACH_RUNBOOK.md` +- LLM tool-calling cost model: `docs/spikes/SPIKE_618_COMPLETED.md` +- Managed-key usage policy: `docs/security/MANAGED_KEY_USAGE_POLICY.md` +- Managed-key incident runbook: `docs/security/MANAGED_KEY_INCIDENT_RUNBOOK.md` +- Observability baseline: `docs/ops/OBSERVABILITY_BASELINE.md` +- Terraform deployment baseline: `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md` From 241c27e6af9a6fe96dd35aa0fc0ac279e88f4009 Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:26:29 +0100 Subject: [PATCH 05/10] Add budget breach runbook Five-phase playbook covering detection, triage (with decision tree for LLM/logging/compute/storage root causes), graduated mitigation actions, stabilization checks, and post-incident review process. Includes quick-reference emergency actions table. 
--- docs/ops/BUDGET_BREACH_RUNBOOK.md | 218 ++++++++++++++++++++++++++++++ 1 file changed, 218 insertions(+) create mode 100644 docs/ops/BUDGET_BREACH_RUNBOOK.md diff --git a/docs/ops/BUDGET_BREACH_RUNBOOK.md b/docs/ops/BUDGET_BREACH_RUNBOOK.md new file mode 100644 index 000000000..ec06fa0dc --- /dev/null +++ b/docs/ops/BUDGET_BREACH_RUNBOOK.md @@ -0,0 +1,218 @@ +# Budget Breach Runbook + +Last Updated: 2026-04-09 +Issue: `#104` OPS-12 Cloud cost observability and budget-guardrail automation +Parent: `docs/ops/CLOUD_COST_OBSERVABILITY.md` + +--- + +## Purpose + +Step-by-step playbook for responding to cloud cost budget breaches. Covers detection, triage, mitigation, and post-incident review. This runbook is triggered when budget alerts fire at the Critical (90%) or Hard Cap (100%) tier. + +--- + +## Severity Definitions + +| Severity | Trigger | Response time | Owner | +|---|---|---|---| +| Warning | 70% of monthly budget reached | Next business day | Cost dimension owner | +| Critical | 90% of monthly budget reached | Within 4 hours | On-call + cost dimension owner | +| Hard cap | 100% of monthly budget reached | Within 1 hour | On-call + all stakeholders | + +--- + +## Phase 1: Detection + +Budget breach alerts arrive through one of these channels: + +1. **AWS Budgets SNS notification** — email or integration (Slack/PagerDuty) when infrastructure spend crosses a threshold. +2. **Application-level LLM quota alert** — log warning when daily aggregate LLM token spend exceeds the projected daily share of the monthly budget. +3. **Manual discovery** — spotted during monthly cost review or ad-hoc billing console check. 
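As a quick sanity check, the tier boundaries in the severity table reduce to a simple percentage classification. A minimal sketch in Python — the function and tier strings are illustrative only (Taskdeck is a .NET codebase and the real thresholds live in AWS Budgets configuration):

```python
# Map current spend against the monthly budget to the alert tiers
# defined in this runbook (70% warning, 90% critical, 100% hard cap).
# Names are illustrative, not part of the Taskdeck codebase.

def classify_breach(spend_usd: float, monthly_budget_usd: float):
    """Return the severity tier for the current spend, or None below 70%."""
    pct = 100.0 * spend_usd / monthly_budget_usd
    if pct >= 100.0:
        return "Hard cap"   # respond within 1 hour
    if pct >= 90.0:
        return "Critical"   # respond within 4 hours
    if pct >= 70.0:
        return "Warning"    # next business day
    return None

# Example: $230 spent against a $250 monthly budget is 92% -> Critical.
print(classify_breach(230, 250))
```

A spend that exactly reaches a boundary classifies into the higher tier (exactly 90% is Critical), matching the "reached" wording in the severity table.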
+
+### Detection Checklist
+
+- [ ] Confirm the alert is genuine (not a test or duplicate)
+- [ ] Identify the severity tier (Warning / Critical / Hard Cap)
+- [ ] Identify which cost dimension triggered the alert (Compute, Storage, LLM, Logging, Network, CI/CD)
+- [ ] Record the alert timestamp and current spend amount
+- [ ] Notify the cost dimension owner (see `CLOUD_COST_OBSERVABILITY.md` alert owners table)
+
+---
+
+## Phase 2: Triage
+
+Goal: Determine the root cause and assess ongoing impact within the response time window.
+
+### Triage Decision Tree
+
+```
+Is the cost spike from LLM API usage?
+├── Yes → Go to "LLM Cost Triage"
+└── No → Is the cost spike from logging/telemetry?
+    ├── Yes → Go to "Logging Cost Triage"
+    └── No → Is the cost spike from compute?
+        ├── Yes → Go to "Compute Cost Triage"
+        └── No → Is the cost spike from storage?
+            ├── Yes → Go to "Storage Cost Triage"
+            └── No → Go to "General Cost Triage"
+```
+
+### LLM Cost Triage
+
+1. Check `ILlmQuotaService` usage data for the current period:
+   - Which users are the top token consumers?
+   - Which surface (Chat, CaptureTriage, Worker) is generating the most usage?
+   - Are tool-calling round counts abnormally high?
+2. Check for runaway patterns:
+   - Is a single user or automated integration consuming >30% of total LLM spend?
+   - Are there tool-calling loops (same tool called repeatedly with identical arguments)?
+   - Is the `ClarificationDetector` being bypassed, causing extra rounds?
+3. Check for configuration drift:
+   - Was `LlmToolCalling:MaxRounds` increased from the default?
+   - Was `LlmQuota:GlobalBudgetCeilingTokens` raised or removed?
+   - Was a more expensive model configured (e.g., GPT-4o instead of GPT-4o-mini)?
+4. Check LLM provider dashboard (OpenAI/Gemini) for independent cost confirmation.
+
+### Logging Cost Triage
+
+1. Check CloudWatch / OTLP backend ingestion volume for the current period.
+2. Identify the top log sources by volume (which service, endpoint, or component).
+3. Check if log level was changed (e.g., DEBUG enabled in production).
+4. Check if trace sampling was disabled or the sampling rate was raised (now capturing 100% of traces).
+5. Look for noisy error loops generating repeated log entries.
+
+### Compute Cost Triage
+
+1. Check if the instance type was changed or a larger instance provisioned.
+2. Check CPU and memory utilization — is the instance right-sized?
+3. Check if additional instances were spun up (manual or auto-scaling drift).
+4. Check for zombie processes or stuck background workers consuming resources.
+
+### Storage Cost Triage
+
+1. Check EBS volume size and utilization.
+2. Check S3 bucket size — is the noncurrent version expiry policy working?
+3. Check SQLite database file size — has it grown unexpectedly?
+4. Check for large export artifacts or backup files accumulating.
+
+### General Cost Triage
+
+1. Check AWS Cost Explorer for the top spending services.
+2. Compare current-month daily spend to the previous month's daily average.
+3. Identify any new AWS resources that were not part of the baseline.
+4. Check for data transfer spikes (large export downloads, API abuse).
+
+---
+
+## Phase 3: Mitigation
+
+Apply the minimum effective mitigation for the identified root cause. Prefer graduated response over hard shutdown.
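The graduated-response idea can be sketched as a loop over an ordered lever list, escalating only while the observed spend rate stays above target. A hedged sketch — the lever names echo this runbook's LLM mitigation table, and the `measure_rate` monitoring hook is hypothetical:

```python
# Graduated mitigation: apply levers from least to most disruptive and
# stop escalating once the observed spend rate drops to target. Lever
# names mirror this runbook's LLM mitigation table; measure_rate is a
# hypothetical callback that reports spend rate after a lever is applied.

LLM_LEVERS = [
    "tighten global rate limits",
    "reduce tool-calling rounds",
    "switch to cheaper model",
    "surface kill-switch",
    "per-user kill-switch",
    "global kill-switch",
    "mock provider",
]

def graduated_response(measure_rate, target_rate, levers=LLM_LEVERS):
    """Apply levers in order; return the list of levers actually applied."""
    applied = []
    for lever in levers:
        applied.append(lever)          # operator executes the lever here
        if measure_rate(lever) <= target_rate:
            break                      # minimum effective mitigation reached
    return applied

# Simulated spend rates after each lever: the third lever is sufficient.
rates = {"tighten global rate limits": 5.0,
         "reduce tool-calling rounds": 2.0,
         "switch to cheaper model": 0.5}
print(graduated_response(lambda lever: rates[lever], target_rate=1.0))
```

The point of the sketch is the stopping rule: never jump straight to the hard-shutdown end of the list when an earlier lever already brings the spend rate under target.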
+ +### LLM Cost Mitigation Actions + +Listed from least disruptive to most disruptive: + +| Priority | Action | Impact | How to execute | +|---|---|---|---| +| 1 | Rate-limit top consumers | Affected users get 429 responses | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` for specific users via kill-switch | +| 2 | Reduce tool-calling rounds | Fewer tool calls per conversation, less capable but cheaper | Set `LlmToolCalling:MaxRounds` from 5 to 2-3 via config | +| 3 | Switch to cheaper model | Potentially lower quality responses | Change `Llm:OpenAi:Model` to a cheaper variant | +| 4 | Activate surface kill-switch | One LLM surface disabled (e.g., Chat only) | `POST /api/llm/kill-switch` with `KillSwitchScope: Surface` | +| 5 | Activate per-user kill-switch | Specific abusive user blocked from LLM | `POST /api/llm/kill-switch` with `KillSwitchScope: Identity` | +| 6 | Activate global kill-switch | All LLM features disabled; non-LLM features unaffected | `POST /api/llm/kill-switch` with `KillSwitchScope: Global` | +| 7 | Switch all users to Mock provider | LLM features return deterministic mock responses | Set `Llm:Provider` to `Mock`, restart API | + +### Logging Cost Mitigation Actions + +| Priority | Action | Impact | How to execute | +|---|---|---|---| +| 1 | Reduce log retention | Older logs deleted sooner | Set CloudWatch log group retention to 7-14 days | +| 2 | Increase log level to Warning | INFO logs no longer ingested | Set `Logging:LogLevel:Default` to `Warning` in appsettings | +| 3 | Enable trace sampling | Fewer traces captured | Configure OTLP trace sampling rate (e.g., 10%) | +| 4 | Exclude noisy endpoints | Health checks and high-frequency endpoints stop generating traces | Add endpoint filter to OpenTelemetry configuration | +| 5 | Disable OTLP exporter | No traces or metrics exported | Set `Observability:EnableOpenTelemetry` to `false` | + +### Compute Cost Mitigation Actions + +| Priority | Action | Impact | How to execute | 
+|---|---|---|---| +| 1 | Right-size the instance | May reduce performance headroom | Change `instance_type` in Terraform and apply | +| 2 | Stop non-critical services | Reduced functionality | Stop staging environment if not in active use | +| 3 | Switch to reserved instances | Commitment required, ~30-60% savings | Purchase reserved instance via AWS console | + +### Storage Cost Mitigation Actions + +| Priority | Action | Impact | How to execute | +|---|---|---|---| +| 1 | Run SQLite VACUUM | Reclaims space from deleted records, brief lock | `sqlite3 /var/lib/taskdeck/taskdeck.db "VACUUM;"` | +| 2 | Reduce S3 version retention | Fewer backup versions kept | Lower noncurrent version expiry from 90 days | +| 3 | Delete old export artifacts | Users lose access to old exports | Implement S3 lifecycle rule for export objects | +| 4 | Archive old data | Audit trail or chat history moved to cold storage | Implement data archival pipeline (future work) | + +--- + +## Phase 4: Stabilization + +After mitigation is applied: + +1. **Verify the mitigation is effective**: Monitor the cost dimension for 1-2 hours to confirm the spend rate has decreased. +2. **Communicate the change**: Notify affected users if features were degraded (e.g., LLM kill-switch, reduced log retention). +3. **Document what happened**: Record the incident in a brief post-incident note: + - What triggered the breach? + - What was the root cause? + - What mitigation was applied? + - What was the estimated cost impact? + - What is the plan to prevent recurrence? +4. **Set a review date**: Schedule a follow-up within 1 week to assess whether the mitigation can be relaxed or needs to become permanent. + +--- + +## Phase 5: Post-Incident Review + +Conduct within 5 business days of the incident. + +### Review Checklist + +- [ ] Was the alert timely? Did the team respond within the target window? +- [ ] Was the triage process effective? Did we identify the root cause quickly? 
+- [ ] Was the mitigation proportionate? Did we apply the minimum necessary disruption? +- [ ] What configuration or architectural change would prevent this class of breach? +- [ ] Does the monthly budget need adjustment (was it set too low, or is usage genuinely growing)? +- [ ] Does the hotspot registry need updating with new data? +- [ ] Are there new mitigation levers that should be documented? + +### Outputs + +- Updated `COST_HOTSPOT_REGISTRY.md` with actual cost data from the incident +- Budget adjustment proposal if the current budget is unrealistic +- Action items for preventive changes (filed as GitHub issues) +- Updated alert thresholds if the current ones are too sensitive or too loose + +--- + +## Quick Reference: Emergency Actions + +For use when immediate action is needed and there is no time for full triage: + +| Scenario | Immediate action | Command / Config | +|---|---|---| +| LLM cost runaway | Activate global kill-switch | `POST /api/llm/kill-switch` — `{ "scope": "Global", "active": true, "reason": "Cost emergency" }` | +| Logging cost spike | Raise log level to Error | Set `Logging:LogLevel:Default` to `Error`, restart API | +| Storage filling up | Identify and remove large files | `du -sh /var/lib/taskdeck/*` then assess | +| Unknown cost source | Check AWS Cost Explorer | AWS Console → Billing → Cost Explorer → Group by Service | + +--- + +## References + +- Cloud cost observability framework: `docs/ops/CLOUD_COST_OBSERVABILITY.md` +- Feature cost hotspot registry: `docs/ops/COST_HOTSPOT_REGISTRY.md` +- Disaster recovery runbook: `docs/ops/DISASTER_RECOVERY_RUNBOOK.md` +- Managed-key incident runbook: `docs/security/MANAGED_KEY_INCIDENT_RUNBOOK.md` +- Managed-key usage policy: `docs/security/MANAGED_KEY_USAGE_POLICY.md` +- LLM provider setup guide: `docs/platform/LLM_PROVIDER_SETUP_GUIDE.md` +- Observability baseline: `docs/ops/OBSERVABILITY_BASELINE.md` From bc35d0874ca44b27bae4346d6df4fd5d33e07c35 Mon Sep 17 00:00:00 2001 From: Chris0Jeky 
Date: Thu, 9 Apr 2026 03:27:27 +0100 Subject: [PATCH 06/10] Update STATUS.md to reflect cost guardrails delivery Mark #104 as delivered with summary of framework, hotspot registry, runbook, and ADR-0023. --- docs/STATUS.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/STATUS.md b/docs/STATUS.md index ebda157ca..85540643b 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -937,7 +937,8 @@ Observability and scalability: - out-of-code/platform execution is now tracked, but not yet fully shipped: - production DB migration strategy (`#84`) and distributed cache strategy (`#85`) - backup/restore disaster-recovery playbook (`#86`) - - staged rollout policy (`#101`), SBOM/provenance (`#103`), cost guardrails (`#104`) + - staged rollout policy (`#101`), SBOM/provenance (`#103`) + - ~~cost guardrails (`#104`)~~ **delivered** (2026-04-09): cloud cost observability framework with six cost dimensions (compute, storage, LLM API, logging, network, CI/CD), three-tier budget alert thresholds (70%/90%/100%), monthly cost review workflow with checklist, feature cost hotspot registry covering 6 high-variance features (LLM API, logging, database, SignalR, CI/CD, MCP transport), budget breach runbook with detection-triage-mitigation-review phases, Terraform budget alert template, and ADR-0023 - cloud target topology and autoscaling ADR (`#111`) UX and operability (reconciled from product notes): From a69e3227e8a49dafa70cc556d63e61481dc5bb2a Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:27:33 +0100 Subject: [PATCH 07/10] Update IMPLEMENTATION_MASTERPLAN.md to mark #104 delivered Update Out-of-Code coverage matrix and Priority IV backlog entries for cost guardrails issue. 
--- docs/IMPLEMENTATION_MASTERPLAN.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/IMPLEMENTATION_MASTERPLAN.md b/docs/IMPLEMENTATION_MASTERPLAN.md index 40d1642d7..0a4897b27 100644 --- a/docs/IMPLEMENTATION_MASTERPLAN.md +++ b/docs/IMPLEMENTATION_MASTERPLAN.md @@ -953,7 +953,7 @@ Seeded from `docs/strategy/00_MASTER_STRATEGY.md` and companion pillar documents ### Priority IV (Expansion Tranche: Platform, Test, UX, Docs Maturity) -- Platform and ops maturity: `#84`, `#85`, `#86`, `#101`, `#102`, `#103`, `#104`, `#105`, `#111` +- Platform and ops maturity: `#84`, `#85`, `#86`, `#101`, `#102`, `#103`, ~~`#104`~~ (delivered), `#105`, `#111` - Test maturity: `#87`, `#88`, `#89` (property/fuzz pilot delivered; extended by `#717`), `#90`, `#91`; rigorous expansion wave tracker at `#721` - UX and onboarding maturity: `#92`, `#93`, `#94`, `#95` - Frontend responsiveness maturity: `#213` @@ -1005,7 +1005,7 @@ Covered by seeded issues: - MCP integration smoke/regression harness: `#141` (delivered) - Staged rollout policy (blue/green/canary): `#101` - SBOM/release provenance: `#103` -- Cost guardrails: `#104` +- Cost guardrails: `#104` (delivered 2026-04-09): cloud cost observability framework, feature cost hotspot registry, budget breach runbook, ADR-0023 - Backup/restore disaster recovery: `#86` - OpenTelemetry metrics/tracing and alerting runbook: `#68` - Load/concurrency harness and budgets: `#70` (delivered) From c3e1a58546306bd55d8141249f584a9bfbd6f084 Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:31:13 +0100 Subject: [PATCH 08/10] Fix adversarial self-review findings - Add LLM pricing verification caveat (prices are reference baselines, verify against current provider pages at deployment time) - Add DNS/Route 53 to network cost dimension - Add time_period_start and cost_filter note to Terraform budget template - Fix runbook mitigation action 1 to clarify global vs per-user controls - Correct superlinear scaling claim 
in ADR-0023 (SignalR is linear) - Reorder alert ownership to lead with solo-operator default - Add SQLite VACUUM safety caveat (exclusive lock, temp disk doubling) --- .../ADR-0023-cloud-cost-observability.md | 2 +- docs/ops/BUDGET_BREACH_RUNBOOK.md | 4 ++-- docs/ops/CLOUD_COST_OBSERVABILITY.md | 21 +++++++++++++------ docs/ops/COST_HOTSPOT_REGISTRY.md | 2 +- 4 files changed, 19 insertions(+), 10 deletions(-) diff --git a/docs/decisions/ADR-0023-cloud-cost-observability.md b/docs/decisions/ADR-0023-cloud-cost-observability.md index f63f3ea27..f6d13ea28 100644 --- a/docs/decisions/ADR-0023-cloud-cost-observability.md +++ b/docs/decisions/ADR-0023-cloud-cost-observability.md @@ -14,7 +14,7 @@ Three characteristics make proactive cost observability essential: 2. **Local-first heritage means no existing cloud cost discipline**: The team has never operated cloud infrastructure at scale. Without explicit budget guardrails, cost surprises are likely during the v0.2.0 cloud launch. -3. **Several features have superlinear cost scaling**: Logging volume, LLM token consumption, database storage, and SignalR connection counts all grow faster than user count under realistic usage patterns. +3. **Several features have superlinear or high-variance cost scaling**: LLM token consumption grows superlinearly with usage (tool-calling multiplies per-message cost), logging volume scales with request count and verbosity configuration, and database storage grows continuously with audit trail accumulation. Even linearly-scaling features like SignalR connections become cost-relevant at scale. Issue #104 (OPS-12) requires establishing cost visibility, budget alerting, and mitigation playbooks before cloud deployment begins. 
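To make the ADR's superlinear claim concrete, a toy cost model (not Taskdeck code): each tool-calling round resends the growing conversation context as input tokens, so per-conversation cost grows faster than the round count. Pricing uses the approximate GPT-4o-mini reference rates quoted in these documents; the token counts are illustrative assumptions.

```python
# Toy model of tool-calling cost growth: every round re-sends the
# (growing) conversation context as input tokens, so total cost rises
# superlinearly in the number of rounds. Rates are the approximate
# GPT-4o-mini reference prices from this doc set ($0.15 / $0.60 per
# 1M input / output tokens); token counts are illustrative.

INPUT_PER_M, OUTPUT_PER_M = 0.15, 0.60

def conversation_cost(rounds, base_context=1200, tool_result=600, output=200):
    """Estimated USD cost of one conversation with `rounds` provider calls."""
    cost, context = 0.0, base_context
    for _ in range(rounds):
        cost += context / 1e6 * INPUT_PER_M + output / 1e6 * OUTPUT_PER_M
        context += tool_result + output  # context grows with each tool result
    return cost

# Under this model, five rounds costs well over five times one round.
print(conversation_cost(1), conversation_cost(5))
```

This is why the round cap and tool-result truncation guardrails matter: they bound both the number of calls and how fast the context term grows.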
diff --git a/docs/ops/BUDGET_BREACH_RUNBOOK.md b/docs/ops/BUDGET_BREACH_RUNBOOK.md index ec06fa0dc..0e7fd76d6 100644 --- a/docs/ops/BUDGET_BREACH_RUNBOOK.md +++ b/docs/ops/BUDGET_BREACH_RUNBOOK.md @@ -118,7 +118,7 @@ Listed from least disruptive to most disruptive: | Priority | Action | Impact | How to execute | |---|---|---|---| -| 1 | Rate-limit top consumers | Affected users get 429 responses | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` for specific users via kill-switch | +| 1 | Tighten global rate limits | All users get stricter quotas | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` globally (these are global config keys, not per-user); individual abusive users can be blocked entirely via per-user kill-switch | | 2 | Reduce tool-calling rounds | Fewer tool calls per conversation, less capable but cheaper | Set `LlmToolCalling:MaxRounds` from 5 to 2-3 via config | | 3 | Switch to cheaper model | Potentially lower quality responses | Change `Llm:OpenAi:Model` to a cheaper variant | | 4 | Activate surface kill-switch | One LLM surface disabled (e.g., Chat only) | `POST /api/llm/kill-switch` with `KillSwitchScope: Surface` | @@ -148,7 +148,7 @@ Listed from least disruptive to most disruptive: | Priority | Action | Impact | How to execute | |---|---|---|---| -| 1 | Run SQLite VACUUM | Reclaims space from deleted records, brief lock | `sqlite3 /var/lib/taskdeck/taskdeck.db "VACUUM;"` | +| 1 | Run SQLite VACUUM | Reclaims space from deleted records; requires exclusive lock and temporarily doubles disk usage during execution — schedule during low-traffic window | `sqlite3 /var/lib/taskdeck/taskdeck.db "VACUUM;"` | | 2 | Reduce S3 version retention | Fewer backup versions kept | Lower noncurrent version expiry from 90 days | | 3 | Delete old export artifacts | Users lose access to old exports | Implement S3 lifecycle rule for export objects | | 4 | Archive old data | Audit trail or chat history moved to cold storage | Implement data 
archival pipeline (future work) | diff --git a/docs/ops/CLOUD_COST_OBSERVABILITY.md b/docs/ops/CLOUD_COST_OBSERVABILITY.md index 26d21fa41..25c1c10b1 100644 --- a/docs/ops/CLOUD_COST_OBSERVABILITY.md +++ b/docs/ops/CLOUD_COST_OBSERVABILITY.md @@ -42,7 +42,7 @@ Cloud costs are tracked across six dimensions. Each dimension maps to a billing |---|---| | Billing source | Provider API usage (OpenAI, Google Gemini) | | Application metric | `ILlmQuotaService` token usage records, `taskdeck.llm.tokens.used` | -| Current baseline | GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens; Gemini 2.5 Flash: ~$0.15/1M input tokens, ~$0.60/1M output tokens | +| Current baseline | GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens (reference baseline; verify against current OpenAI pricing). Gemini 2.5 Flash: pricing varies, verify against current Google pricing. | | Estimated monthly cost | $5-50 (light usage, 10-50 active users) to $200-500 (heavy usage, 100+ users with tool-calling) | | Scaling driver | Chat messages per user, tool-calling rounds per message (max 5), capture triage volume | @@ -62,10 +62,10 @@ LLM costs are the highest-variance dimension. See `docs/ops/COST_HOTSPOT_REGISTR | Attribute | Value | |---|---| -| Billing source | AWS data transfer out, inter-AZ traffic (if multi-AZ) | +| Billing source | AWS data transfer out, inter-AZ traffic (if multi-AZ), Route 53 hosted zones and DNS queries | | Application metric | Response payload sizes (approximated from API metrics) | -| Estimated monthly cost | $1-10 (single-AZ, moderate traffic) | -| Scaling driver | API response volume, SignalR WebSocket traffic, export downloads | +| Estimated monthly cost | $1-10 (single-AZ, moderate traffic) + ~$0.50/hosted zone/month for DNS | +| Scaling driver | API response volume, SignalR WebSocket traffic, export downloads, DNS query volume | ### 6. CI/CD and Artifact Storage @@ -114,6 +114,8 @@ These are starting points for a small-team deployment. 
Adjust after the first 2- ### Alert Owners +For a solo-operator deployment (the current Taskdeck posture), the operator owns all cost dimensions. The table below applies when the team scales to multiple roles: + | Cost dimension | Primary owner | Escalation | |---|---|---| | Compute | Infrastructure lead | Project maintainers | @@ -123,8 +125,6 @@ These are starting points for a small-team deployment. Adjust after the first 2- | Network | Infrastructure lead | Project maintainers | | CI/CD | DevOps lead | Project maintainers | -For a solo-operator deployment, all ownership defaults to the operator. - --- ## Monthly Cost Review Workflow @@ -220,6 +220,15 @@ resource "aws_budgets_budget" "taskdeck_monthly" { limit_unit = "USD" time_unit = "MONTHLY" + time_period_start = "2026-04-01_00:00" + + # Optional: scope budget to specific resources using cost filters. + # Uncomment and adapt if the AWS account hosts non-Taskdeck resources. + # cost_filter { + # name = "TagKeyValue" + # values = ["user:Project$taskdeck-${var.environment}"] + # } + notification { comparison_operator = "GREATER_THAN" threshold = 70 diff --git a/docs/ops/COST_HOTSPOT_REGISTRY.md b/docs/ops/COST_HOTSPOT_REGISTRY.md index c59247e7d..8ee5657ff 100644 --- a/docs/ops/COST_HOTSPOT_REGISTRY.md +++ b/docs/ops/COST_HOTSPOT_REGISTRY.md @@ -50,7 +50,7 @@ Each hotspot follows this structure: | Chat with 5 tool rounds (max) | ~5,500 | ~1,200 | ~$0.00155 | | Capture triage (per item) | ~300 | ~150 | ~$0.00014 | -These estimates assume GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output). Gemini 2.5 Flash has similar pricing. Actual costs depend on conversation length, board context size, and tool result sizes. +These estimates assume approximate GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output) as a reference baseline. Gemini 2.5 Flash pricing differs and should be checked against current Google pricing. 
All provider prices should be verified against the current pricing pages at deployment time — LLM pricing changes frequently. Actual costs depend on conversation length, board context size, and tool result sizes. ### Monthly Projections From 7aad8f8557a7e2b7504009b78dd5322413913c54 Mon Sep 17 00:00:00 2001 From: Chris0Jeky Date: Thu, 9 Apr 2026 03:57:52 +0100 Subject: [PATCH 09/10] Fix critical and high-severity review findings in cost observability docs - C1: Replace phantom LlmToolCalling:MaxRounds config references with accurate information (MaxRounds is a compile-time constant, not configurable) across runbook and hotspot registry - C2: Correct kill-switch API endpoint from /api/llm/kill-switch to /api/llm/killswitch (no hyphen) matching LlmQuotaController - C3: Fix kill-switch request body schema (enabled not active, include target field) in runbook emergency actions - H1: Correct instance types to match Terraform baseline (t3.small dev, t3.medium staging, t3.large prod) with adjusted cost estimates - H2: Add warnings that Global/Surface kill-switch API returns 403 (admin role not yet implemented) with config-based workarounds - H3: Fix duplicate logging mitigation lever and correct direction (increase MetricExportIntervalSeconds to reduce frequency) - M1: Add execution instructions for config-based rate limit changes - M3: Clarify dual timeout structure (60s total, 30s per-round) --- docs/ops/BUDGET_BREACH_RUNBOOK.md | 14 +++++++------- docs/ops/CLOUD_COST_OBSERVABILITY.md | 4 ++-- docs/ops/COST_HOTSPOT_REGISTRY.md | 6 +++--- 3 files changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/ops/BUDGET_BREACH_RUNBOOK.md b/docs/ops/BUDGET_BREACH_RUNBOOK.md index 0e7fd76d6..a6b5a2e4a 100644 --- a/docs/ops/BUDGET_BREACH_RUNBOOK.md +++ b/docs/ops/BUDGET_BREACH_RUNBOOK.md @@ -72,7 +72,7 @@ Is the cost spike from LLM API usage? - Are there tool-calling loops (same tool called repeatedly with identical arguments)? 
- Is the `ClarificationDetector` being bypassed, causing extra rounds? 3. Check for configuration drift: - - Was `LlmToolCalling:MaxRounds` increased from the default? + - Was `LlmToolCalling:Enabled` changed? (Note: `MaxRounds` is a compile-time constant of 5 in `ToolCallingChatOrchestrator`, not configurable at runtime.) - Was `LlmQuota:GlobalBudgetCeilingTokens` raised or removed? - Was a more expensive model configured (e.g., GPT-4o instead of GPT-4o-mini)? 4. Check LLM provider dashboard (OpenAI/Gemini) for independent cost confirmation. @@ -118,12 +118,12 @@ Listed from least disruptive to most disruptive: | Priority | Action | Impact | How to execute | |---|---|---|---| -| 1 | Tighten global rate limits | All users get stricter quotas | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` globally (these are global config keys, not per-user); individual abusive users can be blocked entirely via per-user kill-switch | -| 2 | Reduce tool-calling rounds | Fewer tool calls per conversation, less capable but cheaper | Set `LlmToolCalling:MaxRounds` from 5 to 2-3 via config | +| 1 | Tighten global rate limits | All users get stricter quotas | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` in appsettings.json and restart API, or set environment variables (e.g., `LlmQuota__RequestsPerHour=30`) and restart. These are global config keys affecting all users. Individual abusive users can be blocked entirely via per-user kill-switch. | +| 2 | Reduce tool-calling rounds | Fewer tool calls per conversation, less capable but cheaper | `MaxRounds` is currently a compile-time constant (5) in `ToolCallingChatOrchestrator`. Requires code change and redeployment. **Gap**: consider making this configurable via `LlmToolCalling:MaxRounds` (see backlog). As a workaround, disable tool-calling entirely by setting `LlmToolCalling:Enabled` to `false` in appsettings and restarting. 
| | 3 | Switch to cheaper model | Potentially lower quality responses | Change `Llm:OpenAi:Model` to a cheaper variant | -| 4 | Activate surface kill-switch | One LLM surface disabled (e.g., Chat only) | `POST /api/llm/kill-switch` with `KillSwitchScope: Surface` | -| 5 | Activate per-user kill-switch | Specific abusive user blocked from LLM | `POST /api/llm/kill-switch` with `KillSwitchScope: Identity` | -| 6 | Activate global kill-switch | All LLM features disabled; non-LLM features unaffected | `POST /api/llm/kill-switch` with `KillSwitchScope: Global` | +| 4 | Activate surface kill-switch | One LLM surface disabled (e.g., Chat only) | **API requires admin role (not yet implemented; returns 403).** Workaround: add surface name to `LlmKillSwitch:KilledSurfaces` array in appsettings (e.g., `["Chat"]`) and restart API. When admin API is available: `POST /api/llm/killswitch` with body `{ "scope": "Surface", "target": "", "enabled": true, "reason": "..." }` | +| 5 | Activate per-user kill-switch | Specific abusive user blocked from LLM | `POST /api/llm/killswitch` with body `{ "scope": "Identity", "target": "", "enabled": true, "reason": "..." }` (caller can only target their own userId; admin cross-user support pending) | +| 6 | Activate global kill-switch | All LLM features disabled; non-LLM features unaffected | **API requires admin role (not yet implemented; returns 403).** Workaround: set `LlmKillSwitch:GlobalKill` to `true` in appsettings and restart API. When admin API is available: `POST /api/llm/killswitch` with body `{ "scope": "Global", "target": null, "enabled": true, "reason": "..." 
}` | | 7 | Switch all users to Mock provider | LLM features return deterministic mock responses | Set `Llm:Provider` to `Mock`, restart API | ### Logging Cost Mitigation Actions @@ -200,7 +200,7 @@ For use when immediate action is needed and there is no time for full triage: | Scenario | Immediate action | Command / Config | |---|---|---| -| LLM cost runaway | Activate global kill-switch | `POST /api/llm/kill-switch` — `{ "scope": "Global", "active": true, "reason": "Cost emergency" }` | +| LLM cost runaway | Activate global kill-switch | **Note**: Global scope requires admin role (not yet implemented via API; returns 403). **Workaround**: set `LlmKillSwitch:GlobalKill` to `true` in appsettings.json and restart API. If admin API is available: `POST /api/llm/killswitch` with body `{ "scope": "Global", "target": null, "enabled": true, "reason": "Cost emergency" }` | | Logging cost spike | Raise log level to Error | Set `Logging:LogLevel:Default` to `Error`, restart API | | Storage filling up | Identify and remove large files | `du -sh /var/lib/taskdeck/*` then assess | | Unknown cost source | Check AWS Cost Explorer | AWS Console → Billing → Cost Explorer → Group by Service | diff --git a/docs/ops/CLOUD_COST_OBSERVABILITY.md b/docs/ops/CLOUD_COST_OBSERVABILITY.md index 25c1c10b1..c71d2bb77 100644 --- a/docs/ops/CLOUD_COST_OBSERVABILITY.md +++ b/docs/ops/CLOUD_COST_OBSERVABILITY.md @@ -21,9 +21,9 @@ Cloud costs are tracked across six dimensions. 
Each dimension maps to a billing
 
 | Attribute | Value |
 |---|---|
 | Billing source | AWS EC2 on-demand or reserved instance hours |
-| Current baseline | Single `t3.medium` (dev), `t3.large` (staging/prod) per `DEPLOYMENT_TERRAFORM_BASELINE.md` |
+| Current baseline | Single `t3.small` (dev), `t3.medium` (staging), `t3.large` (prod) per Terraform env tfvars examples |
 | Application metric | None (infrastructure-level only) |
-| Estimated monthly cost | $30-70 (single-node, on-demand) |
+| Estimated monthly cost | $15-70 (single-node, on-demand: ~$15 t3.small, ~$30 t3.medium, ~$60 t3.large) |
 | Scaling driver | User concurrency, background worker load |
 
 ### 2. Storage (EBS + S3)

diff --git a/docs/ops/COST_HOTSPOT_REGISTRY.md b/docs/ops/COST_HOTSPOT_REGISTRY.md
index 8ee5657ff..391f80c3b 100644
--- a/docs/ops/COST_HOTSPOT_REGISTRY.md
+++ b/docs/ops/COST_HOTSPOT_REGISTRY.md
@@ -35,8 +35,8 @@ Each hotspot follows this structure:
 
 | Cost dimension | LLM API (OpenAI / Gemini) |
 | Estimated cost range | $5-50/month (10-50 users, light chat) to $200-500/month (100+ users, heavy tool-calling) |
 | Scaling behavior | **Superlinear** — each chat message may trigger 1-5 tool-calling rounds, each round is a full API call with growing context window. A single complex conversation can cost 5-10x a simple one. Capture triage adds per-item LLM cost. |
-| Current guardrails | Per-user rate limit: 60 req/hr. Per-user token limit: 100K tokens/day. Global budget ceiling config (`LlmQuota:GlobalBudgetCeilingTokens`). Tool-calling loop cap: 5 rounds, 60s timeout. Tool result truncation: 8KB max. Kill-switch (global/surface/per-user). Mock provider default (zero cost). |
-| Mitigation levers | 1. Reduce `LlmToolCalling:MaxRounds` (default 5 → 3). 2. Lower per-user token daily limit. 3. Switch high-volume users to Mock provider. 4. Activate surface-level kill-switch for Chat or CaptureTriage. 5. Reduce context window size (`BoardContextBuilder` budget). 6. Switch from GPT-4o-mini to a cheaper model. 7. Enable clarification detection to reduce wasted rounds (`ClarificationDetector`). |
+| Current guardrails | Per-user rate limit: 60 req/hr. Per-user token limit: 100K tokens/day. Global budget ceiling config (`LlmQuota:GlobalBudgetCeilingTokens`). Tool-calling loop cap: 5 rounds (compile-time constant), 60s total orchestration timeout, 30s per-round timeout. Tool result truncation: 8KB max (`LlmToolCalling:MaxToolResultBytes`). Kill-switch (global/surface/per-user). Mock provider default (zero cost). |
+| Mitigation levers | 1. Disable tool-calling entirely via `LlmToolCalling:Enabled` = `false` (note: `MaxRounds` is a compile-time constant of 5, not runtime-configurable — making it configurable is a backlog gap). 2. Lower per-user token daily limit (`LlmQuota:TokensPerDay`). 3. Switch high-volume users to Mock provider. 4. Activate surface-level kill-switch for Chat or CaptureTriage. 5. Reduce context window size (`BoardContextBuilder` budget). 6. Switch from GPT-4o-mini to a cheaper model (`Llm:OpenAi:Model`). 7. Enable clarification detection to reduce wasted rounds (`ClarificationDetector`). |
 | Action owner | Product/backend lead |
 | Risk level | **High** — highest variance cost component with no natural ceiling per conversation |
@@ -72,7 +72,7 @@ These estimates assume approximate GPT-4o-mini pricing ($0.15/1M input, $0.60/1M
 
 | Estimated cost range | $5-30/month (structured, sampled) to $100-300/month (verbose, unsampled) |
 | Scaling behavior | **Linear to superlinear** — log volume scales with request count. Verbose logging (DEBUG level) or high-cardinality trace attributes can cause 10-50x volume increase. Tool-calling conversations generate multiple log entries per round. |
 | Current guardrails | Configurable log level. Security logging redaction baseline (sanitized exceptions, generic error messages). Configurable OTLP exporter. Metric export interval configurable. |
-| Mitigation levers | 1. Set log level to `Warning` or `Error` in production. 2. Enable trace sampling (e.g., 10% of requests). 3. Reduce metric export interval. 4. Reduce `MetricExportIntervalSeconds`. 5. Set CloudWatch log retention to 14-30 days (not indefinite). 6. Exclude health-check endpoints from trace collection. 7. Cap log line length for tool-call results. |
+| Mitigation levers | 1. Set log level to `Warning` or `Error` in production (`Logging:LogLevel:Default`). 2. Enable trace sampling (e.g., 10% of requests). 3. Increase metric export interval to reduce export frequency (e.g., `Observability:MetricExportIntervalSeconds` from 30 to 120). 4. Set CloudWatch log retention to 14-30 days (not indefinite). 5. Exclude health-check endpoints from trace collection. 6. Cap log line length for tool-call results. |
 | Action owner | Infrastructure lead |
 | Risk level | **Medium** — predictable at low volume but can spike with verbose config or traffic surges |

From 4d3af4558889124bc14ba0c72f0724b6d0dca3b0 Mon Sep 17 00:00:00 2001
From: Chris0Jeky
Date: Sun, 12 Apr 2026 01:19:12 +0100
Subject: [PATCH 10/10] Fix BUDGET_BREACH_RUNBOOK.md Copilot review comments

- Clarify that per-user throttling uses Identity scope kill-switch (POST /api/llm/killswitch with scope: Identity), not LlmQuota config keys which are global-only
- Document that MaxRounds is a compile-time constant requiring code change and redeployment, not a runtime config knob
- Add notes that Global and Surface scope kill-switch operations return 403 until admin roles are implemented, with config fallback alternatives (LlmKillSwitch:GlobalKill, LlmKillSwitch:SurfaceKills)
---
 docs/ops/BUDGET_BREACH_RUNBOOK.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/ops/BUDGET_BREACH_RUNBOOK.md b/docs/ops/BUDGET_BREACH_RUNBOOK.md
index 9175916b5..7728ed126 100644
--- a/docs/ops/BUDGET_BREACH_RUNBOOK.md
+++ b/docs/ops/BUDGET_BREACH_RUNBOOK.md
@@ -118,12 +118,12 @@ Listed from least disruptive to most disruptive:
 
 | Priority | Action | Impact | How to execute |
 |---|---|---|---|
-| 1 | Tighten global rate limits | All users get stricter quotas | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` globally (these are global config keys, not per-user); individual abusive users can be blocked entirely via per-user kill-switch |
-| 2 | Reduce tool-calling rounds | Fewer tool calls per conversation, less capable but cheaper | Disable tool-calling via `LlmToolCalling:Enabled = false` or ship a code change to lower `ToolCallingChatOrchestrator.MaxRounds`; there is no runtime `MaxRounds` config knob today |
+| 1 | Tighten global rate limits | All users get stricter quotas | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` globally (these are global config keys, not per-user); individual abusive users can be blocked entirely via per-user Identity kill-switch (`POST /api/llm/killswitch` with `scope: Identity`) |
+| 2 | Reduce tool-calling rounds | Fewer tool calls per conversation, less capable but cheaper | Disable tool-calling via `LlmToolCalling:Enabled = false`; `MaxRounds` is a compile-time constant (`ToolCallingChatOrchestrator.MaxRounds = 5`) and cannot be changed at runtime -- lowering it requires a code change and redeployment |
 | 3 | Switch to cheaper model | Potentially lower quality responses | Change `Llm:OpenAi:Model` to a cheaper variant |
-| 4 | Activate surface kill-switch | One LLM surface disabled (e.g., Chat only) | `POST /api/llm/killswitch` with `{ "scope": "Surface", "target": "Chat", "enabled": true, "reason": "Cost emergency" }` (currently returns 403 until admin support exists) |
-| 5 | Activate per-user kill-switch | Specific abusive user blocked from LLM | `POST /api/llm/killswitch` with `{ "scope": "Identity", "target": "", "enabled": true, "reason": "Cost emergency" }` |
-| 6 | Activate global kill-switch | All LLM features disabled; non-LLM features unaffected | `POST /api/llm/killswitch` with `{ "scope": "Global", "target": null, "enabled": true, "reason": "Cost emergency" }` (currently returns 403 until admin support exists; use the `LlmKillSwitch__GlobalKill` config fallback where appropriate) |
+| 4 | Activate surface kill-switch | One LLM surface disabled (e.g., Chat only) | `POST /api/llm/killswitch` with `{ "scope": "Surface", "target": "Chat", "enabled": true, "reason": "Cost emergency" }` -- **Note:** Currently returns 403 until admin roles are implemented; use `LlmKillSwitch:SurfaceKills:` config as fallback |
+| 5 | Activate per-user kill-switch | Specific abusive user blocked from LLM | `POST /api/llm/killswitch` with `{ "scope": "Identity", "target": "", "enabled": true, "reason": "Cost emergency" }` -- users can only set this for themselves; admin-scoped cross-user blocking requires admin roles (not yet implemented) |
+| 6 | Activate global kill-switch | All LLM features disabled; non-LLM features unaffected | `POST /api/llm/killswitch` with `{ "scope": "Global", "target": null, "enabled": true, "reason": "Cost emergency" }` -- **Note:** Currently returns 403 until admin roles are implemented; use `LlmKillSwitch:GlobalKill=true` config as fallback |
 | 7 | Switch all users to Mock provider | LLM features return deterministic mock responses | Set `Llm:Provider` to `Mock`, restart API |
 
 ### Logging Cost Mitigation Actions
@@ -200,7 +200,7 @@ For use when immediate action is needed and there is no time for full triage:
 
 | Scenario | Immediate action | Command / Config |
 |---|---|---|
-| LLM cost runaway | Activate global kill-switch | `POST /api/llm/killswitch` - `{ "scope": "Global", "target": null, "enabled": true, "reason": "Cost emergency" }` |
+| LLM cost runaway | Activate global kill-switch | Set `LlmKillSwitch:GlobalKill=true` in config and restart API (the `POST /api/llm/killswitch` endpoint with `scope: Global` returns 403 until admin roles are implemented) |
 | Logging cost spike | Raise log level to Error | Set `Logging:LogLevel:Default` to `Error`, restart API |
 | Storage filling up | Identify and remove large files | `du -sh /var/lib/taskdeck/*` then assess |
 | Unknown cost source | Check AWS Cost Explorer | AWS Console → Billing → Cost Explorer → Group by Service |
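
For reviewers, the quick-reference "LLM cost runaway" action above can be sketched as a shell snippet. This is a minimal sketch, not part of the runbook itself: the `taskdeck-api` service name and `$TASKDECK_API` base URL are placeholders, while the config key, endpoint, and payload are the ones documented in `BUDGET_BREACH_RUNBOOK.md` (ASP.NET Core maps `__` in environment variables to `:` in config keys, matching the `LlmKillSwitch__GlobalKill` fallback named above).

```shell
# Config fallback: flip the global kill-switch via the environment-variable
# form of the LlmKillSwitch:GlobalKill key, then restart the API so the
# new configuration is picked up.
export LlmKillSwitch__GlobalKill=true
# e.g. sudo systemctl restart taskdeck-api   (service name is a placeholder)

# Endpoint form, usable once admin roles exist (returns 403 today):
payload='{ "scope": "Global", "target": null, "enabled": true, "reason": "Cost emergency" }'
# curl -sS -X POST "$TASKDECK_API/api/llm/killswitch" \
#   -H "Content-Type: application/json" -d "$payload"
echo "$payload"
```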