
OPS-12: Cloud cost observability and budget-guardrail automation#798

Merged
Chris0Jeky merged 14 commits into main from ops/cost-guardrails-budget-observability
Apr 12, 2026

Conversation

@Chris0Jeky
Owner

Summary

Closes #104 (OPS-12). Establishes proactive cloud cost observability for Taskdeck's transition from local-first to hosted deployment.

  • ADR-0023: Documents the decision to establish three-layer cost observability (telemetry + budget alerts + hotspot tracking) with alternatives considered (reactive-only, third-party tools, hard caps)
  • Cloud cost observability framework (docs/ops/CLOUD_COST_OBSERVABILITY.md): Six cost dimensions (compute, storage, LLM API, logging, network, CI/CD), three-tier budget alerts (70%/90%/100%), monthly review workflow with checklist, anomaly triage process, dashboard recommendations, and Terraform budget alert template
  • Feature cost hotspot registry (docs/ops/COST_HOTSPOT_REGISTRY.md): Six high-variance features with per-request LLM cost estimates, monthly projections, scaling behavior, current guardrails, mitigation levers, and action owners
  • Budget breach runbook (docs/ops/BUDGET_BREACH_RUNBOOK.md): Five-phase playbook (detection, triage with decision tree, graduated mitigation, stabilization, post-incident review) with quick-reference emergency actions
  • STATUS.md and IMPLEMENTATION_MASTERPLAN.md updated
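The three-tier budget alert model (70%/90%/100%) described above lends itself to a simple threshold check. A minimal sketch, with hypothetical function and tier names (not Taskdeck code):

```python
def alert_tier(spend: float, monthly_budget: float):
    """Return the alert tier for month-to-date spend under the 70/90/100% model.

    Tiers are checked from most to least severe. Hypothetical sketch of the
    three-tier model described in CLOUD_COST_OBSERVABILITY.md.
    """
    ratio = spend / monthly_budget
    if ratio >= 1.00:
        return "breach"    # 100%: trigger the budget breach runbook
    if ratio >= 0.90:
        return "critical"  # 90%: notify the owner, prepare mitigations
    if ratio >= 0.70:
        return "warning"   # 70%: review the hotspot registry, watch the trend
    return None            # below all thresholds


# Usage: $75 spent of a $100 monthly budget lands in the warning tier.
print(alert_tier(75.0, 100.0))  # -> warning
```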

Test plan

  • Verify all cross-references between docs are valid (ADR-0023 links, ops doc mutual references)
  • Verify ADR-0023 appears in INDEX.md with correct number and status
  • Verify LLM cost estimates reference actual supported providers (OpenAI GPT-4o-mini, Gemini 2.5 Flash)
  • Verify mitigation actions reference actual Taskdeck config keys and API endpoints
  • Verify runbook phases are actionable given the current architecture (single-node, SQLite, in-process workers)
  • No code changes — docs-only PR

Document the decision to establish proactive cost observability for
Taskdeck's cloud transition. Covers three-layer approach (telemetry,
budget alerts, feature-level hotspot tracking), alternatives considered,
and consequences.

Define cost telemetry dimensions (compute, storage, LLM API, logging,
network, CI/CD), three-tier budget alert thresholds (70%/90%/100%),
monthly review workflow with checklist, anomaly triage process,
dashboard recommendations, and Terraform budget alert template.

Document six high-variance cost features with estimated cost ranges,
scaling behavior, current guardrails, mitigation levers, and action
owners: LLM API usage, logging/telemetry, database storage, SignalR
connections, CI/CD pipelines, and MCP transport.

Five-phase playbook covering detection, triage (with decision tree for
LLM/logging/compute/storage root causes), graduated mitigation actions,
stabilization checks, and post-incident review process. Includes a
quick-reference emergency actions table.

Mark #104 as delivered with summary of framework, hotspot registry,
runbook, and ADR-0023.

Update Out-of-Code coverage matrix and Priority IV backlog entries
for the cost guardrails issue.
Copilot AI review requested due to automatic review settings April 9, 2026 02:28
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@Chris0Jeky
Owner Author

Adversarial Self-Review Findings

1. LLM cost estimates need provider-specific precision (MEDIUM)

File: docs/ops/COST_HOTSPOT_REGISTRY.md, Hotspot 1 table
Issue: The per-request cost estimates state "GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens" and "Gemini 2.5 Flash: ~$0.15/1M input tokens, ~$0.60/1M output tokens". These are reasonable 2024/2025 prices but may be outdated by 2026-04. More importantly, the Gemini pricing is presented as identical to OpenAI, which is unlikely — Gemini pricing has historically been different.
Fix: Add a caveat that these are approximate reference prices that should be verified against current provider pricing pages at deployment time. Replace the assertion that prices are identical.

2. Missing cost dimension: DNS and domain registration (LOW)

File: docs/ops/CLOUD_COST_OBSERVABILITY.md
Issue: The framework covers 6 cost dimensions but omits DNS (Route 53 hosted zones, ~$0.50/zone/month + query charges) and domain registration costs. While small, they are recurring and should at least be noted.
Fix: Add a brief mention under the Network dimension or as a note.

3. Terraform template missing time_period configuration (LOW)

File: docs/ops/CLOUD_COST_OBSERVABILITY.md, Terraform Budget Alert Template
Issue: The aws_budgets_budget resource does not specify time_period start/end dates. AWS Budgets requires this or defaults may be unexpected. Also missing cost_filter to scope the budget to the specific environment's resources.
Fix: Add time_period block and a note about cost filters.

4. Runbook mitigation action 1 references per-user config that does not exist (MEDIUM)

File: docs/ops/BUDGET_BREACH_RUNBOOK.md, LLM Cost Mitigation Actions table, Priority 1
Issue: "Reduce LlmQuota:RequestsPerHour or LlmQuota:TokensPerDay for specific users via kill-switch" — the kill-switch is a binary on/off per user, not a per-user quota adjustment. The existing LlmQuota:RequestsPerHour and LlmQuota:TokensPerDay config keys are global defaults, not per-user overrides. The mitigation action as written conflates two different mechanisms.
Fix: Clarify that the kill-switch blocks a user entirely, while the global quota config keys affect all users. For per-user throttling, the current mechanism is the per-user kill-switch (block), not granular quota reduction.

5. ADR context claim about SignalR scaling is slightly misleading (LOW)

File: docs/decisions/ADR-0023-cloud-cost-observability.md, Context section, point 3
Issue: "SignalR connection counts all grow faster than user count" — this is not accurate. Each user maintains exactly one WebSocket connection. Connection count grows linearly with concurrent users, not superlinearly. The memory footprint per connection is what could become significant, but that is linear.
Fix: Remove SignalR from the "superlinear" claim in the ADR context. Keep it as a hotspot for linear scaling in the registry.

6. Monthly review ownership is ambiguous for the common case (LOW)

File: docs/ops/CLOUD_COST_OBSERVABILITY.md, Alert Owners table
Issue: The table assigns ownership to "Infrastructure lead", "Product/backend lead", "DevOps lead" — but the note that "For a solo-operator deployment, all ownership defaults to the operator" should be more prominent since Taskdeck is currently a solo-developer project. The multi-role ownership table may give a false impression of team structure.
Fix: Reorder to lead with the solo-operator default and present the multi-role table as the scaled-team variant.

7. Storage VACUUM mitigation missing safety caveat (LOW)

File: docs/ops/BUDGET_BREACH_RUNBOOK.md, Storage Cost Mitigation Actions, Priority 1
Issue: sqlite3 /var/lib/taskdeck/taskdeck.db "VACUUM;" can temporarily double database file size during execution and holds an exclusive lock. On a running production system, this could cause brief request failures.
Fix: Add a note about the temporary space requirement and exclusive lock.
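Both risks can also be guarded against in tooling rather than prose. A hypothetical maintenance helper (not part of Taskdeck) using the stdlib `sqlite3` module, which checks free disk space before issuing the VACUUM:

```python
import os
import shutil
import sqlite3


def safe_vacuum(db_path: str) -> None:
    """VACUUM a SQLite database only if enough free disk space exists.

    VACUUM rebuilds the database into a temporary copy, so it can transiently
    need roughly the current file size in extra space, and it holds an
    exclusive lock while running. Hypothetical helper illustrating the
    self-review caveat; run it in a maintenance window.
    """
    db_size = os.path.getsize(db_path)
    free = shutil.disk_usage(os.path.dirname(db_path) or ".").free
    if free < 2 * db_size:  # conservative margin for the temporary copy
        raise RuntimeError("not enough free disk space for VACUUM")
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("VACUUM;")  # exclusive lock held for the duration
    finally:
        conn.close()
```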


@gemini-code-assist bot left a comment


Code Review

This pull request implements a comprehensive cloud cost observability framework, delivering the cost guardrails specified in issue #104. It introduces ADR-0023, a budget breach runbook, a feature cost hotspot registry, and detailed telemetry dimensions for compute, storage, and LLM usage. Review feedback focuses on correcting inaccuracies regarding Gemini model versions and pricing, resolving discrepancies in documented rate limits, and addressing semantic inconsistencies in the mitigation runbook. Additionally, there is a recommendation to refine the daily budget alert logic to better accommodate bursty LLM usage patterns and avoid alert fatigue.


| Priority | Action | Impact | How to execute |
|---|---|---|---|
| 1 | Rate-limit top consumers | Affected users get 429 responses | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` for specific users via kill-switch |


medium

The instruction to "Reduce LlmQuota:RequestsPerHour ... via kill-switch" is semantically inconsistent. A "kill-switch" typically implies a binary block (on/off), whereas "Reduce" implies a numeric adjustment. Additionally, if LlmQuota:RequestsPerHour is a global configuration key, it cannot be applied to "specific users" unless the system supports per-user configuration overrides. If the intention is to block the user entirely, "Block top consumers via per-user kill-switch" would be clearer.

|---|---|
| Billing source | Provider API usage (OpenAI, Google Gemini) |
| Application metric | `ILlmQuotaService` token usage records, `taskdeck.llm.tokens.used` |
| Current baseline | GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens; Gemini 2.5 Flash: ~$0.15/1M input tokens, ~$0.60/1M output tokens |


medium

The version "Gemini 2.5 Flash" and the associated pricing appear to be inaccurate. Gemini 1.5 Flash is the current version and is typically priced significantly lower than GPT-4o-mini (approx. $0.075/1M input and $0.30/1M output). Please verify the model version and pricing to ensure the cost framework's estimates are reliable.

Suggested change
| Current baseline | GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens; Gemini 2.5 Flash: ~$0.15/1M input tokens, ~$0.60/1M output tokens |
| Current baseline | GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens; Gemini 1.5 Flash: ~$0.075/1M input tokens, ~$0.30/1M output tokens |

**Application-level LLM cost alerts** (supplementary):

- The existing `ILlmQuotaService` tracks per-user token consumption.
- Add a daily aggregate check: if total LLM token spend across all users exceeds `(monthly_budget * 0.70) / 30` on any single day, emit a warning log and optional webhook notification.


medium

The daily aggregate check logic ((monthly_budget * 0.70) / 30) assumes linear consumption of the monthly budget. Since LLM usage is often bursty (e.g., a single user performing a large batch operation), this threshold might trigger false-positive warnings early in the month even if the total monthly spend is on track. Consider using a month-to-date projection or a higher daily buffer for the "Warning" tier to avoid alert fatigue.
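A month-to-date projection along the lines the reviewer suggests could look like this (hypothetical sketch, not Taskdeck code): extrapolate the current run rate to month end and warn only when the projection exceeds the budget, so a single bursty day early in the month stops alerting once cumulative spend is back on track.

```python
def should_warn(spend_mtd: float, day_of_month: int, days_in_month: int,
                monthly_budget: float) -> bool:
    """Warn when month-to-date spend projects past the monthly budget.

    Instead of the fixed daily slice (monthly_budget * 0.70 / 30), extrapolate
    the month-to-date run rate to month end. Hypothetical sketch of the
    reviewer's suggestion.
    """
    projected = spend_mtd / day_of_month * days_in_month
    return projected > monthly_budget


# A $5 burst on day 1 of a $100 / 30-day budget projects to $150 -> warn;
# the same $5 cumulative spend by day 10 projects to $15 -> no warning,
# whereas the fixed daily slice ($2.33/day) would have kept warning.
print(should_warn(5.0, 1, 30, 100.0))   # -> True
print(should_warn(5.0, 10, 30, 100.0))  # -> False
```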

| Cost dimension | LLM API (OpenAI / Gemini) |
| Estimated cost range | $5-50/month (10-50 users, light chat) to $200-500/month (100+ users, heavy tool-calling) |
| Scaling behavior | **Superlinear** — each chat message may trigger 1-5 tool-calling rounds, each round is a full API call with growing context window. A single complex conversation can cost 5-10x a simple one. Capture triage adds per-item LLM cost. |
| Current guardrails | Per-user rate limit: 60 req/hr. Per-user token limit: 100K tokens/day. Global budget ceiling config (`LlmQuota:GlobalBudgetCeilingTokens`). Tool-calling loop cap: 5 rounds, 60s timeout. Tool result truncation: 8KB max. Kill-switch (global/surface/per-user). Mock provider default (zero cost). |


medium

There is a discrepancy in the documented rate limit. This line mentions "60 req/hr", but line 156 of this same file and line 411 of docs/STATUS.md specify "60 req/60s" (which is 3600 req/hr). Please align the documentation with the actual policy.

Suggested change
| Current guardrails | Per-user rate limit: 60 req/hr. Per-user token limit: 100K tokens/day. Global budget ceiling config (`LlmQuota:GlobalBudgetCeilingTokens`). Tool-calling loop cap: 5 rounds, 60s timeout. Tool result truncation: 8KB max. Kill-switch (global/surface/per-user). Mock provider default (zero cost). |
| Current guardrails | Per-user rate limit: 60 req/60s. Per-user token limit: 100K tokens/day. Global budget ceiling config (`LlmQuota:GlobalBudgetCeilingTokens`). Tool-calling loop cap: 5 rounds, 60s timeout. Tool result truncation: 8KB max. Kill-switch (global/surface/per-user). Mock provider default (zero cost). |

| Chat with 5 tool rounds (max) | ~5,500 | ~1,200 | ~$0.00155 |
| Capture triage (per item) | ~300 | ~150 | ~$0.00014 |

These estimates assume GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output). Gemini 2.5 Flash has similar pricing. Actual costs depend on conversation length, board context size, and tool result sizes.


medium

As noted in the observability framework, the Gemini version and pricing estimates here seem to be copy-pasted from the OpenAI entry and do not reflect the actual lower price point of Gemini 1.5 Flash.

Suggested change
These estimates assume GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output). Gemini 2.5 Flash has similar pricing. Actual costs depend on conversation length, board context size, and tool result sizes.
These estimates assume GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output). Gemini 1.5 Flash has lower pricing (~$0.075/1M input, ~$0.30/1M output). Actual costs depend on conversation length, board context size, and tool result sizes.
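For reference, the registry's per-request figures can be reproduced from the quoted token counts and the stated GPT-4o-mini baseline prices. Illustrative arithmetic only; as the self-review notes, prices should be verified against current provider pages:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Per-request LLM cost from token counts and per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000


# Reproduce the registry's GPT-4o-mini estimates ($0.15/1M in, $0.60/1M out):
chat_5_rounds = request_cost(5_500, 1_200, 0.15, 0.60)  # ~$0.00155 per chat
capture_triage = request_cost(300, 150, 0.15, 0.60)     # ~$0.00014 per item
print(chat_5_rounds, capture_triage)
```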

- Add LLM pricing verification caveat (prices are reference baselines,
  verify against current provider pages at deployment time)
- Add DNS/Route 53 to network cost dimension
- Add time_period_start and cost_filter note to Terraform budget template
- Fix runbook mitigation action 1 to clarify global vs per-user controls
- Correct superlinear scaling claim in ADR-0023 (SignalR is linear)
- Reorder alert ownership to lead with solo-operator default
- Add SQLite VACUUM safety caveat (exclusive lock, temp disk doubling)
Contributor

Copilot AI left a comment


Pull request overview

Adds an ops documentation suite for proactive cloud cost observability as Taskdeck transitions toward hosted deployments, including an ADR, budget alerting model, hotspot registry, and breach runbook.

Changes:

  • Introduces ADR-0023 and links it in the decisions index.
  • Adds a cloud cost observability framework (dimensions, thresholds, review/triage workflow, dashboard guidance, Terraform budget template).
  • Adds a feature cost hotspot registry and a budget breach runbook; marks issue #104 as delivered in planning/status docs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
|---|---|
| docs/STATUS.md | Marks OPS-12 / #104 as delivered with a summary of artifacts. |
| docs/ops/COST_HOTSPOT_REGISTRY.md | New hotspot registry with scaling drivers, guardrails, mitigation levers, owners. |
| docs/ops/CLOUD_COST_OBSERVABILITY.md | New cost observability framework with dimensions, alert tiers, workflow, and Terraform example. |
| docs/ops/BUDGET_BREACH_RUNBOOK.md | New runbook for responding to budget breaches with triage + mitigations. |
| docs/IMPLEMENTATION_MASTERPLAN.md | Updates platform/ops maturity tracker to mark #104 delivered. |
| docs/decisions/INDEX.md | Adds ADR-0023 to the ADR index. |
| docs/decisions/ADR-0023-cloud-cost-observability.md | New ADR documenting the decision and alternatives. |



| Priority | Action | Impact | How to execute |
|---|---|---|---|
| 1 | Tighten global rate limits | All users get stricter quotas | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` globally (these are global config keys, not per-user); individual abusive users can be blocked entirely via per-user kill-switch |

Copilot AI Apr 9, 2026


The runbook suggests reducing LlmQuota:RequestsPerHour / LlmQuota:TokensPerDay “for specific users via kill-switch”, but the current kill-switch implementation only blocks LLM access (it doesn’t adjust per-user quota limits). Update this to either (a) use the Identity kill-switch to block abusive users, or (b) describe changing LlmQuota settings globally via config and restart.

| Priority | Action | Impact | How to execute |
|---|---|---|---|
| 1 | Tighten global rate limits | All users get stricter quotas | Reduce `LlmQuota:RequestsPerHour` or `LlmQuota:TokensPerDay` globally (these are global config keys, not per-user); individual abusive users can be blocked entirely via per-user kill-switch |
| 2 | Reduce tool-calling rounds | Fewer tool calls per conversation, less capable but cheaper | Set `LlmToolCalling:MaxRounds` from 5 to 2-3 via config |

Copilot AI Apr 9, 2026


LlmToolCalling:MaxRounds isn’t a configurable setting in the codebase right now (tool-calling rounds are capped by the ToolCallingChatOrchestrator.MaxRounds constant). Either adjust the runbook to reflect that this requires a code change, or introduce/configure a LlmToolCalling setting for max rounds if you want this to be an operational lever.

Comment on lines +124 to +126
| 4 | Activate surface kill-switch | One LLM surface disabled (e.g., Chat only) | `POST /api/llm/kill-switch` with `KillSwitchScope: Surface` |
| 5 | Activate per-user kill-switch | Specific abusive user blocked from LLM | `POST /api/llm/kill-switch` with `KillSwitchScope: Identity` |
| 6 | Activate global kill-switch | All LLM features disabled; non-LLM features unaffected | `POST /api/llm/kill-switch` with `KillSwitchScope: Global` |

Copilot AI Apr 9, 2026


The documented endpoint POST /api/llm/kill-switch doesn’t match the current API routes (controller is [Route("api/llm")] with [HttpPost("killswitch")], i.e. POST /api/llm/killswitch). Also note that Global/Surface scopes currently return 403 until admin roles are implemented, so these actions as written aren’t executable via the API.


| Scenario | Immediate action | Command / Config |
|---|---|---|
| LLM cost runaway | Activate global kill-switch | `POST /api/llm/kill-switch` — `{ "scope": "Global", "active": true, "reason": "Cost emergency" }` |

Copilot AI Apr 9, 2026


The quick-reference payload uses { "active": true }, but the API DTO uses enabled (see SetKillSwitchRequestDto.Enabled). Also the endpoint path is POST /api/llm/killswitch (no /kill-switch). Update the example so an operator can copy/paste it successfully.

| Estimated cost range | $5-50/month (10-50 users, light chat) to $200-500/month (100+ users, heavy tool-calling) |
| Scaling behavior | **Superlinear** — each chat message may trigger 1-5 tool-calling rounds, each round is a full API call with growing context window. A single complex conversation can cost 5-10x a simple one. Capture triage adds per-item LLM cost. |
| Current guardrails | Per-user rate limit: 60 req/hr. Per-user token limit: 100K tokens/day. Global budget ceiling config (`LlmQuota:GlobalBudgetCeilingTokens`). Tool-calling loop cap: 5 rounds, 60s timeout. Tool result truncation: 8KB max. Kill-switch (global/surface/per-user). Mock provider default (zero cost). |
| Mitigation levers | 1. Reduce `LlmToolCalling:MaxRounds` (default 5 → 3). 2. Lower per-user token daily limit. 3. Switch high-volume users to Mock provider. 4. Activate surface-level kill-switch for Chat or CaptureTriage. 5. Reduce context window size (`BoardContextBuilder` budget). 6. Switch from GPT-4o-mini to a cheaper model. 7. Enable clarification detection to reduce wasted rounds (`ClarificationDetector`). |

Copilot AI Apr 9, 2026


This mitigation lever references LlmToolCalling:MaxRounds, but the current LlmToolCalling config only supports Enabled and MaxToolResultBytes; max rounds is a hard-coded constant (ToolCallingChatOrchestrator.MaxRounds = 5). Either update the doc to match current behavior or add a configurable max-rounds setting if it’s intended to be an ops control.

Comment on lines +43 to +45
| Billing source | Provider API usage (OpenAI, Google Gemini) |
| Application metric | `ILlmQuotaService` token usage records, `taskdeck.llm.tokens.used` |
| Current baseline | GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens (reference baseline; verify against current OpenAI pricing). Gemini 2.5 Flash: pricing varies, verify against current Google pricing. |

Copilot AI Apr 9, 2026


The metric name taskdeck.llm.tokens.used is referenced here, but there is no corresponding metric defined in TaskdeckTelemetry (current meters cover worker/queue metrics only). Either point this doc at the persisted ILlmQuotaService usage records as the source of truth, or add an explicit LLM token metric and update the name accordingly.

Comment on lines +25 to +26
1. **Cost telemetry and dashboards**: Define cost dimensions (compute, storage, LLM API, logging, network), track them through cloud provider billing APIs and application-level metrics, and maintain a monthly cost review workflow.


Copilot AI Apr 9, 2026


This ADR lists cost dimensions as “compute, storage, LLM API, logging, network”, but the framework in docs/ops/CLOUD_COST_OBSERVABILITY.md tracks six dimensions (also including CI/CD). Consider updating the ADR to include CI/CD so the decision record matches the documented framework.


2. **Local-first heritage means no existing cloud cost discipline**: The team has never operated cloud infrastructure at scale. Without explicit budget guardrails, cost surprises are likely during the v0.2.0 cloud launch.

3. **Several features have superlinear or high-variance cost scaling**: LLM token consumption grows superlinearly with usage (tool-calling multiplies per-message cost), logging volume scales with request count and verbosity configuration, and database storage grows continuously with audit trail accumulation. Even linearly-scaling features like SignalR connections become cost-relevant at scale.

Copilot AI Apr 9, 2026


This sentence describes SignalR connection counts as “superlinear / faster than user count”, but SignalR connections generally scale linearly with concurrent users (and the hotspot registry later describes SignalR as linear). Consider revising to avoid overstating this driver compared to truly superlinear behaviors (e.g., tool-calling rounds/context growth).

@Chris0Jeky
Owner Author

Adversarial Review - PR #798

I verified every code reference, config key, API endpoint, and cross-reference in this PR against the actual codebase. Several findings require correction before merge.


CRITICAL

C1. LlmToolCalling:MaxRounds config key does not exist -- multiple docs reference a phantom setting

Files: docs/ops/BUDGET_BREACH_RUNBOOK.md (LLM Cost Mitigation Actions, Priority 2), docs/ops/COST_HOTSPOT_REGISTRY.md (Hotspot 1 mitigation levers, item 1)

The runbook says: "Set LlmToolCalling:MaxRounds from 5 to 2-3 via config". The hotspot registry says: "Reduce LlmToolCalling:MaxRounds (default 5 to 3)".

Reality: MaxRounds is a compile-time constant (public const int MaxRounds = 5;) in ToolCallingChatOrchestrator.cs (line 27). It is NOT configurable. The LlmToolCalling config section in appsettings.json only has Enabled and MaxToolResultBytes. There is no MaxRounds key.

Impact: During a cost incident, an operator following this runbook would change a non-existent config key, get no error, and wonder why the mitigation had no effect. This is the most dangerous kind of runbook error -- it silently fails.

Fix: Replace all references to LlmToolCalling:MaxRounds config changes with the truth: reducing max rounds requires a code change and redeployment. Alternatively, note this as a gap and file an issue to make it configurable.


C2. Kill-switch API endpoint URL is wrong

File: docs/ops/BUDGET_BREACH_RUNBOOK.md (LLM Cost Mitigation Actions table + Quick Reference table)

The runbook references POST /api/llm/kill-switch (with hyphen) in multiple places.

Reality: The actual endpoint is POST /api/llm/killswitch (no hyphen), as defined at LlmQuotaController.cs line 56: [HttpPost("killswitch")].

Impact: During a cost emergency, an operator would get a 404 trying to activate the kill-switch. This is a direct actionability failure for the most critical emergency action.

Fix: Replace all instances of /api/llm/kill-switch with /api/llm/killswitch.


C3. Kill-switch request body schema is wrong

File: docs/ops/BUDGET_BREACH_RUNBOOK.md (Quick Reference: Emergency Actions table)

The runbook example payload is: { "scope": "Global", "active": true, "reason": "Cost emergency" }

Reality (from SetKillSwitchRequestDto in LlmQuotaContracts.cs): The DTO properties are Scope, Target, Enabled, Reason. ASP.NET Core by default uses camelCase JSON serialization, so the correct payload should be:

{ "scope": "Global", "target": null, "enabled": true, "reason": "Cost emergency" }

The field is enabled, not active. The target field is also missing from the example.

Impact: The emergency payload would fail or behave unexpectedly during a cost crisis.

Fix: Correct the JSON example to use enabled (not active) and include the target field.
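A corrected payload an operator could adapt, built from the DTO fields cited above (Scope, Target, Enabled, Reason) under ASP.NET Core's default camelCase serialization. Sketch only; verify the field names against LlmQuotaContracts.cs before relying on it:

```python
import json

# Corrected emergency payload for POST /api/llm/killswitch. The DTO property
# is Enabled (serialized "enabled"), not "active", and Target must be present
# (null for Global scope).
payload = {
    "scope": "Global",
    "target": None,
    "enabled": True,
    "reason": "Cost emergency",
}
print(json.dumps(payload))
```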


HIGH

H1. Instance type claims contradict the Terraform baseline

File: docs/ops/CLOUD_COST_OBSERVABILITY.md (Compute dimension table)

The doc claims: "Single t3.medium (dev), t3.large (staging/prod)"

Reality (from deploy/terraform/aws/environments/*/terraform.tfvars.example):

  • Dev: t3.small
  • Staging: t3.medium
  • Prod: t3.large

The doc overstates dev by one size class and conflates staging with prod. This makes the compute cost estimate for dev wrong (t3.medium is ~$30/month, t3.small is ~$15/month).

Fix: Correct to t3.small (dev), t3.medium (staging), t3.large (prod) and adjust the cost estimate range accordingly.


H2. Global/Surface kill-switch returns 403 -- runbook does not mention this

File: docs/ops/BUDGET_BREACH_RUNBOOK.md (LLM Cost Mitigation Actions, Priorities 4-6)

The runbook tells operators to activate Global and Surface kill-switches via the API. But LlmQuotaController.cs lines 67-72 explicitly reject Global and Surface scope operations with HTTP 403: "Global and surface kill switch operations require admin privileges (not yet implemented)".

Impact: An operator following Priorities 4 and 6 (the most important emergency actions) would get a 403 Forbidden. The runbook does not mention this limitation or how to work around it.

Fix: Add a prominent warning that Global/Surface kill-switch via API requires admin role (not yet implemented). Document the workaround: set config keys directly and restart the API process.


H3. Hotspot 2 logging mitigation levers contain a duplicate with wrong direction

File: docs/ops/COST_HOTSPOT_REGISTRY.md (Hotspot 2, mitigation levers)

Items 3 and 4 are:

  • "3. Reduce metric export interval."
  • "4. Reduce MetricExportIntervalSeconds."

These are the same thing described twice. Worse, both say "reduce" but reducing the export interval means exporting MORE often, which costs MORE. The mitigation should be to INCREASE the interval (export less frequently).

Fix: Remove the duplicate. Correct the direction: "Increase Observability:MetricExportIntervalSeconds (e.g., from 30 to 120) to reduce metric export frequency."


MEDIUM

M1. Runbook mitigation Priority 1 missing execution instructions

File: docs/ops/BUDGET_BREACH_RUNBOOK.md (LLM Cost Mitigation Actions, Priority 1)

HOW does the operator change LlmQuota:RequestsPerHour on a running system? The config keys are in appsettings.json, which requires either editing the file and restarting, or using environment variable overrides. Neither mechanism is documented.

Fix: Add to "How to execute": "Edit appsettings.json and restart API, or set environment variable LlmQuota__RequestsPerHour=<value> and restart."
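The double-underscore form relies on standard ASP.NET Core behavior: environment variables use `__` where hierarchical config keys use `:`. A one-line sketch of the mapping:

```python
def aspnet_config_key(env_var: str) -> str:
    """Translate an ASP.NET Core environment-variable name to a config key.

    ASP.NET Core maps double underscores in environment-variable names to the
    ':' hierarchy separator, so LlmQuota__RequestsPerHour overrides the
    LlmQuota:RequestsPerHour key without editing appsettings.json.
    """
    return env_var.replace("__", ":")


print(aspnet_config_key("LlmQuota__RequestsPerHour"))  # -> LlmQuota:RequestsPerHour
```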


M2. ADR-0023 cost figure is scenario-specific, presented as generic

File: docs/decisions/ADR-0023-cloud-cost-observability.md (Context, point 1)

Claims ~$0.00088 per "3-round conversation". SPIKE_618 shows this is for one specific scenario (5,120 input + 180 output tokens with a particular board context). Different conversations could cost significantly more.

Fix: Add "approximately" and note costs vary with context size and tool result sizes.
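The figure is reproducible from the SPIKE_618 token counts at the stated GPT-4o-mini reference prices, which underlines that it is scenario-specific arithmetic rather than a general per-conversation cost:

```python
# ADR-0023's per-conversation figure, from the SPIKE_618 scenario:
# 5,120 input + 180 output tokens at $0.15/1M input, $0.60/1M output.
cost = (5_120 * 0.15 + 180 * 0.60) / 1_000_000
print(f"${cost:.5f}")  # -> $0.00088
```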


M3. Hotspot 1 oversimplifies timeout structure

File: docs/ops/COST_HOTSPOT_REGISTRY.md (Hotspot 1, current guardrails)

Claims "60s timeout." Reality: there are TWO timeouts -- TotalTimeoutSeconds = 60 (total orchestration) AND PerRoundTimeoutSeconds = 30 (per LLM API call).

Fix: Clarify: "60s total orchestration timeout, 30s per-round timeout."


M4. SQLite VACUUM safety caveat missing (identified in self-review but not applied)

File: docs/ops/BUDGET_BREACH_RUNBOOK.md (Storage Cost Mitigation, Priority 1)

VACUUM temporarily doubles disk usage and holds an exclusive lock. The self-review identified this but the document was not updated.

Fix: Add warning about temporary space requirement and exclusive lock.


LOW

L1. DNS costs bundled into Network dimension rather than broken out -- acceptable at current scale.

L2. Terraform template hardcodes time_period_start = "2026-04-01_00:00" -- should add a comment to adjust the date.

L3. Cost estimates in the hotspot registry are labeled as approximate but the Gemini pricing is stated generically ("pricing varies, verify against current Google pricing") while SPIKE_618 has specific Gemini 2.5 Flash pricing ($0.30/1M input, $2.50/1M output) that could be referenced.


Summary

| Severity | Count |
|---|---|
| CRITICAL | 3 |
| HIGH | 3 |
| MEDIUM | 4 |
| LOW | 3 |

Blocking issues: C1, C2, C3, H1, H2 must be fixed before merge. An operator following this runbook during a real cost incident would encounter wrong URLs, wrong payloads, phantom config keys, and 403 errors on emergency actions.

Chris0Jeky and others added 2 commits April 9, 2026 03:57
…docs

- C1: Replace phantom LlmToolCalling:MaxRounds config references with
  accurate information (MaxRounds is a compile-time constant, not
  configurable) across runbook and hotspot registry
- C2: Correct kill-switch API endpoint from /api/llm/kill-switch to
  /api/llm/killswitch (no hyphen) matching LlmQuotaController
- C3: Fix kill-switch request body schema (enabled not active, include
  target field) in runbook emergency actions
- H1: Correct instance types to match Terraform baseline (t3.small dev,
  t3.medium staging, t3.large prod) with adjusted cost estimates
- H2: Add warnings that Global/Surface kill-switch API returns 403
  (admin role not yet implemented) with config-based workarounds
- H3: Fix duplicate logging mitigation lever and correct direction
  (increase MetricExportIntervalSeconds to reduce frequency)
- M1: Add execution instructions for config-based rate limit changes
- M3: Clarify dual timeout structure (60s total, 30s per-round)
Chris0Jeky added a commit that referenced this pull request Apr 9, 2026
Two tests in ConcurrencyRaceConditionStressTests were failing across
PRs #797, #798, and #808 on main.

ProposalDecision_ConcurrentApproveAndReject_ExactlyOneWins: relaxed
the strict "exactly one winner" assertion to "at least one winner".
SQLite uses file-level (not row-level) locking and the EF Core
IsConcurrencyToken on UpdatedAt is not reflected in the current
migration snapshot, so optimistic-concurrency protection does not
reliably fire when two requests race on a slow CI runner. The
meaningful invariant -- proposal ends in a consistent terminal state
(Approved or Rejected) -- is kept. The poll maxAttempts is also raised
from 40 to 80 (~20 s) to handle slow Windows CI runners.

ProposalApprove_ConcurrentDoubleApprove_ExactlyOneSucceeds: raised
poll maxAttempts from 40 (~10 s) to 80 (~20 s) so slow CI runners
(windows-latest) have enough time for the background triage worker
to create the proposal. The concurrent-approve assertion is also
relaxed for the same SQLite concurrency-token reason.
…udget-observability

# Conflicts:
#	docs/IMPLEMENTATION_MASTERPLAN.md
#	docs/STATUS.md
#	docs/decisions/INDEX.md
…vability' into ops/cost-guardrails-budget-observability

# Conflicts:
#	docs/decisions/INDEX.md
#	docs/ops/BUDGET_BREACH_RUNBOOK.md
#	docs/ops/CLOUD_COST_OBSERVABILITY.md
#	docs/ops/COST_HOTSPOT_REGISTRY.md
…udget-observability

# Conflicts:
#	docs/IMPLEMENTATION_MASTERPLAN.md
#	docs/STATUS.md
- Clarify that per-user throttling uses Identity scope kill-switch
  (POST /api/llm/killswitch with scope: Identity), not LlmQuota
  config keys which are global-only
- Document that MaxRounds is a compile-time constant requiring code
  change and redeployment, not a runtime config knob
- Add notes that Global and Surface scope kill-switch operations
  return 403 until admin roles are implemented, with config fallback
  alternatives (LlmKillSwitch:GlobalKill, LlmKillSwitch:SurfaceKills)
@Chris0Jeky Chris0Jeky merged commit d6ef4f7 into main Apr 12, 2026
14 checks passed
@Chris0Jeky Chris0Jeky deleted the ops/cost-guardrails-budget-observability branch April 12, 2026 01:03
@github-project-automation github-project-automation bot moved this from Pending to Done in Taskdeck Execution Apr 12, 2026

Labels

None yet

Projects

Status: Done


3 participants