
feat: add Application Insights for OTEL tracing and uat->staging rename#68

Merged
JustAGhosT merged 11 commits into dev from feat/infra-alignment on Mar 15, 2026

Conversation

@JustAGhosT (Contributor) commented Mar 15, 2026

Summary

  • Adds Application Insights resource for request-to-token attribution (OTEL tracing)
  • Completes uat → staging rename across all docs and scripts
  • Updates planning doc with completed Phase 1 & 2 status

Changes

Infrastructure

  • infra/modules/aigateway_aca/main.tf: Add Application Insights resource, Key Vault secret, and container app env var
  • infra/modules/aigateway_aca/outputs.tf: Add application_insights_connection_string output
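The module additions can be sketched in Terraform as follows. This is a hypothetical sketch, not the actual module code: the resource names, variable names, and Key Vault wiring are assumptions based on the file summaries above.

```hcl
# Hypothetical sketch; actual names in infra/modules/aigateway_aca may differ.
resource "azurerm_application_insights" "gateway" {
  name                = "${var.name_prefix}-appinsights"
  location            = var.location
  resource_group_name = var.resource_group_name
  application_type    = "web"
}

# Store the connection string in Key Vault so the container app can reference it.
resource "azurerm_key_vault_secret" "appinsights_connection_string" {
  name         = "appinsights-connection-string"
  value        = azurerm_application_insights.gateway.connection_string
  key_vault_id = var.key_vault_id
}

# outputs.tf: expose the connection string as a sensitive output.
output "application_insights_connection_string" {
  value     = azurerm_application_insights.gateway.connection_string
  sensitive = true
}
```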

Documentation

  • Complete uat→staging rename in README.md, docs/, scripts/, GitHub Actions
  • Update planning doc acceptance criteria and action items

Testing

  • Terraform syntax validated
  • Requires terraform apply to create App Insights resource
  • OTEL export works with either OTLP collector endpoint or Azure Monitor exporter
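The "either OTLP collector or Azure Monitor exporter" behavior can be sketched as a small selection helper. This is hypothetical; the gateway's actual startup logic is not shown in this PR, and the preference order is an assumption. Only the two environment variable names come from the PR itself:

```javascript
// Hypothetical sketch of OTEL exporter selection. The selection policy
// (prefer an explicit OTLP endpoint, fall back to Azure Monitor) is an
// assumption; only the variable names appear in this PR.
function pickOtelExporter(env) {
  if (env.OTEL_EXPORTER_OTLP_ENDPOINT) {
    return { kind: "otlp", endpoint: env.OTEL_EXPORTER_OTLP_ENDPOINT };
  }
  if (env.APPLICATIONINSIGHTS_CONNECTION_STRING) {
    return { kind: "azure-monitor" };
  }
  return { kind: "none" }; // tracing disabled when neither is configured
}
```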

Risk

  • Low: Infrastructure additions only, no breaking changes

Summary by CodeRabbit

  • New Features

    • Application Insights integrated for OpenTelemetry export and improved observability.
    • New environment deployment workflow to standardize deploys and runtime smoke tests.
    • Extensive architecture and SLM system reference documentation added.
  • Updates

    • Renamed deployment environment from UAT to Staging across CI/CD, infra, scripts, and runtime logic.
  • Documentation

    • Many guides and READMEs added/updated to reflect Staging and new deployment guidance.
  • Chores

    • Environment validation and scripts updated to include Staging.

Renamed all infrastructure files from UAT to staging and updated all environment references in configuration files and validation rules to use "staging" instead of "uat".
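The validation-rule updates described above presumably follow the standard Terraform variable-validation pattern; a hypothetical sketch (the actual variable blocks may differ):

```hcl
variable "env" {
  type        = string
  description = "Deployment environment (dev, staging, or prod)."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.env)
    error_message = "env must be one of: dev, staging, prod."
  }
}
```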
Replaces all instances of "uat" with "staging" in deployment workflow and documentation, including GitHub Actions job names, environment references, and PR label triggers.
…ging rename

- Add Application Insights resource for trace storage
- Add appinsights connection string to Key Vault and Container App env
- Add OTEL configuration for azure-monitor-opentelemetry exporter
- Complete uat->staging rename across all docs and scripts
- Update planning doc with completed Phase 1 & 2 status

coderabbitai bot commented Mar 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR renames the UAT environment to staging across CI/CD, docs, scripts, and Terraform; adds Application Insights telemetry (resource, Key Vault secret, container app env var, and sensitive output) to the aigateway Terraform module; and adds many new architecture/SLM documentation files.

Changes

Cohort / File(s) Summary
GitHub CI/CD & Actions
/.github/actions/import-container-app/action.yml, /.github/workflows/deploy.yaml, /.github/pull_request_template.md, /.github/workflows/deploy-environment.yaml
Replace UAT → staging in inputs/labels/conditions; refactor deploy jobs to use deploy-environment.yaml; add environment-specific parameters and working dirs.
Terraform: module telemetry
infra/modules/aigateway_aca/main.tf, infra/modules/aigateway_aca/outputs.tf
Add azurerm_application_insights, create Key Vault secret for connection string, inject secret into azurerm_container_app env var, and add sensitive output exposing the connection string.
Terraform: env validation & vars
infra/env/*/variables.tf, infra/env/staging/terraform.tfvars
Change allowed env values and validation messages from uat → staging; update staging tfvars and tags.env.
Terraform: module variable updates
infra/modules/.../variables.tf
infra/modules/dashboard_aca/variables.tf, infra/modules/state_service_aca/variables.tf, infra/modules/aigateway_aca/variables.tf
Update env variable descriptions, validation lists, and error messages to include staging.
Scripts & tooling
infra/scripts/terraform-init.ps1, infra/scripts/terraform-init.sh, scripts/add-federated-credentials.sh, scripts/bootstrap.ps1, scripts/bootstrap.sh
Update usage text, validation sets, loops, subject examples, and messages to use staging instead of uat; small exit/usage behavior tweak in add-federated-credentials.
Application / UI
dashboard/app.js
Adjust deriveEnv regex/pattern to recognize staging as a valid environment.
Docs: environment naming & CI guidance
README.md, docs/AZURE_OIDC_SETUP.md, docs/CI_CD.md, docs/PRD.md, docs/SECRETS.md, docs/Terraform_Blueprint.md
Replace uat → staging across examples, secrets guidance, PR label docs, and Terraform path references; update PR label guidance to run-staging.
Docs: planning & operational
docs/planning/request_to_token_attribution.md
Mark Phase 2 done and update acceptance criteria, dependencies, and action items to reflect Application Insights / OTEL decisions.
Docs: architecture & guides (new files)
docs/architecture/**, docs/guides/**, docs/architecture/reference/**, docs/architecture/systems/**
Add many new architecture, SLM, reference, strategic, and system guides (multiple new markdown files covering SLM roles, contracts, telemetry, deployment, and operational patterns).
New workflow
/.github/workflows/deploy-environment.yaml
Introduce centralized environment deployment workflow (inputs, secret checks, terraform init/import/plan/apply, runtime diagnostics, AOAI integration test, gateway smoke tests).
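Among the UI changes listed above, the deriveEnv adjustment in dashboard/app.js can be sketched as follows. This is hypothetical: the actual regex and hostname convention are not shown in this summary, only that the pattern now recognizes staging as a valid environment.

```javascript
// Hypothetical deriveEnv: extract the environment token from a hostname,
// now accepting "staging" alongside "dev" and "prod". The boundary
// convention ([.-] separators) is an assumption.
function deriveEnv(hostname) {
  const match = /(?:^|[.-])(dev|staging|prod)(?:[.-]|$)/.exec(hostname);
  return match ? match[1] : "unknown";
}
```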

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as CI (workflow)
    participant ARM as Azure Resource Manager
    participant KV as Key Vault
    participant CA as Container App
    participant App as Application

    rect rgba(200,200,255,0.5)
    CI->>ARM: create azurerm_application_insights
    ARM-->>CI: return connection_string
    CI->>KV: create secret(appinsights_connection_string)
    KV-->>CI: confirm secret
    CI->>ARM: create/update Container App referencing Key Vault secret
    ARM-->>CA: deploy Container App with secret mapping
    end

    rect rgba(200,255,200,0.5)
    CA->>App: container starts with APPLICATIONINSIGHTS_CONNECTION_STRING env
    App->>ARM: send telemetry (OTEL → Application Insights)
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰 I hopped from UAT to staging bright,
Docs and scripts aligned beneath moonlight,
Insights now listen, spans take flight,
Secrets tucked safe; containers hum at night,
A rabbit cheers — telemetry feels right.

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description covers the Summary, Changes, Testing, and Risk sections, but does not follow the required template structure with the Validation, Deployment Notes, and UAT Toggle sections. | Update the description to follow the template format, including Validation checklist items, Deployment Notes, and the updated Staging Toggle section (replacing UAT Toggle with run-staging label guidance). |
| Title check | ✅ Passed | The title accurately captures the two main changes: adding Application Insights for OTEL tracing and renaming uat to staging across the codebase. | — |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
README.md (1)

107-107: ⚠️ Potential issue | 🟡 Minor

Inconsistent reference to "UAT" in documentation link.

Line 107 still references "UAT toggle" while the rest of the PR renames UAT to staging. This appears to be an oversight in the documentation description.

-- [CI/CD Runbook](docs/CI_CD.md) – workflow behavior, UAT toggle, smoke tests
+- [CI/CD Runbook](docs/CI_CD.md) – workflow behavior, staging toggle, smoke tests
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 107, Update the README line that reads "[CI/CD
Runbook](docs/CI_CD.md) – workflow behavior, UAT toggle, smoke tests" to use the
new terminology "staging" instead of "UAT" (e.g., "...workflow behavior, staging
toggle, smoke tests") so it matches the rest of the PR; ensure the phrase
appears exactly in the README and the linked docs if there are other occurrences
of "UAT" to rename consistently.
docs/PRD.md (2)

170-170: ⚠️ Potential issue | 🟡 Minor

Incomplete uat→staging rename in acceptance criteria.

Line 170 still references "Dev/UAT/Prod" while the rest of the document has been updated to use "staging". This creates an inconsistency.

📝 Proposed fix
-3.  Dev/UAT/Prod are reproducible via Terraform + Actions.
+3.  Dev/Staging/Prod are reproducible via Terraform + Actions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/PRD.md` at line 170, Replace the remaining "Dev/UAT/Prod" phrasing in
the acceptance criteria with the consistent "Dev/staging/Prod" term
(specifically update the item containing "Dev/UAT/Prod" to "Dev/staging/Prod");
also scan the PRD.md for any other occurrences of "UAT" or "UAT→staging" and
update them to "staging" so the document uses a single environment name
consistently.

183-183: ⚠️ Potential issue | 🟡 Minor

Incomplete uat→staging rename in milestones.

Line 183 still references "UAT" in the milestone description while the rest of the document uses "staging".

📝 Proposed fix
-- M2: UAT + Prod; environment approvals.
+- M2: Staging + Prod; environment approvals.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/PRD.md` at line 183, The milestone text "M2: UAT + Prod; environment
approvals." is inconsistent with the rest of the doc that uses "staging"—update
the M2 milestone (look for the "M2" milestone heading/text) to read "M2: staging
+ Prod; environment approvals." and scan for any other lingering "UAT"
references to replace them with "staging" to maintain consistency.
🧹 Nitpick comments (1)
docs/planning/request_to_token_attribution.md (1)

219-221: Minor: Consider tense consistency in dependencies.

Line 221 states "Application Insights added for trace storage" using past tense, but this is the PR adding Application Insights according to the objectives. While this might be intentional (documenting the post-merge state), consider using present/future tense for clarity during review: "Application Insights being added for trace storage" or "Requires Application Insights for trace storage".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/planning/request_to_token_attribution.md` around lines 219 - 221, Change
the tense for the infra dependency line "infra: Application Insights added for
trace storage" to present/future tense to reflect that this PR is adding it; for
example update the list item to "infra: Application Insights being added for
trace storage" or "infra: Requires Application Insights for trace storage" so
the dependency wording is consistent with the other lines like "cognitive-mesh:
Must pass correlation metadata to gateway" and "pvc-costops-analytics: Must
create KQL queries for new event shape".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/deploy.yaml:
- Line 38: The comment on the deploy workflow is inconsistent: update the
comment that currently reads "Runtime UAT toggle: add PR label 'run-staging' to
enable UAT on PRs into main." so it references "run-staging" terminology; change
the phrase "Runtime UAT toggle" to language that matches the label (e.g.,
"Runtime staging toggle") so the comment aligns with the `run-staging` label and
the purpose described in the workflow.

In `@docs/planning/request_to_token_attribution.md`:
- Around line 210-215: The acceptance table incorrectly marks "100% include
workflow + stage" and "Support KQL joins by operation_Id/request_id" as ✅ Done
while downstream tasks show dependencies still outstanding; update the table
rows for "100% include workflow + stage" and "Support KQL joins by
operation_Id/request_id" to reflect the real state by either (A) changing the
Status cell to "🔜 Ready" or "⏳ In Progress" until cognitive-mesh begins passing
correlation metadata and pvc-costops-analytics finishes KQL queries, or (B)
rewording the criterion label to "Capability Available: workflow+stage metadata"
/ "Capability Available: KQL join support" to distinguish capability from
feature completion; ensure you reference the dependency names cognitive-mesh and
pvc-costops-analytics in the Notes column so the relationship is clear.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb682266-666a-4fa8-8634-47a25a4eacc6

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb9fe9 and 4f315b2.

📒 Files selected for processing (27)
  • .github/actions/import-container-app/action.yml
  • .github/pull_request_template.md
  • .github/workflows/deploy.yaml
  • README.md
  • dashboard/app.js
  • docs/AZURE_OIDC_SETUP.md
  • docs/CI_CD.md
  • docs/PRD.md
  • docs/SECRETS.md
  • docs/Terraform_Blueprint.md
  • docs/planning/request_to_token_attribution.md
  • infra/env/dev/variables.tf
  • infra/env/prod/variables.tf
  • infra/env/staging/.terraform.lock.hcl
  • infra/env/staging/main.tf
  • infra/env/staging/terraform.tfvars
  • infra/env/staging/variables.tf
  • infra/modules/aigateway_aca/main.tf
  • infra/modules/aigateway_aca/outputs.tf
  • infra/modules/aigateway_aca/variables.tf
  • infra/modules/dashboard_aca/variables.tf
  • infra/modules/state_service_aca/variables.tf
  • infra/scripts/terraform-init.ps1
  • infra/scripts/terraform-init.sh
  • scripts/add-federated-credentials.sh
  • scripts/bootstrap.ps1
  • scripts/bootstrap.sh

- deploy.yaml: fix 'Runtime UAT toggle' comment to 'Runtime staging toggle'
- request_to_token_attribution.md: update acceptance criteria status to
  'Ready' with dependency notes, fix tense on infra dependency
- PRD.md: replace Dev/UAT/Prod with Dev/staging/Prod, fix M2 milestone
- README.md: fix 'UAT toggle' to 'staging toggle'

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
.github/workflows/deploy.yaml (1)

327-338: Consider extracting deploy job boilerplate into a reusable workflow/composite action.

Adding deploy-staging increases duplication with deploy-dev/deploy-prod and raises drift risk for future edits (quickcheck, login, init/plan/apply, smoke tests).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy.yaml around lines 327 - 338, The deploy-staging job
duplicates logic present in deploy-dev/deploy-prod; extract the shared steps
(quickcheck, login, init/plan/apply, smoke tests) into a reusable workflow or
composite action and call it from the deploy-staging, deploy-dev, and
deploy-prod jobs; create a new reusable workflow/composite (e.g.,
deploy-workflow or deploy-composite) that accepts inputs like TF_VAR_env and PR
label conditions, move the common run steps there, and update the deploy-staging
job (and the other deploy-* jobs) to invoke that reusable workflow via uses: or
workflow_call so only environment-specific conditionals (name, environment,
env.TF_VAR_env, if condition) remain in each job.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/planning/request_to_token_attribution.md`:
- Around line 233-235: The doc uses two conflicting phrases — "pass correlation
metadata in request body" and "must pass correlation headers" — causing
ambiguity; choose one canonical requirement (preferably "pass correlation
metadata in request body") and update every occurrence of the other phrase so
the integration requirement for cognitive-mesh is consistent, including the
pending item line that currently lists "Pass correlation metadata in request
body" and the earlier sentence that currently reads "must pass correlation
headers", plus any related sections or examples that describe the required
integration mode (search for those exact strings to find and replace).
- Line 221: Replace the claim "Application Insights being added for trace
storage" with the clarified sentence: "Application Insights resource +
connection string wiring added; actual trace export requires custom LiteLLM
image deployment or explicit OTLP collector configuration." Also ensure the note
references the existing symbols/vars: APPLICATIONINSIGHTS_CONNECTION_STRING, the
comment about the custom LiteLLM image including azure-monitor-opentelemetry,
and the default empty OTEL_EXPORTER_OTLP_ENDPOINT so readers see that trace
export is not yet operational without those pieces.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 615a4d0d-704f-4f4e-a5d2-153645401c33

📥 Commits

Reviewing files that changed from the base of the PR and between 4f315b2 and eb07474.

📒 Files selected for processing (4)
  • .github/workflows/deploy.yaml
  • README.md
  • docs/PRD.md
  • docs/planning/request_to_token_attribution.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/PRD.md

- docs/architecture/README.md: SLM characteristics, patterns, advantages, limitations
- docs/guides/README.md: Practical guidance on when to use SLM vs LLM, implementation patterns

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
docs/guides/README.md (1)

68-68: Consider minor rewording for style.

The phrase "In many cases" is commonly used. Consider alternatives like "Often" or "Frequently" for variety.

✏️ Optional style improvement
-In many cases an SLM + RAG outperforms a large model with no context.
+Often an SLM + RAG outperforms a large model with no context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/README.md` at line 68, Replace the sentence "In many cases an SLM
+ RAG outperforms a large model with no context." with a tighter phrasing using
an alternative to "In many cases" (e.g., "Often, an SLM + RAG outperforms a
large model with no context." or "Frequently, an SLM + RAG outperforms a large
model with no context."), preserving the "SLM + RAG" wording and punctuation;
ensure spacing and comma after the adverb are correct and that the rest of the
sentence remains unchanged.
docs/architecture/README.md (1)

30-75: Add language specifiers to fenced code blocks for linting compliance.

The ASCII diagrams in fenced code blocks should specify a language (or 'text') to satisfy markdown linting rules.

🔧 Proposed fix for linting compliance

For the Cascade Architecture diagram (lines 30-38):

-```
+```text
 SLM
  ↓ confidence high
 Return result

Apply the same pattern to other ASCII diagrams at:

  • Lines 44-50 (Router + Specialists)
  • Lines 56-61 (Local-First AI)
  • Lines 67-75 (Typical Modern AI Stack)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/README.md` around lines 30 - 75, The fenced ASCII diagrams
(the blocks containing "SLM" / "↓ confidence high" / "Return result", the
"Router (SLM)" block with "Code model" etc., the "Device" block with "SLM" and
"embeddings", and the "User Request" -> "Router (SLM)" diagram) need explicit
language specifiers for markdown linting; update each triple-backtick fence that
currently starts with ``` to use ```text (e.g., change ``` to ```text) for the
Cascade Architecture ("SLM"/"↓ confidence high"), Router + Specialists ("Router
(SLM)"), Local-First AI ("Device"), and Typical Modern AI Stack ("User
Request"/"Router (SLM)") blocks so linters accept the files.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b54ca6a9-a1cb-453c-9cec-f60b1de83b9d

📥 Commits

Reviewing files that changed from the base of the PR and between eb07474 and 9d5b99b.

📒 Files selected for processing (2)
  • docs/architecture/README.md
  • docs/guides/README.md

…nt plan

- docs/architecture/: Add README, ai-gateway, cognitive-mesh,
  phoenix-rooivalk, codeflow-engine, agentkit-forge, cross-system,
  and slm-management-plan
- docs/guides/: Update index
- Remove empty infra/env/uat directory
- ai-gateway.md: SLM as admission control and routing brain
- cognitive-mesh.md: SLM as router, decompressor, state compressor
- codeflow-engine.md: SLM as CI/CD triage
- agentkit-forge.md: SLM as tool selector and execution governor
- phoenix-rooivalk.md: SLM for reports only, NOT safety-critical
- cross-system.md: Combined architecture with tiered model strategy
- slm-management-plan.md: Updated with canonical principles
- guides/README.md: Updated with practical patterns
…itecture documentation.

This PR makes significant improvements to the architecture documentation by:

1. Creating a better organizational structure with separate folders for:
   - `systems/` - Individual system documentation
   - `reference/` - Detailed implementation matrices and technical specifications
   - `reference/strategic/` - Strategic guidance documents

2. Adding comprehensive documentation about SLM (Small Language Model) implementation across all systems:
   - Detailed implementation matrices for each system
   - Confidence threshold guidelines
   - Telemetry schemas
   - Contract shapes and API endpoints

3. Adding new architectural diagrams:
   - Deployment architecture
   - Trust boundaries
   - Observability architecture
   - Cross-system integration patterns

4. Adding documentation for a new system called "Mystira" (interactive story generation system)

5. Providing strategic guidance on SLM implementation order and best practices

The PR significantly improves the documentation by providing practical implementation details while maintaining a consistent architectural vision across all systems. The reorganization makes the documentation more navigable and the new reference materials provide clear implementation guidance.

This PR should be approved as it greatly enhances the architecture documentation with practical implementation details.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (9)
docs/architecture/systems/cognitive-mesh.md (1)

146-146: Consider tightening wording in the tradeoff table.

Line 146 uses “Very large token savings”; “Large token savings” reads cleaner and more precise for architecture docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/cognitive-mesh.md` at line 146, Update the tradeoff
table entry that currently reads "Very large token savings" to "Large token
savings" to tighten wording; locate the table row containing the phrase "Very
large token savings | Decomposition quality can bottleneck workflow" in
cognitive-mesh.md and replace the left-side cell text only, leaving the rest of
the row unchanged.
docs/architecture/systems/ai-gateway.md (2)

7-24: Add language specifier to fenced code block.

The ASCII architecture diagram should have a language specifier. Consider using text or plaintext to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
 Client Request
       │
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/ai-gateway.md` around lines 7 - 24, The fenced
ASCII diagram block in ai-gateway.md is missing a language specifier; update the
opening triple-backtick that precedes the diagram (the block containing "Client
Request" and "SLM Control Layer") to include a language tag such as text or
plaintext (e.g., change ``` to ```text) so the diagram satisfies markdown
linting rules and is treated as plain text.

1-146: Consider documenting v1 API endpoint routing.

Based on learnings, the gateway must route /v1/responses and /v1/embeddings endpoints to Azure OpenAI using LiteLLM. The current documentation focuses on SLM admission control but doesn't mention these specific API endpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/ai-gateway.md` around lines 1 - 146, The docs miss
explicit v1 API routing rules; update the architecture doc to state that
incoming /v1/responses and /v1/embeddings requests must be routed to Azure
OpenAI via the LiteLLM provider. Edit the "Routing Logic" section (update
gateway_admission and/or add a short route_to_provider example) to show mapping
of those endpoints to Azure OpenAI/LiteLLM, add a note in the "Implementation
Checklist" to configure LiteLLM Azure provider for those endpoints, and include
a short example/rule describing any differences in handling responses vs
embeddings (e.g., batching, model selection).
docs/architecture/systems/phoenix-rooivalk.md (1)

7-27: Add language specifier to fenced code block.

The ASCII architecture diagram should have a language specifier such as text to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
 Sensors
   │
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/phoenix-rooivalk.md` around lines 7 - 27, The
fenced ASCII diagram currently lacks a language specifier; update the Markdown
fenced code block that surrounds the ASCII architecture diagram (the block
starting with ``` and containing "Sensors", "Rules + Signal Models + Fusion",
"SLM Interpretation Layer", etc.) to include a language tag (e.g., add `text`
after the opening backticks as ```text) so markdown linting rules are satisfied.
docs/architecture/reference/slm-implementation-matrix.md (2)

18-27: Add language specifier to fenced code block.

The directory structure should have a language specifier. Consider using text or plaintext.

📝 Proposed fix
-```
+```text
 reference/
 ├── slm-implementation-matrix.md      # This file
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/reference/slm-implementation-matrix.md` around lines 18 -
27, The fenced code block in slm-implementation-matrix.md lacks a language
specifier which tools rely on for proper rendering; update the opening fence for
the directory tree in the file (the block starting with ```) to include a
language label such as text or plaintext so it reads e.g. ```text, ensuring the
directory structure snippet renders correctly.

254-260: Add language specifier to fenced code block.

The fallback pattern list should have a language specifier. Consider using text or plaintext.

📝 Proposed fix
-```
+```text
 1. SLM timeout → Deterministic rules
 2. Low confidence → LLM escalation
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/reference/slm-implementation-matrix.md` around lines 254 -
260, The fenced code block containing the fallback pattern list should include a
language specifier so syntax highlighters render it as plain text; update the
triple-backtick fence around the list (the block starting with "1. SLM timeout →
Deterministic rules") to use a language like "text" or "plaintext" (e.g.,
```text) and keep the content unchanged.
docs/architecture/04-observability-telemetry.md (2)

37-37: Clarify Application Insights vs Azure Monitor terminology.

The documentation uses "Azure Monitor" generically, but the actual infrastructure provisions an Azure Application Insights resource and exposes APPLICATIONINSIGHTS_CONNECTION_STRING as the environment variable. Based on learnings from infra/modules/aigateway_aca/main.tf:336-345, the OTEL exporter specifically targets Application Insights.

Consider being more specific: "Application Insights" instead of "Azure Monitor" for clarity, or add a note explaining that Application Insights is the Azure Monitor implementation used.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/04-observability-telemetry.md` at line 37, The doc uses
"Azure Monitor" generically (see node label I2[Azure Monitor]); update wording
to specifically say "Application Insights" where the deployed resource and OTEL
exporter target Application Insights, or add a clarifying note that Application
Insights is the Azure Monitor implementation used and that the OTEL exporter
uses the APPLICATIONINSIGHTS_CONNECTION_STRING env var; ensure references to the
OTEL exporter and the environment variable
(APPLICATIONINSIGHTS_CONNECTION_STRING) and the infrastructure intent (OTEL ->
Application Insights) are explicit.
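For context, APPLICATIONINSIGHTS_CONNECTION_STRING uses Azure's `Key=Value;Key=Value` format. A small illustrative parser (not from the repo) shows the shape an exporter reads:

```python
def parse_connection_string(value: str) -> dict[str, str]:
    """Split an Azure 'Key=Value;Key=Value' connection string, the format
    carried by APPLICATIONINSIGHTS_CONNECTION_STRING."""
    return dict(part.split("=", 1) for part in value.split(";") if part)
```

Splitting on the first `=` only keeps endpoint URLs (which contain `//`) intact.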

1-94: Consider documenting retention policies.

The Application Insights resource has environment-specific retention (90 days for prod, 30 days for others) as shown in infra/modules/aigateway_aca/main.tf:239-247. This is an operational consideration that could benefit observability operators. Based on learnings from infra/modules/aigateway_aca/main.tf:239-247.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/04-observability-telemetry.md` around lines 1 - 94, Add a
short "Retention policies" subsection under "Telemetry Architecture" documenting
the Application Insights retention defaults used by the infra: production
retention 90 days and non-production retention 30 days, note that these are
environment-specific and where operators should look to change them (the
Application Insights resource configuration), and recommend including this
retention info in operational runbooks so observability operators can align
expectations and costs; reference the "Application Insights" resource and the
"Telemetry Architecture" section to locate where to add the note.
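The retention policy described above reduces to a one-liner. This is illustrative only; the authoritative values live in the Terraform module (infra/modules/aigateway_aca/main.tf):

```python
def retention_days(environment: str) -> int:
    # Policy per the review finding: 90 days for prod, 30 for all others.
    return 90 if environment == "prod" else 30
```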
docs/architecture/reference/strategic/07-deployment-model.md (1)

29-34: Consider aligning CodeFlow matrix wording with sibling reference docs.

“PR triage, log analysis” is accurate but less specific than the established terms used elsewhere (e.g., PR classification, CI failure triage, release-note extraction). Tightening terminology here will reduce cross-doc ambiguity.

Proposed wording tweak
-| CodeFlow        | PR triage, log analysis    | Root cause across dependencies |
+| CodeFlow        | PR classification, CI failure triage | Root cause across dependencies |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/reference/strategic/07-deployment-model.md` around lines 29
- 34, Update the CodeFlow "Best SLM Jobs" cell to use the more specific
terminology used across sibling docs: replace the phrase "PR triage, log
analysis" in the CodeFlow table row with targeted terms such as "PR
classification, CI failure triage, release-note extraction" so the matrix aligns
with other references and reduces ambiguity when searching for CodeFlow
capabilities.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/architecture/02-container-architecture.md`:
- Around line 19-20: The diagram declares the node C3[GitHub Webhooks] but never
connects it into the flow; add an edge from C3 to the appropriate downstream
node (e.g., CodeFlow or whichever node represents ingress/CI processing) so the
webhook is part of the flow. Update the same missing connections in the later
block (the repeated section around lines 58-83) so both occurrences of C3 are
connected (for example: C3 --> CodeFlow or C3 --> <IngressNodeName>), ensuring
the diagram reflects actual webhook ingress behavior.

In `@docs/architecture/04-observability-telemetry.md`:
- Around line 35-39: The Ingest subgraph is missing Prometheus as a telemetry
destination; update the diagram block (the Ingest subgraph containing
I1/OpenTelemetry, I2/Azure Monitor, I3/Blob Export) to add a fourth node for
Prometheus (e.g., I4[Prometheus]) and add the flow from the application metrics
endpoint (/metrics) into that node; also update the surrounding text to mention
that LiteLLM enables Prometheus callbacks (success_callback/failure_callback
include "prometheus") and reference the
infra/modules/aigateway_aca/main.tf:95-113 behavior so readers know Prometheus
is a primary telemetry sink alongside OTEL, Azure Monitor and Blob Export.

In `@docs/architecture/reference/matrix-gateway.md`:
- Around line 91-95: Replace the inconsistent confidence threshold text so the
doc uses the same policy values from the gateway SLM guide (0.90 / 0.75): update
the table row currently reading "confidence < 0.70" to the correct boundary
(e.g., "confidence < 0.75"), and ensure all other mentions of confidence-based
routing (the lines referencing escalation to LLM or fallback behavior) use the
0.90 / 0.75 thresholds consistently.
- Around line 42-52: The example response JSON does not match the declared
contract type ClassifyRequestOutput; update the example payload to use the
contract's field names (label and recommended_tier) and remove or rename
intent/recommended_target/recommended_model_tier accordingly so the example
reflects ClassifyRequestOutput's shape; ensure any referenced values (e.g.,
target engine or tier) are mapped into recommended_tier or documented separately
and apply the same fix to the other example block at lines 57-74.

In `@docs/architecture/reference/matrix-rooivalk.md`:
- Around line 110-113: The DEFAULT_THRESHOLDS constant contains
sop_suggestion.direct_suggest set to 0.78 which conflicts with the documented
threshold table stating top-tier behavior is >= 0.80; update the
DEFAULT_THRESHOLDS (symbol DEFAULT_THRESHOLDS) entry for
sop_suggestion.direct_suggest to 0.80 and verify any other related keys
(operator_summary, sop_suggestion.manual_lookup) and the summary table values
are consistent so the code constant and documentation table match.
- Around line 27-44: The fenced safety ASCII-art block in matrix-rooivalk.md is
unlabeled, triggering MD040; update the block fence to include a language hint
(e.g., add "text" after the opening ```) so the safety boundary block becomes a
labeled fenced code block and stops triggering the lint rule while preserving
the ASCII-art content.

In `@docs/architecture/reference/slm-management-plan.md`:
- Around line 43-56: The Markdown file contains unlabeled fenced code blocks
(e.g., the "Cost Control Layers" ASCII-art block, the "Request" ASCII-art block,
and the "Discovery → Testing → Staging → Production → Deprecated → Retired"
block) which trigger MD040; update each fenced block by adding a language label
(use "text") to the opening triple-backticks so the blocks become fenced with
```text to satisfy markdown linting and ensure consistent rendering.
- Around line 266-274: The checklist contains a duplicate/completed item "Add
explicit safety boundary for PhoenixRooivalk" in the tail checklist; remove that
line or change it to a follow-up task (e.g., "Verify existing PhoenixRooivalk
safety boundary and close gaps") so it does not appear as an unchecked action;
check for and reconcile with the existing safety-boundary entries already
documented in the earlier "safety boundary" section and the other mention so the
checklist reflects only outstanding work.

In `@docs/architecture/systems/agentkit-forge.md`:
- Around line 7-24: The code block in the architecture diagram is missing a
fence language which triggers markdownlint MD040; update the opening
triple-backtick for the ASCII diagram to include a language (e.g., change ``` to
```text or ```mermaid if you convert to mermaid) so the fenced block around the
diagram (the block that begins with the three backticks and contains "Agent Task
│ ▼ ... └─→ LLM Synthesis") specifies a language and satisfies the linter.

In `@docs/architecture/systems/codeflow-engine.md`:
- Around line 7-24: The fenced architecture diagram block is missing a language
tag; update the triple-backtick fence that opens the diagram (the ASCII flow
block shown under "Git Push / PR Event") to use the language identifier `text`
(i.e., change ``` to ```text) so markdownlint MD040 is satisfied while
preserving the diagram content.
- Around line 94-100: The f-string assigned to variable prompt embeds literal
JSON braces which must be escaped; in the prompt construction (variable prompt
using change_type and impacted_files) double the outer curly braces around the
JSON-like object (e.g., replace { "run_unit": ... } with {{ "run_unit": ... }})
so Python treats them as literal braces in the f-string rather than expression
delimiters.

In `@docs/architecture/systems/cognitive-mesh.md`:
- Around line 7-29: The fenced ASCII diagram block starting with ``` should
include a language identifier to satisfy markdown lint MD040; edit the block
that begins with the triple backticks before "User Query" and change it to start
with ```text so the diagram (the SLM Control Fabric / Routing Decision / Code
Agent / Research Agent section) is fenced as ```text.

---

Nitpick comments:
In `@docs/architecture/04-observability-telemetry.md`:
- Line 37: The doc uses "Azure Monitor" generically (see node label I2[Azure
Monitor]); update wording to specifically say "Application Insights" where the
deployed resource and OTEL exporter target Application Insights, or add a
clarifying note that Application Insights is the Azure Monitor implementation
used and that the OTEL exporter uses the APPLICATIONINSIGHTS_CONNECTION_STRING
env var; ensure references to the OTEL exporter and the environment variable
(APPLICATIONINSIGHTS_CONNECTION_STRING) and the infrastructure intent (OTEL ->
Application Insights) are explicit.
- Around line 1-94: Add a short "Retention policies" subsection under "Telemetry
Architecture" documenting the Application Insights retention defaults used by
the infra: production retention 90 days and non-production retention 30 days,
note that these are environment-specific and where operators should look to
change them (the Application Insights resource configuration), and recommend
including this retention info in operational runbooks so observability operators
can align expectations and costs; reference the "Application Insights" resource
and the "Telemetry Architecture" section to locate where to add the note.

In `@docs/architecture/reference/slm-implementation-matrix.md`:
- Around line 18-27: The fenced code block in slm-implementation-matrix.md lacks
a language specifier which tools rely on for proper rendering; update the
opening fence for the directory tree in the file (the block starting with ```)
to include a language label such as text or plaintext so it reads e.g. ```text,
ensuring the directory structure snippet renders correctly.
- Around line 254-260: The fenced code block containing the fallback pattern
list should include a language specifier so syntax highlighters render it as
plain text; update the triple-backtick fence around the list (the block starting
with "1. SLM timeout → Deterministic rules") to use a language like "text" or
"plaintext" (e.g., ```text) and keep the content unchanged.

In `@docs/architecture/reference/strategic/07-deployment-model.md`:
- Around line 29-34: Update the CodeFlow "Best SLM Jobs" cell to use the more
specific terminology used across sibling docs: replace the phrase "PR triage,
log analysis" in the CodeFlow table row with targeted terms such as "PR
classification, CI failure triage, release-note extraction" so the matrix aligns
with other references and reduces ambiguity when searching for CodeFlow
capabilities.

In `@docs/architecture/systems/ai-gateway.md`:
- Around line 7-24: The fenced ASCII diagram block in ai-gateway.md is missing a
language specifier; update the opening triple-backtick that precedes the diagram
(the block containing "Client Request" and "SLM Control Layer") to include a
language tag such as text or plaintext (e.g., change ``` to ```text) so the
diagram satisfies markdown linting rules and is treated as plain text.
- Around line 1-146: The docs miss explicit v1 API routing rules; update the
architecture doc to state that incoming /v1/responses and /v1/embeddings
requests must be routed to Azure OpenAI via the LiteLLM provider. Edit the
"Routing Logic" section (update gateway_admission and/or add a short
route_to_provider example) to show mapping of those endpoints to Azure
OpenAI/LiteLLM, add a note in the "Implementation Checklist" to configure
LiteLLM Azure provider for those endpoints, and include a short example/rule
describing any differences in handling responses vs embeddings (e.g., batching,
model selection).

In `@docs/architecture/systems/cognitive-mesh.md`:
- Line 146: Update the tradeoff table entry that currently reads "Very large
token savings" to "Large token savings" to tighten wording; locate the table row
containing the phrase "Very large token savings | Decomposition quality can
bottleneck workflow" in cognitive-mesh.md and replace the left-side cell text
only, leaving the rest of the row unchanged.

In `@docs/architecture/systems/phoenix-rooivalk.md`:
- Around line 7-27: The fenced ASCII diagram currently lacks a language
specifier; update the Markdown fenced code block that surrounds the ASCII
architecture diagram (the block starting with ``` and containing "Sensors",
"Rules + Signal Models + Fusion", "SLM Interpretation Layer", etc.) to include a
language tag (e.g., add `text` after the opening backticks as ```text) so
markdown linting rules are satisfied.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e78eaef1-6814-4002-94c9-403bcb72c4a8

📥 Commits

Reviewing files that changed from the base of the PR and between 9d5b99b and c7d997a.

📒 Files selected for processing (38)
  • docs/architecture/01-system-context.md
  • docs/architecture/02-container-architecture.md
  • docs/architecture/03-deployment-trust-boundaries.md
  • docs/architecture/04-observability-telemetry.md
  • docs/architecture/05-slm-llm-decision-flow.md
  • docs/architecture/06-shared-contracts.md
  • docs/architecture/07-repo-ownership-map.md
  • docs/architecture/README.md
  • docs/architecture/reference/c4-architecture.md
  • docs/architecture/reference/contracts.md
  • docs/architecture/reference/cross-system.md
  • docs/architecture/reference/dashboards.md
  • docs/architecture/reference/deployment-observability.md
  • docs/architecture/reference/matrix-agentkit.md
  • docs/architecture/reference/matrix-codeflow.md
  • docs/architecture/reference/matrix-cognitive-mesh.md
  • docs/architecture/reference/matrix-gateway.md
  • docs/architecture/reference/matrix-mystira.md
  • docs/architecture/reference/matrix-rooivalk.md
  • docs/architecture/reference/operations-patterns.md
  • docs/architecture/reference/slm-implementation-matrix.md
  • docs/architecture/reference/slm-management-plan.md
  • docs/architecture/reference/strategic/01-why-slms-matter.md
  • docs/architecture/reference/strategic/02-gateway-slm-use-cases.md
  • docs/architecture/reference/strategic/03-cognitive-mesh-use-cases.md
  • docs/architecture/reference/strategic/04-codeflow-use-cases.md
  • docs/architecture/reference/strategic/05-agentkit-use-cases.md
  • docs/architecture/reference/strategic/06-rooivalk-use-cases.md
  • docs/architecture/reference/strategic/07-deployment-model.md
  • docs/architecture/reference/strategic/08-implementation-order.md
  • docs/architecture/reference/strategic/README.md
  • docs/architecture/systems/agentkit-forge.md
  • docs/architecture/systems/ai-gateway.md
  • docs/architecture/systems/codeflow-engine.md
  • docs/architecture/systems/cognitive-mesh.md
  • docs/architecture/systems/mystira.md
  • docs/architecture/systems/phoenix-rooivalk.md
  • docs/guides/README.md
✅ Files skipped from review due to trivial changes (13)
  • docs/architecture/06-shared-contracts.md
  • docs/architecture/05-slm-llm-decision-flow.md
  • docs/architecture/07-repo-ownership-map.md
  • docs/architecture/reference/matrix-mystira.md
  • docs/architecture/reference/strategic/03-cognitive-mesh-use-cases.md
  • docs/architecture/reference/strategic/01-why-slms-matter.md
  • docs/architecture/reference/matrix-agentkit.md
  • docs/architecture/reference/contracts.md
  • docs/architecture/reference/matrix-codeflow.md
  • docs/architecture/reference/strategic/08-implementation-order.md
  • docs/architecture/reference/cross-system.md
  • docs/architecture/reference/operations-patterns.md
  • docs/architecture/reference/matrix-cognitive-mesh.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/guides/README.md
  • docs/architecture/README.md

Comment on lines +19 to +20
C3[GitHub Webhooks]
C4[Operator Console]

⚠️ Potential issue | 🟡 Minor

Connect GitHub Webhooks in the container flow.

Line 19 defines C3[GitHub Webhooks], but it is never connected downstream. Please add the expected edge (likely into CodeFlow) so the diagram reflects actual ingress behavior.

Suggested doc fix
     C1 --> G1
     C2 --> G1
     C4 --> G1
+    C3 --> CF1

Also applies to: 58-83

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/02-container-architecture.md` around lines 19 - 20, The
diagram declares the node C3[GitHub Webhooks] but never connects it into the
flow; add an edge from C3 to the appropriate downstream node (e.g., CodeFlow or
whichever node represents ingress/CI processing) so the webhook is part of the
flow. Update the same missing connections in the later block (the repeated
section around lines 58-83) so both occurrences of C3 are connected (for
example: C3 --> CodeFlow or C3 --> <IngressNodeName>), ensuring the diagram
reflects actual webhook ingress behavior.

Comment on lines +35 to +39
subgraph Ingest
I1[OpenTelemetry]
I2[Azure Monitor]
I3[Blob Export]
end

⚠️ Potential issue | 🟡 Minor

Document Prometheus metrics endpoint.

The actual LiteLLM configuration enables Prometheus callbacks (success_callback and failure_callback include "prometheus"), exposing metrics at /metrics. This is not reflected in the Ingest layer of the telemetry architecture diagram. Based on learnings from infra/modules/aigateway_aca/main.tf:95-113, Prometheus is a primary telemetry destination alongside OTEL.

📊 Suggested addition to the Ingest section

Consider adding Prometheus as a fourth ingest endpoint:

     subgraph Ingest
         I1[OpenTelemetry]
         I2[Azure Monitor]
         I3[Blob Export]
+        I4[Prometheus]
     end

And add the corresponding flow:

+    S5 --> I4
+    I4 --> A1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/04-observability-telemetry.md` around lines 35 - 39, The
Ingest subgraph is missing Prometheus as a telemetry destination; update the
diagram block (the Ingest subgraph containing I1/OpenTelemetry, I2/Azure
Monitor, I3/Blob Export) to add a fourth node for Prometheus (e.g.,
I4[Prometheus]) and add the flow from the application metrics endpoint
(/metrics) into that node; also update the surrounding text to mention that
LiteLLM enables Prometheus callbacks (success_callback/failure_callback include
"prometheus") and reference the infra/modules/aigateway_aca/main.tf:95-113
behavior so readers know Prometheus is a primary telemetry sink alongside OTEL,
Azure Monitor and Blob Export.
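A minimal sketch of the configuration shape being described; key names mirror the LiteLLM callback settings mentioned above, while the exact list contents and the helper function are assumptions, not repo code:

```python
# Illustrative shape only; the authoritative settings live in
# infra/modules/aigateway_aca/main.tf (lines 95-113 per the review).
litellm_settings = {
    "success_callback": ["prometheus", "otel"],
    "failure_callback": ["prometheus", "otel"],
}

def prometheus_enabled(settings: dict) -> bool:
    """True when both callback lists include the "prometheus" sink."""
    return all(
        "prometheus" in settings.get(key, [])
        for key in ("success_callback", "failure_callback")
    )
```

A check like this makes the "Prometheus is a primary telemetry sink" claim verifiable against the deployed config.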

Comment on lines +42 to +52
```json
{
"intent": "code_review",
"complexity": "medium",
"tool_candidate": true,
"recommended_target": "codeflow-engine",
"recommended_model_tier": "small",
"escalation_required": false,
"confidence": 0.93
}
```

⚠️ Potential issue | 🟠 Major

Align example payload with declared contract shape.

The response example uses intent, recommended_target, and recommended_model_tier, but ClassifyRequestOutput defines label and recommended_tier. This mismatch makes the contract ambiguous for implementers.

Suggested doc fix (example aligned to contract)
 {
-  "intent": "code_review",
+  "label": "code_review",
   "complexity": "medium",
   "tool_candidate": true,
-  "recommended_target": "codeflow-engine",
-  "recommended_model_tier": "small",
-  "escalation_required": false,
+  "recommended_tier": "small",
+  "cacheable": true,
   "confidence": 0.93
 }

Also applies to: 57-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/reference/matrix-gateway.md` around lines 42 - 52, The
example response JSON does not match the declared contract type
ClassifyRequestOutput; update the example payload to use the contract's field
names (label and recommended_tier) and remove or rename
intent/recommended_target/recommended_model_tier accordingly so the example
reflects ClassifyRequestOutput's shape; ensure any referenced values (e.g.,
target engine or tier) are mapped into recommended_tier or documented separately
and apply the same fix to the other example block at lines 57-74.
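An illustrative typing of the corrected contract shape; only the field names come from ClassifyRequestOutput as described above, and the value types are assumptions:

```python
from typing import TypedDict

class ClassifyRequestOutput(TypedDict):
    """Field names follow the corrected example; types are assumptions."""
    label: str
    complexity: str
    tool_candidate: bool
    recommended_tier: str
    cacheable: bool
    confidence: float

example: ClassifyRequestOutput = {
    "label": "code_review",
    "complexity": "medium",
    "tool_candidate": True,
    "recommended_tier": "small",
    "cacheable": True,
    "confidence": 0.93,
}
```

Keeping the doc example in lockstep with a typed contract like this makes drift mechanically checkable.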

Comment on lines +91 to +95
| Condition | Action |
| -------------------------------- | ---------------------- |
| `policy-screen.allowed == false` | Block or redact |
| `confidence < 0.70` | Escalate to LLM |
| Tool suggested but no mapping | Send to general LLM |

⚠️ Potential issue | 🟠 Major

Use one confidence threshold policy across fallback/config/table.

Line 94 says escalate on confidence < 0.70, while Lines 102 and 109-111 imply the boundary is < 0.75. Please unify this to one policy value to avoid conflicting routing behavior.

Based on learnings from docs/architecture/reference/strategic/02-gateway-slm-use-cases.md:84-90, the threshold guide uses 0.90 / 0.75.

Also applies to: 102-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/reference/matrix-gateway.md` around lines 91 - 95, Replace
the inconsistent confidence threshold text so the doc uses the same policy
values from the gateway SLM guide (0.90 / 0.75): update the table row currently
reading "confidence < 0.70" to the correct boundary (e.g., "confidence < 0.75"),
and ensure all other mentions of confidence-based routing (the lines referencing
escalation to LLM or fallback behavior) use the 0.90 / 0.75 thresholds
consistently.
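A hedged sketch of the unified policy; the tier names are hypothetical, and only the 0.90 / 0.75 boundaries come from the threshold guide cited above:

```python
def route_by_confidence(confidence: float) -> str:
    # 0.90 / 0.75 boundaries per the gateway SLM threshold guide;
    # tier names are illustrative, not repo identifiers.
    if confidence >= 0.90:
        return "slm-direct"
    if confidence >= 0.75:
        return "slm-verify"
    return "llm-escalate"
```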

Comment on lines +27 to +44
```
┌─────────────────────────────────────────────────────────┐
│ IMPORTANT - SAFETY BOUNDARY │
├─────────────────────────────────────────────────────────┤
│ Hard Decision Layer must NOT depend on free-form SLM │
│ │
│ SLM output is for OBSERVATION and REPORTING only: │
│ • Operator summaries │
│ • SOP suggestions (non-binding) │
│ • Mission log condensation │
│ │
│ SLM must NEVER be used for: │
│ • Autonomous threat response │
│ • Access control decisions │
│ • Resource isolation actions │
│ • Any kinetic or hard control actions │
└─────────────────────────────────────────────────────────┘
```

⚠️ Potential issue | 🟡 Minor

Add a language hint to the fenced safety block.

Line 27 uses an unlabeled fenced block, which triggers MD040 and can break consistent markdown tooling behavior.

Proposed fix
-```
+```text
 ┌─────────────────────────────────────────────────────────┐
 │                   IMPORTANT - SAFETY BOUNDARY            │
 ...
 └─────────────────────────────────────────────────────────┘
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 27-27: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/reference/matrix-rooivalk.md` around lines 27 - 44, The
fenced safety ASCII-art block in matrix-rooivalk.md is unlabeled, triggering
MD040; update the block fence to include a language hint (e.g., add "text" after
the opening ```) so the safety boundary block becomes a labeled fenced code
block and stops triggering the lint rule while preserving the ASCII-art content.

Comment on lines +266 to +274
1. [ ] Establish model registry with tiered selection
2. [ ] Implement cost tracking per project
3. [ ] Set up latency monitoring dashboards
4. [ ] Create edge deployment pipeline
5. [ ] Build security check pipeline
6. [ ] Define fallback hierarchies
7. [ ] Implement observability stack
8. [ ] Document model lifecycle process
9. [ ] **Add explicit safety boundary for PhoenixRooivalk**

⚠️ Potential issue | 🟡 Minor

Remove or reword the completed safety-boundary action item.

Line 274 says “Add explicit safety boundary for PhoenixRooivalk,” but this is already present earlier (Lines 94-107 and Line 242). Keeping it unchecked creates planning drift.

Proposed fix
-9. [ ] **Add explicit safety boundary for PhoenixRooivalk**
+9. [x] **Add explicit safety boundary for PhoenixRooivalk**
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/reference/slm-management-plan.md` around lines 266 - 274,
The checklist contains a duplicate/completed item "Add explicit safety boundary
for PhoenixRooivalk" in the tail checklist; remove that line or change it to a
follow-up task (e.g., "Verify existing PhoenixRooivalk safety boundary and close
gaps") so it does not appear as an unchecked action; check for and reconcile
with the existing safety-boundary entries already documented in the earlier
"safety boundary" section and the other mention so the checklist reflects only
outstanding work.

Comment on lines +7 to +24
```
Agent Task
┌─────────────────────────────────────┐
│ SLM Execution Governor │
│ (tool selection, memory, budget) │
└─────────────────────────────────────┘
Tool Selection
├─→ GitHub API
├─→ Azure CLI
├─→ Terraform
├─→ Documentation Search
└─→ LLM Synthesis
```

⚠️ Potential issue | 🟡 Minor

Add a fence language to satisfy markdownlint MD040.

The fenced block at Line 7 has no language. Use text (or mermaid if converted) to keep lint clean.

Suggested doc fix
-```
+```text
 Agent Task
       │
       ▼
 ...
       └─→ LLM Synthesis
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/agentkit-forge.md` around lines 7 - 24, The code
block in the architecture diagram is missing a fence language which triggers
markdownlint MD040; update the opening triple-backtick for the ASCII diagram to
include a language (e.g., change ``` to ```text or ```mermaid if you convert to
mermaid) so the fenced block around the diagram (the block that begins with the
three backticks and contains "Agent Task │ ▼ ... └─→ LLM Synthesis") specifies a
language and satisfies the linter.

Comment on lines +7 to +24
```
Git Push / PR Event
┌─────────────────────────────────────┐
│ SLM Triage Layer │
│ (classification, risk, pipeline) │
└─────────────────────────────────────┘
CI/CD Decision
├─→ Auto approve
├─→ Run tests (full/minimal/skip)
├─→ Security review
└─→ Escalate to LLM
```


⚠️ Potential issue | 🟡 Minor

Add a language for the fenced architecture block.

The code fence starting at Line 7 is missing a language tag; set it to text to satisfy markdownlint MD040.

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/codeflow-engine.md` around lines 7 - 24, The fenced
architecture diagram block is missing a language tag; update the triple-backtick
fence that opens the diagram (the ASCII flow block shown under "Git Push / PR
Event") to use the language identifier `text` (i.e., change ``` to ```text) so
markdownlint MD040 is satisfied while preserving the diagram content.

Comment on lines +94 to +100
prompt = f"""Select tests for this change:

Type: {change_type}
Files: {', '.join(impacted_files)}

Output: { "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }"""


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify potential unescaped JSON braces inside Python f-strings in architecture docs.
rg -nP --type=md 'f"""[\s\S]*Output:\s*\{[^{}]*\}"""' docs/architecture

Repository: phoenixvc/ai-gateway

Length of output: 46


🏁 Script executed:

cat -n docs/architecture/systems/codeflow-engine.md | sed -n '90,105p'

Repository: phoenixvc/ai-gateway

Length of output: 560


Fix invalid Python f-string literal braces in the prompt example.

Line 99 embeds JSON braces directly in an f-string without escaping, which is invalid Python syntax. The curly braces in { "run_unit": bool, ... } must be doubled to {{ "run_unit": bool, ... }} to be treated as literal characters in the f-string. This code snippet will fail with a SyntaxError when copied into real code.

Suggested fix
-Output: { "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }"""
+Output: {{ "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }}"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/codeflow-engine.md` around lines 94 - 100, The
f-string assigned to variable prompt embeds literal JSON braces which must be
escaped; in the prompt construction (variable prompt using change_type and
impacted_files) double the outer curly braces around the JSON-like object (e.g.,
replace { "run_unit": ... } with {{ "run_unit": ... }}) so Python treats them as
literal braces in the f-string rather than expression delimiters.
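The suggested fix can be checked in isolation. Below is a minimal standalone version of the corrected snippet; the variable values are made up for illustration:

```python
change_type = "refactor"
impacted_files = ["app.py", "tests/test_app.py"]

# Doubled braces {{ }} render as literal { } in the f-string output,
# while single braces around change_type / join(...) stay interpolations.
prompt = f"""Select tests for this change:

Type: {change_type}
Files: {', '.join(impacted_files)}

Output: {{ "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }}"""

print(prompt)
```

With single braces around the JSON object, the same snippet raises a `SyntaxError` (or a `NameError` for `bool`-like expressions, depending on content) instead of producing the literal schema line.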

Comment on lines +7 to +29
```
User Query
      │
      ▼
┌─────────────────────────────────────┐
│         SLM Control Fabric          │
│ (routing, decomposition, compression)│
└─────────────────────────────────────┘
      │
      ▼
Routing Decision
      │
      ├─→ Code Agent
      ├─→ Infra Agent
      ├─→ Security Agent
      └─→ Research Agent
            │
            ▼
      Specialist Work
            │
            ▼
      LLM Synthesis (only when needed)
```

⚠️ Potential issue | 🟡 Minor

Add a language identifier to the architecture fenced block.

Line 7 starts a fenced block without a language tag, which triggers markdown linting (MD040). Use text for this ASCII diagram.

Suggested diff
-```
+```text
 User Query
       │
       ▼
 ┌─────────────────────────────────────┐
 │         SLM Control Fabric          │
 │ (routing, decomposition, compression)│
 └─────────────────────────────────────┘
       │
       ▼
 Routing Decision
       │
       ├─→ Code Agent
       ├─→ Infra Agent
       ├─→ Security Agent
       └─→ Research Agent
             │
             ▼
       Specialist Work
             │
             ▼
       LLM Synthesis (only when needed)
 ```

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture/systems/cognitive-mesh.md` around lines 7 - 29, The fenced
ASCII diagram block is missing a language identifier, which triggers markdownlint
MD040; edit the block that begins with the triple backticks before "User Query"
and change its opening fence to ```text so the diagram (the SLM Control Fabric /
Routing Decision / Code Agent / Research Agent section) is fenced as text.


Refactor deployment process by extracting environment-specific deployment logic into a reusable workflow. This reduces code duplication across dev/staging/prod environments and centralizes deployment configuration parameters.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
docs/planning/request_to_token_attribution.md (1)

125-155: ⚠️ Potential issue | 🟡 Minor

Clarify “required” vs “alternative” integration path.

Line 125 says request-body metadata is required, but the section immediately presents headers as a second method. This still reads as conflicting requirement language.

✏️ Suggested wording tweak
-**Required:** Pass correlation metadata in request body when calling gateway. There are two methods:
+**Required (current implementation):** Pass correlation metadata in request body when calling gateway.
+**Alternative (future enhancement):** HTTP header-based correlation is possible only with additional middleware/wrapper support.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/planning/request_to_token_attribution.md` around lines 125 - 155, Change
the ambiguous "Required:" label to clearly indicate that request-body metadata
is the recommended/default integration (e.g., replace "Required:" with
"Recommended/Preferred") and add one sentence after "Method B: Via HTTP Headers"
stating that headers are an alternative option that requires additional LiteLLM
configuration or middleware; reference the existing headings "Method A: Via
Request Metadata (Recommended)" and "Method B: Via HTTP Headers" so the reader
knows which sections to update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/deploy-environment.yaml:
- Around line 53-66: The reusable workflow currently only sets a subset of env
vars (TF_VAR_azure_openai_*, TF_VAR_gateway_key, TF_VAR_codex_*, etc.), causing
caller-level
secrets/TF_BACKEND_*/AZURE_CLIENT_ID/AZURE_TENANT_ID/AZURE_SUBSCRIPTION_ID,
EXPECTED_AOAI_ENDPOINT_HOST and TF_VAR_state_service_container_image to be empty
at apply time and breaking guards; update the env block to explicitly pass all
required variables from inputs/secrets (including AZURE_CLIENT_ID,
AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, TF_BACKEND_*,
TF_VAR_state_service_container_image, EXPECTED_AOAI_ENDPOINT_HOST and any other
Terraform TF_VAR_* the plan relies on) so plan and apply see the same inputs,
and replace the invalid GitHub Actions function syntax if(inputs.env_name ==
'prod', '3', '1') with the supported ternary-like expression inputs.env_name ==
'prod' && '3' || '1' to restore the Line 211 value evaluation and ensure the
prod AOAI guard and Line 215 smoke gate run as intended.
- Line 211: The models_wait_attempts assignment uses the unsupported if()
expression; update the expression for models_wait_attempts to use GitHub
Actions' case() function to select '3' when inputs.env_name == 'prod' otherwise
'1' (reference the models_wait_attempts key in the deploy-environment workflow);
ensure the case() expression compares inputs.env_name to 'prod' and returns the
correct string values, or alternatively use the safe conditional pattern
(condition && value1 || value2) only if value1 cannot be falsy.

In @.github/workflows/deploy.yaml:
- Around line 154-170: The job-level environment key must be removed from the
reusable-workflow caller jobs (e.g., deploy-dev, deploy-staging, deploy-prod)
because jobs that use `uses:` cannot include `environment:`; update each caller
to only pass the env via the `env_name` input (already present) and then add
`environment: ${{ inputs.env_name }}` inside the called workflow
`deploy-environment.yaml` (use the `inputs.env_name` value in that workflow's
job definition) so the environment selection is handled by the reusable workflow
rather than the callers.

In `@docs/planning/request_to_token_attribution.md`:
- Line 221: The doc and Terraform behavior conflict because a new Terraform
output application_insights_connection_string exposes a secret while the policy
mandates routing secrets through Key Vault and never outputting them; either
remove the Terraform output or mark it as sensitive and change the deployment to
store the connection string in Key Vault (and update any consumers to read from
Key Vault), then update this doc (and the related note at lines ~229) to state
the chosen approach (no outputs for secrets or sensitive output with Key
Vault-backed handling) and reference the application_insights_connection_string
symbol and Key Vault usage so readers know how to retrieve the secret securely.

---

Duplicate comments:
In `@docs/planning/request_to_token_attribution.md`:
- Around line 125-155: Change the ambiguous "Required:" label to clearly
indicate that request-body metadata is the recommended/default integration
(e.g., replace "Required:" with "Recommended/Preferred") and add one sentence
after "Method B: Via HTTP Headers" stating that headers are an alternative
option that requires additional LiteLLM configuration or middleware; reference
the existing headings "Method A: Via Request Metadata (Recommended)" and "Method
B: Via HTTP Headers" so the reader knows which sections to update.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1703970d-036d-4366-b9a3-a25421b2c360

📥 Commits

Reviewing files that changed from the base of the PR and between c7d997a and 4c5e506.

📒 Files selected for processing (3)
  • .github/workflows/deploy-environment.yaml
  • .github/workflows/deploy.yaml
  • docs/planning/request_to_token_attribution.md

Comment on lines +53 to +66
env:
  TF_VAR_env: ${{ inputs.env_name }}
  TF_VAR_projname: "aigateway"
  TF_VAR_location: "southafricanorth"
  TF_VAR_location_short: "san"
  TF_VAR_azure_openai_endpoint: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
  TF_VAR_azure_openai_api_key: ${{ secrets.AZURE_OPENAI_API_KEY }}
  TF_VAR_azure_openai_embedding_endpoint: ${{ secrets.AZURE_OPENAI_EMBEDDING_ENDPOINT }}
  TF_VAR_azure_openai_embedding_api_key: ${{ secrets.AZURE_OPENAI_EMBEDDING_API_KEY }}
  TF_VAR_gateway_key: ${{ secrets.AIGATEWAY_KEY }}
  TF_VAR_codex_model: ${{ inputs.codex_model }}
  TF_VAR_codex_api_version: ${{ inputs.codex_api_version }}
  TF_VAR_embedding_deployment: "text-embedding-3-large"
  TF_VAR_embeddings_api_version: "2024-02-01"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Do reusable GitHub Actions workflows inherit the caller workflow's top-level env: values, or must those values be passed explicitly and rebound inside the called workflow?

💡 Result:

No—reusable (called) workflows do not inherit the caller workflow’s top-level env: values. GitHub explicitly states that “Any environment variables set in an env context defined at the workflow level in the caller workflow are not propagated to the called workflow.” [1]

So if the called workflow needs those values, you must pass them explicitly (typically via on: workflow_call: inputs: / jobs.<job_id>.with:) and then rebind/use them inside the called workflow. [1]


🏁 Script executed:

sed -n '211p' .github/workflows/deploy-environment.yaml

Repository: phoenixvc/ai-gateway

Length of output: 142


Rebuild the full deploy env inside the reusable workflow and fix Line 211 syntax.

Reusable workflows do not inherit the caller's top-level env: values—they must be passed explicitly. This workflow only recreates the AOAI/gateway subset, leaving AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, TF_BACKEND_*, and other Terraform variables empty for Lines 79-137. The result is that apply executes with a different input set than the reviewed plan, TF_VAR_state_service_container_image stays empty so the Line 215 smoke gate never runs, and the prod AOAI guard at Lines 106-113 is inert because EXPECTED_AOAI_ENDPOINT_HOST is never exported. Additionally, Line 211 uses invalid syntax if(inputs.env_name == 'prod', '3', '1') — GitHub Actions has no if() function; use a ternary pattern instead: inputs.env_name == 'prod' && '3' || '1'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-environment.yaml around lines 53 - 66, The reusable
workflow currently only sets a subset of env vars (TF_VAR_azure_openai_*,
TF_VAR_gateway_key, TF_VAR_codex_*, etc.), causing caller-level
secrets/TF_BACKEND_*/AZURE_CLIENT_ID/AZURE_TENANT_ID/AZURE_SUBSCRIPTION_ID,
EXPECTED_AOAI_ENDPOINT_HOST and TF_VAR_state_service_container_image to be empty
at apply time and breaking guards; update the env block to explicitly pass all
required variables from inputs/secrets (including AZURE_CLIENT_ID,
AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, TF_BACKEND_*,
TF_VAR_state_service_container_image, EXPECTED_AOAI_ENDPOINT_HOST and any other
Terraform TF_VAR_* the plan relies on) so plan and apply see the same inputs,
and replace the invalid GitHub Actions function syntax if(inputs.env_name ==
'prod', '3', '1') with the supported ternary-like expression inputs.env_name ==
'prod' && '3' || '1' to restore the Line 211 value evaluation and ensure the
prod AOAI guard and Line 215 smoke gate run as intended.

aoai_api_key: ${{ env.TF_VAR_azure_openai_api_key }}
max_attempts: "3"
retry_sleep: ${{ inputs.smoke_retry_sleep }}
models_wait_attempts: ${{ if(inputs.env_name == 'prod', '3', '1') }}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does GitHub Actions expression syntax support `if()` inside `${{ }}` expressions, or should ternary behavior be expressed with `&&`/`||`?

💡 Result:

GitHub Actions expressions do not provide an if() function you can call inside ${{ ... }}. ${{ ... }} supports literals, operators (including && / ||), and a fixed set of built-in functions—if() is not one of them. [1]

So for “ternary-like” value selection you have two options:

  • Recommended (now supported): use the case() expression function for real if/else behavior, e.g.
    ${{ case(condition, 'a', 'b') }}. [2][3]
  • Older workaround: use condition && valueIfTrue || valueIfFalse (note this can misbehave if valueIfTrue is falsy like '', 0, or false, because || will then fall through). && / || are the supported logical operators. [1]

Sources:
[1] GitHub Docs – Operators in expressions (&&, ||)
[2] GitHub Docs – case() function reference and examples
[3] GitHub Changelog (Jan 29, 2026) – announcement of new case function


Replace the unsupported if() expression with the case() function.

GitHub Actions expressions do not support if() as a function inside ${{ }}. Use the case() expression function for conditional value selection:

Suggested fix
-          models_wait_attempts: ${{ if(inputs.env_name == 'prod', '3', '1') }}
+          models_wait_attempts: ${{ case(inputs.env_name == 'prod', '3', '1') }}

Alternatively, the condition && value1 || value2 pattern works for non-falsy values but has pitfalls if value1 is falsy.
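The falsy-value pitfall mentioned above is easy to demonstrate with Python's analogous `and`/`or` short-circuit idiom. This is a sketch of the logic, not GitHub Actions itself; the function name is invented for illustration:

```python
def ternary_like(env_name: str, if_prod: str, otherwise: str) -> str:
    """Mimic the GitHub Actions `cond && value1 || value2` pattern using
    Python's short-circuit `and`/`or`, which has the same falsy pitfall."""
    return (env_name == "prod") and if_prod or otherwise

# Works as a ternary when the "true" branch is truthy:
assert ternary_like("prod", "3", "1") == "3"
assert ternary_like("dev", "3", "1") == "1"

# Pitfall: a falsy "true" branch ("" here) falls through to the else value.
assert ternary_like("prod", "", "1") == "1"  # intended "", got "1"
```

Because `'3'` and `'1'` are both non-empty strings, the pattern is safe for this particular `models_wait_attempts` case, but `case()` avoids the hazard entirely.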

🧰 Tools
🪛 actionlint (1.7.11)

[error] 211-211: undefined function "if". available functions are "always", "cancelled", "case", "contains", "endswith", "failure", "format", "fromjson", "hashfiles", "join", "startswith", "success", "tojson"

(expression)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-environment.yaml at line 211, The
models_wait_attempts assignment uses the unsupported if() expression; update the
expression for models_wait_attempts to use GitHub Actions' case() function to
select '3' when inputs.env_name == 'prod' otherwise '1' (reference the
models_wait_attempts key in the deploy-environment workflow); ensure the case()
expression compares inputs.env_name to 'prod' and returns the correct string
values, or alternatively use the safe conditional pattern (condition && value1
|| value2) only if value1 cannot be falsy.

Comment on lines +154 to +170
uses: ./.github/workflows/deploy-environment.yaml
with:
  env_name: dev
  tf_state_key: dev.terraform.tfstate
  codex_model: gpt-5.3-codex
  codex_api_version: 2025-04-01-preview
  terraform_working_directory: infra/env/dev
  smoke_retry_sleep: "10"
  smoke_models_wait_sleep: "15"
  include_aoai_host_check: false
secrets:
  AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
  AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
  AZURE_OPENAI_EMBEDDING_ENDPOINT: ${{ secrets.AZURE_OPENAI_EMBEDDING_ENDPOINT }}
  AZURE_OPENAI_EMBEDDING_API_KEY: ${{ secrets.AZURE_OPENAI_EMBEDDING_API_KEY }}
  AIGATEWAY_KEY: ${{ secrets.AIGATEWAY_KEY }}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does GitHub Actions allow `environment:` on a job that calls a reusable workflow via `uses:`? If not, what is the documented way to bind the called job to dev, staging, or prod?

💡 Result:

No. A job that calls a reusable workflow via jobs.<job_id>.uses: can only use a limited set of keywords, and environment is not one of them (so environment: on the calling job is not supported). [1]

Documented way to bind to dev / staging / prod:

  1. Pass the target environment name as an input from the caller (often via a matrix of dev/stage/prod). [2]
  2. Set environment: inside the reusable workflow’s job, using that input (the environment name can be an expression and can use the inputs context). [3]
  3. This is also why environment secrets can’t be passed from the caller via workflow_call (it doesn’t support environment), so the environment needs to be selected in the called workflow’s job. [2]

Example:

Caller:

jobs:
  deploy:
    strategy:
      matrix:
        target: [dev, staging, prod]
    uses: org/repo/.github/workflows/deploy.yml@main
    with:
      target: ${{ matrix.target }}

[2]

Reusable workflow (deploy.yml):

on:
  workflow_call:
    inputs:
      target:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment:
      name: ${{ inputs.target }}
    steps:
      - run: echo "Deploying to ${{ inputs.target }}"

[3]

Sources: GitHub Docs on reusable-workflow keyword limits [1], reusable workflows + environment-secret note + matrix example [2], and workflow syntax for jobs.<job_id>.environment (name can use inputs) [3].


🏁 Script executed:

cd .github/workflows && head -n 220 deploy.yaml | tail -n 70

Repository: phoenixvc/ai-gateway

Length of output: 2993


Remove environment: from all three reusable-workflow caller jobs.

Jobs that call a reusable workflow via uses: cannot use the environment: key. Currently, deploy-dev, deploy-staging, and deploy-prod all specify environment: at the job level alongside uses:, which violates GitHub Actions job schema. Move environment selection into the called workflow (deploy-environment.yaml) by setting environment: there with the input value (e.g., environment: ${{ inputs.env_name }}). The caller jobs already pass the environment name correctly via env_name input and only need the invalid environment: key removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy.yaml around lines 154 - 170, The job-level
environment key must be removed from the reusable-workflow caller jobs (e.g.,
deploy-dev, deploy-staging, deploy-prod) because jobs that use `uses:` cannot
include `environment:`; update each caller to only pass the env via the
`env_name` input (already present) and then add `environment: ${{
inputs.env_name }}` inside the called workflow `deploy-environment.yaml` (use
the `inputs.env_name` value in that workflow's job definition) so the
environment selection is handled by the reusable workflow rather than the
callers.

- cognitive-mesh: Pass correlation metadata in request body
- pvc-costops-analytics: Must create KQL queries for new event shape
- infra: May need custom LiteLLM container image or OTEL collector
- infra: Application Insights resource + APPLICATIONINSIGHTS_CONNECTION_STRING wiring added; trace export requires custom LiteLLM image (with azure-monitor-opentelemetry) or explicit OTEL_EXPORTER_OTLP_ENDPOINT configuration (currently empty by default)

⚠️ Potential issue | 🟠 Major

Documented secret-handling policy conflicts with Terraform output behavior.

This doc says secrets should go through Key Vault and (elsewhere in this file) says to never output secrets from Terraform, but the PR summary indicates a new application_insights_connection_string Terraform output was added. Please reconcile this to avoid normalizing secret exposure via outputs.

🔒 Suggested doc correction (if output remains temporarily)
-- infra: Application Insights resource + APPLICATIONINSIGHTS_CONNECTION_STRING wiring added; trace export requires custom LiteLLM image (with azure-monitor-opentelemetry) or explicit OTEL_EXPORTER_OTLP_ENDPOINT configuration (currently empty by default)
+- infra: Application Insights resource + Key Vault-backed APPLICATIONINSIGHTS_CONNECTION_STRING wiring added; avoid exposing connection strings as Terraform outputs, and use OTEL_EXPORTER_OTLP_ENDPOINT/custom image as required for export path

Also applies to: 229-229

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/planning/request_to_token_attribution.md` at line 221, The doc and
Terraform behavior conflict because a new Terraform output
application_insights_connection_string exposes a secret while the policy
mandates routing secrets through Key Vault and never outputting them; either
remove the Terraform output or mark it as sensitive and change the deployment to
store the connection string in Key Vault (and update any consumers to read from
Key Vault), then update this doc (and the related note at lines ~229) to state
the chosen approach (no outputs for secrets or sensitive output with Key
Vault-backed handling) and reference the application_insights_connection_string
symbol and Key Vault usage so readers know how to retrieve the secret securely.

@JustAGhosT JustAGhosT merged commit 2bc7860 into dev Mar 15, 2026
1 check passed
@JustAGhosT JustAGhosT deleted the feat/infra-alignment branch March 15, 2026 11:44
@coderabbitai coderabbitai bot mentioned this pull request Mar 15, 2026
4 tasks