61 changes: 50 additions & 11 deletions docs/planning/request_to_token_attribution.md
@@ -29,20 +29,37 @@ Document the current architecture:

## What Can Be Done in This Repo

### Phase 1: LiteLLM Custom Callback (Core Implementation)
### Phase 1: OpenTelemetry Integration (Core Implementation)

**Files to modify:**
**Implementation Approach:**

- `infra/modules/aigateway_aca/main.tf` - Add custom callback config
- Create new custom callback module
Instead of a custom callback (which requires a custom LiteLLM image), we're using LiteLLM's built-in OpenTelemetry support. This provides structured traces with token telemetry out of the box.

**Approach:** Use LiteLLM's CustomLogger class or success_callback to emit structured token telemetry
**Changes Made:**

**Challenge:** LiteLLM runs as a container image - need to either:
- Added `otel` to LiteLLM `success_callback` and `failure_callback` in `infra/modules/aigateway_aca/main.tf`
- Added OTEL environment variables:
  - `OTEL_SERVICE_NAME` - service name for traces
  - `OTEL_TRACER_NAME` - tracer name
  - `OTEL_EXPORTER_OTLP_ENDPOINT` - OTLP collector endpoint
  - `OTEL_EXPORTER_OTLP_PROTOCOL` - protocol (http/json)
- Added new variables in `infra/modules/aigateway_aca/variables.tf`:
  - `otel_exporter_endpoint` - OTLP collector URL
  - `otel_service_name` - custom service name

1. Build a custom image with callback baked in
2. Use environment variables + config-based callback
3. Add a sidecar container
**How It Works:**

LiteLLM's OTEL callback automatically emits spans with:

- Model name, provider, deployment
- Token usage (prompt_tokens, completion_tokens, total_tokens)
- Duration
- Request/response metadata

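For illustration, a span from the OTEL callback might carry attributes like these. This is a hypothetical sketch: the exact attribute keys vary by LiteLLM version and are not taken from this repo; the names below loosely follow the OpenTelemetry `gen_ai` semantic conventions.

```python
# Hypothetical span attributes from the OTEL callback; key names are
# illustrative only, loosely following OpenTelemetry gen_ai conventions.
span_attributes = {
    "gen_ai.request.model": "gpt-5.3-codex",  # model name
    "gen_ai.system": "azure",                 # provider
    "gen_ai.usage.prompt_tokens": 42,         # token usage
    "gen_ai.usage.completion_tokens": 128,
    "gen_ai.usage.total_tokens": 170,
    "duration_ms": 850,                       # request duration
}

# Token counts within a span should reconcile:
assert (span_attributes["gen_ai.usage.prompt_tokens"]
        + span_attributes["gen_ai.usage.completion_tokens"]
        == span_attributes["gen_ai.usage.total_tokens"])
```

Downstream analytics can then aggregate on these attributes without parsing log lines.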
**Files Modified:**

- `infra/modules/aigateway_aca/main.tf` - Added OTEL callback and env vars
- `infra/modules/aigateway_aca/variables.tf` - Added OTEL configuration variables

### Phase 2: Correlation ID Propagation

@@ -59,7 +76,27 @@

### 1. cognitive-mesh (Upstream Caller)

Required: Must pass correlation headers when calling gateway:
**Required:** Must pass correlation headers when calling gateway. There are two methods:

**Method A: Via Request Metadata (Recommended)**
Pass correlation IDs in the request body `metadata` field:

```json
{
"model": "gpt-5.3-codex",
"messages": [{ "role": "user", "content": "Hello" }],
"metadata": {
"request_id": "req_123",
"session_id": "sess_456",
"workflow": "manual_orchestration",
"stage": "writer",
"endpoint": "/api/manual-orchestration/sessions/start",
"user_id": "user_abc"
}
}
```
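
A caller-side sketch of Method A, assuming a hypothetical gateway URL and API key (neither is defined in this repo); the payload shape mirrors the JSON example above, with correlation IDs under `metadata`:

```python
import json

GATEWAY_URL = "https://ai-gateway.example.com"  # placeholder, not from this repo

def build_payload(model: str, content: str, **correlation) -> dict:
    """Assemble a chat-completions body with correlation IDs in metadata."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "metadata": dict(correlation),
    }

payload = build_payload(
    "gpt-5.3-codex",
    "Hello",
    request_id="req_123",
    session_id="sess_456",
    workflow="manual_orchestration",
    stage="writer",
)
body = json.dumps(payload)

# Sending it would look like this (left commented; needs the `requests` package):
# requests.post(f"{GATEWAY_URL}/v1/chat/completions", data=body,
#               headers={"Authorization": "Bearer <litellm-key>",
#                        "Content-Type": "application/json"}, timeout=30)
```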

**Method B: Via HTTP Headers**

- x-request-id
- x-session-id
@@ -68,9 +105,11 @@ Required: Must pass correlation headers when calling gateway:
- x-stage-name
- x-user-id

_Note: Method B requires additional LiteLLM configuration or middleware._
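
A sketch of Method B's header set, using only the header names listed here (values are hypothetical; per the note above, these reach span attributes only if LiteLLM is configured or middleware maps them through):

```python
# Correlation headers for Method B (hypothetical values).
correlation_headers = {
    "x-request-id": "req_123",
    "x-session-id": "sess_456",
    "x-stage-name": "writer",
    "x-user-id": "user_abc",
}

# HTTP header names are case-insensitive; normalize before lookups.
normalized = {k.lower(): v for k, v in correlation_headers.items()}
```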

### 2. pvc-costops-analytics (Downstream Analytics)

Required: KQL queries and dashboards to:
**Required:** KQL queries and dashboards to:

- Join requests table to token events by operation_Id/request_id
- Aggregate rollups by endpoint, workflow, stage, model, deployment
39 changes: 33 additions & 6 deletions infra/modules/aigateway_aca/main.tf
@@ -41,6 +41,7 @@ locals {
# Features enabled here:
# - JSON structured logging → Log Analytics Workspace via Container Apps stdout
# - Prometheus /metrics endpoint (built-in, no extra infra)
# - OpenTelemetry tracing for request-to-token attribution
# - Langfuse tracing (when both langfuse_public_key and langfuse_secret_key are provided)
# - Redis semantic caching (when enable_redis_cache = true)
# - Global budget / rate limits (when set above 0)
@@ -73,16 +74,13 @@ locals {
num_retries: 2
json_logs: true
# Prometheus /metrics: token usage, latency and error rate at <gateway>/metrics
# OpenTelemetry: structured traces with token telemetry for attribution
success_callback:
- prometheus
%{if var.langfuse_public_key != "" && var.langfuse_secret_key != ""~}
- langfuse
%{endif}
- otel
failure_callback:
- prometheus
%{if var.langfuse_public_key != "" && var.langfuse_secret_key != ""~}
- langfuse
%{endif}
- otel
%{if var.enable_redis_cache~}
# Redis: deduplicate identical requests to reduce Azure OpenAI token spend
cache: true
@@ -402,6 +400,35 @@ resource "azurerm_container_app" "ca" {
}
}

# OpenTelemetry configuration for request-to-token attribution
# OTEL_EXPORTER_OTLP_ENDPOINT should be set to your OTLP collector URL
# e.g., https://your-collector.eastus.azure.com:4318
env {
name = "OTEL_SERVICE_NAME"
value = "ai-gateway-${var.env}"
}

env {
name = "OTEL_TRACER_NAME"
value = "litellm"
}

dynamic "env" {
for_each = var.otel_exporter_endpoint != "" ? [var.otel_exporter_endpoint] : []
content {
name = "OTEL_EXPORTER_OTLP_ENDPOINT"
value = env.value
}
}

dynamic "env" {
for_each = var.otel_exporter_endpoint != "" ? ["http/json"] : []
content {
name = "OTEL_EXPORTER_OTLP_PROTOCOL"
value = env.value
}
}

# LiteLLM commonly listens on 4000; set port as needed
}
}
13 changes: 13 additions & 0 deletions infra/modules/aigateway_aca/variables.tf
@@ -211,3 +211,16 @@ variable "tpm_limit" {
description = "Global tokens-per-minute cap across all API keys (0 = no limit)."
default = 0
}

# OpenTelemetry configuration for request-to-token attribution
variable "otel_exporter_endpoint" {
type = string
description = "OpenTelemetry OTLP exporter endpoint (e.g., https://collector.example.com:4318). Leave empty to disable OTEL tracing."
default = ""
}

variable "otel_service_name" {
type = string
description = "OpenTelemetry service name for tracing."
default = "ai-gateway"
}