feat: Add OTEL-enhanced ext_proc for zero-agent GenAI observability by Ladas · Pull Request #119 · kagenti/kagenti-extensions

Ladas · 2026-02-17T21:17:54Z

Summary

Adds AuthBridge/otel-ext-proc/ — a Go gRPC ext_proc that creates OpenTelemetry root spans and nested child spans by parsing A2A SSE stream events. Enables full GenAI observability with zero OTEL code in agents.

Related

Design least-invasive OTEL GenAI observability for agents (zero/minimal code changes) kagenti#667 (design issue)
feat: Approach A - AuthBridge ext_proc root span for zero-agent OTEL observability kagenti#668 (kagenti PR with Helm/deploy changes)
feat: Minimal agent for AuthBridge OTEL (Approach A, zero custom observability) agent-examples#122 (zero-OTEL agent)

Features

Root span invoke_agent {name} with gen_ai.* attributes
Child spans chat {model} and execute_tool {name} from SSE events
Token usage extraction from JSON LangChain messages
W3C traceparent injection for distributed tracing
Graceful client disconnect with tasks/get fallback
Span names per OTel GenAI spec
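
To make the traceparent feature concrete, here is a minimal stdlib-only sketch of the header value defined by the W3C Trace Context specification. It is not taken from this PR, which presumably relies on the OpenTelemetry SDK propagator; the function name `formatTraceparent` is hypothetical, and the IDs in `main` are the example values from the W3C specification.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// formatTraceparent renders a W3C Trace Context "traceparent" value:
// version "00", a 16-byte trace ID, an 8-byte parent span ID, and one
// flags byte, all lowercase hex and separated by dashes.
// Hypothetical helper for illustration; not part of the PR.
func formatTraceparent(traceID [16]byte, spanID [8]byte, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01" // bit 0 set means the caller sampled this trace
	}
	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(traceID[:]),
		hex.EncodeToString(spanID[:]),
		flags)
}

func main() {
	traceID := [16]byte{0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
		0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36}
	spanID := [8]byte{0x00, 0xf0, 0x67, 0xaa, 0x0b, 0xa9, 0x02, 0xb7}
	// prints 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
	fmt.Println(formatTraceparent(traceID, spanID, true))
}
```

An agent that honors this header would record the span ID in the third field as the parent of its own spans, which is how they end up nested under the ext_proc root span.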

Architecture

ext_proc sets gen_ai.* only → OTEL Collector transforms → MLflow (mlflow.*) + Phoenix (openinference.*)

Test plan

38 E2E tests pass on HyperShift cluster
3 Playwright UI tests pass
Token counts visible in MLflow and Phoenix
Sessions appear in MLflow chat-sessions UI

🤖 Generated with Claude Code

Add OpenTelemetry root span creation and SSE stream parsing to the existing go-processor.

When OTEL_TRACING_ENABLED=true, the ext_proc:
- Creates invoke_agent root spans with GenAI semantic conventions
- Injects W3C traceparent so agent spans become children
- Parses A2A request body for user input and conversation ID
- Parses SSE response stream for LLM/tool child spans with token counts
- Handles client disconnect via tasks/resubscribe fallback
- Extracts agent output from artifact events or tasks/get

All existing functionality (JWT validation, token exchange, resolver) is preserved. OTEL is opt-in via environment variable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>

When the A2A response is a buffered JSON-RPC response (not SSE events), the ext_proc now extracts child spans from result.history. This handles the case where the A2A SDK returns the complete task with history messages instead of streaming SSE events. The history messages contain the same LangGraph step format (🚶‍♂️assistant: / 🚶‍♂️tools:) as SSE events, with full LangChain metadata for token counts, model names, and tool call details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>

The A2A SDK merges the final LLM response into the artifact/completion rather than including it in result.history. This means the ext_proc only sees 2 of the 3 LangGraph steps (LLM→tool→LLM). When the last history event is a tool call and the task completed with an artifact, infer the final LLM call and create a "chat" child span for it with the artifact text as gen_ai.completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>

pdettori

Review Summary

Solid architecture — root spans with child spans parsed from SSE events, W3C traceparent injection, and disconnect recovery via tasks/resubscribe. The GenAI semantic convention usage looks correct and the opt-in via OTEL_TRACING_ENABLED is clean.

Two blocking issues: mutex held during OTEL SDK calls (contention risk under concurrent streams), and JSON injection via unescaped taskID in fmt.Sprintf. Several suggestions for hardcoded values, Docker build reproducibility, and unbounded span attributes.

Areas reviewed: Go (concurrency, error handling, security), Dockerfile, go.mod dependencies
Commits: 3 commits, all signed-off ✓
CI: All checks passing ✓

Note on commit attribution: All 3 commits use Co-Authored-By: Claude Opus 4.6. The kagenti convention prefers Assisted-By for AI attribution to avoid inflating contributor stats. Consider amending if the extensions repo follows the same policy.

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/main.go

+// For each SSE chunk, it parses events and creates nested child spans for
+// LLM and tool events. On end_of_stream, it sets the output on the root span.
+func (p *processor) handleResponseBody(stream v3.ExternalProcessor_ProcessServer, body []byte, endOfStream bool) *v3.ProcessingResponse {
+	p.mu.Lock()


must-fix: Mutex held across OTEL SDK calls and SSE parsing

p.mu.Lock() is acquired here and not released until the manual unlock in two branches (~37 lines later). The entire SSE parsing loop — parseSSEEvents, classifySSEEvent, createChildSpan (which calls otelTracer.Start) — runs while the mutex is held.

Under concurrent streams this creates contention and potential deadlock risk if any OTEL SDK call blocks.

Suggested fix: Copy state out under the lock, then operate on the copy:

    p.mu.Lock()
    state := p.streamSpans[stream]
    p.mu.Unlock()

    if state != nil {
    	// SSE parsing and span operations on state (no lock needed)
    }

    // Re-acquire lock only for map mutation
    if endOfStream {
    	p.mu.Lock()
    	delete(p.streamSpans, stream)
    	p.mu.Unlock()
    }

Also consider using defer p.mu.Unlock() instead of manual unlock in two branches — the current pattern is fragile if new code paths are added.
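
The two suggestions can be combined by moving each locked section into a small helper that uses defer. The following is a runnable sketch under stated assumptions, not the PR's code: `streamState` is a hypothetical stand-in for the per-stream span state, the map is keyed by a string rather than by the gRPC stream, and the counter increment stands in for SSE parsing and span creation.

```go
package main

import (
	"fmt"
	"sync"
)

// streamState is a hypothetical stand-in for the per-stream span state.
type streamState struct {
	events int
}

type processor struct {
	mu          sync.Mutex
	streamSpans map[string]*streamState // assumption: string key for brevity
}

// lookup holds the lock only for the map read; defer covers every return path.
func (p *processor) lookup(key string) *streamState {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.streamSpans[key]
}

// remove holds the lock only for the map mutation.
func (p *processor) remove(key string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.streamSpans, key)
}

// handleChunk does the slow work outside the lock and returns the number of
// chunks seen so far for the stream (0 if the stream is unknown).
func (p *processor) handleChunk(key string, endOfStream bool) int {
	state := p.lookup(key)
	count := 0
	if state != nil {
		// Safe without the lock only while a single goroutine owns this
		// stream's state.
		state.events++
		count = state.events
	}
	if endOfStream {
		p.remove(key)
	}
	return count
}

func main() {
	p := &processor{streamSpans: map[string]*streamState{"s1": {}}}
	fmt.Println(p.handleChunk("s1", false)) // first chunk
	fmt.Println(p.handleChunk("s1", true))  // last chunk, entry removed
	fmt.Println(p.handleChunk("s1", false)) // unknown stream now
}
```

This works only if each stream's state is touched by one goroutine at a time. If the disconnect-recovery goroutine can also mutate the same state, the state itself would need its own synchronization.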

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/main.go

+// resubscribeAndCapture opens a new SSE streaming connection to the agent's
+// tasks/resubscribe endpoint for disconnect recovery.
+func resubscribeAndCapture(cancelCtx context.Context, taskID string, span trace.Span, spanCtx context.Context, startIndex int) (string, int) {
+	reqBody := fmt.Sprintf(`{"jsonrpc":"2.0","id":"ext-proc-resub","method":"tasks/resubscribe","params":{"id":"%s"}}`, taskID)


must-fix: JSON injection via unescaped taskID in fmt.Sprintf

taskID is interpolated directly into a JSON string with %s. If the agent ever returns a task ID containing " or \, this produces malformed or potentially manipulated JSON. Same issue at line 1486 (tasks/get).

Fix: Use encoding/json to marshal the request struct:

    type jsonRPCRequest struct {
    	JSONRPC string      `json:"jsonrpc"`
    	ID      string      `json:"id"`
    	Method  string      `json:"method"`
    	Params  interface{} `json:"params"`
    }

    reqBytes, _ := json.Marshal(jsonRPCRequest{
    	JSONRPC: "2.0",
    	ID:      "ext-proc-resub",
    	Method:  "tasks/resubscribe",
    	Params:  map[string]string{"id": taskID},
    })
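
To show the difference in behavior, the sketch below builds the same request both ways for a task ID that contains a double quote. The helper names `buildUnsafe` and `buildSafe` are hypothetical; the format string is copied from the diff, and `Params` is narrowed to `map[string]string` for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type jsonRPCRequest struct {
	JSONRPC string            `json:"jsonrpc"`
	ID      string            `json:"id"`
	Method  string            `json:"method"`
	Params  map[string]string `json:"params"`
}

// buildUnsafe mirrors the fmt.Sprintf approach from the diff: taskID is
// pasted into the JSON text without escaping.
func buildUnsafe(taskID string) []byte {
	return []byte(fmt.Sprintf(
		`{"jsonrpc":"2.0","id":"ext-proc-resub","method":"tasks/resubscribe","params":{"id":"%s"}}`,
		taskID))
}

// buildSafe lets encoding/json escape taskID and surfaces the marshal error
// instead of discarding it.
func buildSafe(taskID string) ([]byte, error) {
	return json.Marshal(jsonRPCRequest{
		JSONRPC: "2.0",
		ID:      "ext-proc-resub",
		Method:  "tasks/resubscribe",
		Params:  map[string]string{"id": taskID},
	})
}

func main() {
	taskID := `abc"def` // a task ID containing a double quote

	fmt.Println("unsafe valid JSON:", json.Valid(buildUnsafe(taskID)))

	safe, err := buildSafe(taskID)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println("safe valid JSON:  ", json.Valid(safe))
	fmt.Println(string(safe))
}
```

A task ID such as `x","extra":"y` is the more worrying case: with `fmt.Sprintf` it still yields valid JSON, but with an additional `extra` parameter that the caller never intended to send.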

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/main.go

+func resubscribeAndCapture(cancelCtx context.Context, taskID string, span trace.Span, spanCtx context.Context, startIndex int) (string, int) {
+	reqBody := fmt.Sprintf(`{"jsonrpc":"2.0","id":"ext-proc-resub","method":"tasks/resubscribe","params":{"id":"%s"}}`, taskID)
+
+	req, err := http.NewRequestWithContext(cancelCtx, "POST", "http://127.0.0.1:8000/", strings.NewReader(reqBody))


suggestion: Hardcoded agent URL http://127.0.0.1:8000/

This appears here and again at line 1486 (fetchTaskResult). If the agent runs on a different port or the ext_proc is not co-located, these calls silently fail.

Consider making this configurable:

agentURL = getEnvOrDefault("AGENT_URL", "http://127.0.0.1:8000")

Also note: these requests carry no Authorization header. If the agent requires auth, the calls will fail. The trust model (ext_proc on loopback) should be documented.
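
For reference, one possible shape for the helper named above. `getEnvOrDefault` is only named in this comment, so the body below is an assumption about how it might behave (treating an empty value the same as an unset one), not the repository's implementation.

```go
package main

import (
	"fmt"
	"os"
)

// getEnvOrDefault returns the value of the environment variable key, or
// fallback when the variable is unset or empty.
// Assumed behavior for illustration only.
func getEnvOrDefault(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

func main() {
	// AGENT_URL is the variable name proposed in the review comment.
	agentURL := getEnvOrDefault("AGENT_URL", "http://127.0.0.1:8000")
	fmt.Println("agent URL:", agentURL)
}
```

Resolving the value once at startup and logging it would also make a misconfigured loopback address visible in the pod logs rather than failing silently on the first disconnect.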

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/main.go

+// OTEL tracing setup
+// ============================================================================
+
+func initOtelTracing() error {


suggestion: Default values are domain-specific examples

"weather-assistant", "weather-service" are baked in as defaults. A misconfigured deployment will silently report all traces as coming from weather-assistant. Consider either:

Empty string defaults with a startup warning log

Or at minimum, values like "unknown-agent" / "unknown-service" that are obviously wrong in dashboards
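
A small sketch of the second option, combined with a startup warning. The function name `nameOrPlaceholder` and the environment variable `AGENT_NAME` are hypothetical; the variable names used by initOtelTracing are not shown in this diff hunk.

```go
package main

import (
	"log"
	"os"
)

// nameOrPlaceholder reads envKey and falls back to an obviously wrong
// placeholder, logging a warning so the misconfiguration shows up at startup.
// The boolean reports whether the value was explicitly configured.
func nameOrPlaceholder(envKey, placeholder string) (string, bool) {
	v := os.Getenv(envKey)
	if v == "" {
		log.Printf("WARNING: %s is not set; traces will be reported as %q", envKey, placeholder)
		return placeholder, false
	}
	return v, true
}

func main() {
	// AGENT_NAME is a hypothetical variable name used for illustration.
	name, configured := nameOrPlaceholder("AGENT_NAME", "unknown-agent")
	log.Printf("agent name %q (configured=%v)", name, configured)
}
```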

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/Dockerfile

 COPY go-processor/ ./go-processor/

-RUN CGO_ENABLED=0 GOOS=linux go build -o /go-processor ./go-processor
+RUN go mod tidy && CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -o /go-processor ./go-processor


suggestion: go mod tidy at build time breaks reproducibility

Running go mod tidy during the Docker build means the build can pull different transitive dependency versions at different times. Combined with go.sum* glob on line 7 (which allows a missing go.sum entirely), this can silently introduce new dependencies.

Standard practice: commit go.sum, COPY go.sum (not go.sum*), go mod download for caching, then build.

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/Dockerfile


-COPY go.mod go.sum ./
-RUN go mod download
+COPY go.mod go.sum* ./


suggestion: go.sum* glob allows building without a lock file

If go.sum is absent, the * glob silently skips it and go mod tidy on line 11 generates it at build time. This means builds are not pinned to specific dependency versions. Consider committing go.sum and removing the glob.

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/main.go

+	// MLflow/OpenInference attributes are derived by the OTEL Collector.
+	if userInput != "" {
+		state.span.SetAttributes(
+			attribute.String("gen_ai.prompt", userInput),


suggestion: gen_ai.prompt attribute has no size limit

userInput is set as a span attribute without truncation. Later in the code, artifact output is truncated to 1000 chars (good), but prompts are not. Very long prompts could cause issues with some OTEL collectors or backends.

Consider applying the same truncation pattern used for gen_ai.completion.
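
A sketch of what a shared truncation helper could look like. The existing truncation code is not shown in this hunk, so `truncateRunes` is a hypothetical name, and counting runes rather than bytes is an assumption made here to avoid cutting a multi-byte UTF-8 character in half.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes returns s cut to at most limit runes, never splitting a
// multi-byte UTF-8 sequence. Hypothetical helper for illustration.
func truncateRunes(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	if utf8.RuneCountInString(s) <= limit {
		return s
	}
	seen := 0
	for i := range s { // i is the byte offset of each rune
		if seen == limit {
			return s[:i]
		}
		seen++
	}
	return s
}

func main() {
	userInput := "héllo wörld"
	fmt.Println(truncateRunes(userInput, 4)) // prints "héll"
}
```

Both gen_ai.prompt and gen_ai.completion could then go through the same helper with the same limit, which keeps the two attributes consistent.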

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/main.go

+			buf = append(buf, readBuf[:n]...)
+
+			for {
+				idx := strings.Index(string(buf), "\n\n")


nit: string(buf) conversion on every read loop iteration

This performs a full []byte to string copy and linear scan on every iteration. For a long-running SSE stream, this is O(n) per read where n is the total buffered data.

    // Use bytes.Index instead:
    idx := bytes.Index(buf, []byte("\n\n"))
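
For context, a self-contained sketch of the surrounding frame-splitting loop written with bytes.Index. The function name `splitFrames` is hypothetical, and like the code in the diff it only recognizes "\n\n" as the frame separator (SSE also allows CR and CRLF line endings).

```go
package main

import (
	"bytes"
	"fmt"
)

var frameSep = []byte("\n\n")

// splitFrames returns every complete SSE frame in buf plus the unconsumed
// remainder, which the caller keeps and prepends to the next read.
// The returned slices alias buf; copy them if buf is reused.
func splitFrames(buf []byte) (frames [][]byte, rest []byte) {
	for {
		idx := bytes.Index(buf, frameSep) // no []byte to string conversion
		if idx < 0 {
			return frames, buf
		}
		frames = append(frames, buf[:idx])
		buf = buf[idx+len(frameSep):]
	}
}

func main() {
	frames, rest := splitFrames([]byte("data: a\n\ndata: b\n\ndata: par"))
	for _, f := range frames {
		fmt.Printf("frame=%q\n", f)
	}
	fmt.Printf("rest=%q\n", rest)
}
```

Because each frame is sliced off the front as soon as it is found, the scan only covers the bytes that have not yet been consumed.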

pdettori · 2026-03-11T23:02:18Z

AuthBridge/AuthProxy/go-processor/main.go

+
+	// Skip OTEL span creation for non-API paths (agent card, health)
+	reqPath := getHeaderValue(headers.Headers, ":path")
+	isAPIRequest := reqPath == "/" || strings.HasPrefix(reqPath, "/?")


nit: Path detection may be too narrow

This only matches / and /?*, so A2A requests to sub-paths (e.g., /api/v1/agent) would skip tracing entirely. If that is intentional, a comment explaining which paths are expected would help. If not, consider a positive check for non-API paths instead:

    isNonAPIPath := reqPath == "/.well-known/agent-card.json" ||
    	reqPath == "/health" ||
    	reqPath == "/healthz"
    isAPIRequest := !isNonAPIPath
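
A slightly fuller sketch of the exclusion-list approach, which also strips a query string before matching so that a probe such as /health?probe=1 is still skipped. The three paths come from the snippet above; the map and the function form are illustrative, not the PR's code.

```go
package main

import (
	"fmt"
	"strings"
)

// nonAPIPaths lists the paths that should not start a trace.
var nonAPIPaths = map[string]bool{
	"/.well-known/agent-card.json": true,
	"/health":                      true,
	"/healthz":                     true,
}

// isAPIRequest reports whether reqPath (the :path pseudo-header, which may
// carry a query string) should be traced.
func isAPIRequest(reqPath string) bool {
	path, _, _ := strings.Cut(reqPath, "?") // drop the query string, if any
	return !nonAPIPaths[path]
}

func main() {
	for _, p := range []string{"/", "/?x=1", "/api/v1/agent", "/health?probe=1"} {
		fmt.Printf("%-16s traced=%v\n", p, isAPIRequest(p))
	}
}
```

The trade-off is that any new non-API endpoint has to be added to the list explicitly, otherwise it starts producing root spans.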

Ladas force-pushed the feat/otel-ext-proc branch from b0623b0 to e42f45c · February 17, 2026 22:10

Ladas and others added 2 commits February 18, 2026 10:52

pdettori added this to Kagenti Issue Prioritization Mar 4, 2026

github-project-automation bot moved this to Backlog in Kagenti Issue Prioritization Mar 4, 2026

pdettori requested changes Mar 11, 2026


rubambiza mentioned this pull request Mar 23, 2026

Org Weekly Report 2026-03-16 -- 2026-03-23 kagenti/kagenti#1094

Open

Ladas wants to merge 3 commits into kagenti:main from Ladas:feat/otel-ext-proc
