
feat: add CloudWatchProvider to pull remote CloudWatch traces and run evals against them. #147

Merged
afarntrog merged 10 commits into strands-agents:main from afarntrog:trace_provider_cloudwatch
Feb 27, 2026

Conversation


@afarntrog afarntrog commented Feb 26, 2026

Description

Add CloudWatchProvider — a TraceProvider implementation that fetches agent execution traces from AWS CloudWatch Logs (Bedrock AgentCore runtime logs) and converts them into the evaluation pipeline's Session format.

This enables running evaluators (OutputEvaluator, CoherenceEvaluator, HelpfulnessEvaluator, etc.) against production agent traces without re-executing the agent. For example:

from strands_evals import Case, Experiment
from strands_evals.evaluators import CoherenceEvaluator, OutputEvaluator
from strands_evals.providers import CloudWatchProvider

provider = CloudWatchProvider(
    log_group="/aws/bedrock-agentcore/runtimes/my-agent-abc123-DEFAULT",
    region="us-east-1",
)

def task(case: Case) -> dict:
    return provider.get_evaluation_data(case.input)

cases = [Case(name="session_1", input="my-session-id", expected_output="any")]
experiment = Experiment(cases=cases, evaluators=[OutputEvaluator(...), CoherenceEvaluator()])
reports = experiment.run_evaluations(task)

What's included

CloudWatchProvider (src/strands_evals/providers/cloudwatch_provider.py)

  • Implements the TraceProvider.get_evaluation_data(session_id) interface
  • Queries CloudWatch Logs Insights using attributes.session.id to fetch OTEL log records
  • Supports two initialization modes: explicit log_group or automatic discovery via agent_name
  • Configurable lookback_days (default 30) and query_timeout_seconds (default 60)
  • Polls query results with exponential backoff (0.5s → 8s max)
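The polling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `poll_with_backoff`, `fetch_status`, and the injectable `sleep` parameter are hypothetical names, and the only details taken from the description are the 0.5s starting delay, the 8s cap, and the 60s default timeout. The terminal status strings are the ones CloudWatch Logs Insights reports for a query.

```python
import time
from typing import Callable

# Terminal statuses reported by CloudWatch Logs Insights for a query.
TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled", "Timeout"}


def poll_with_backoff(
    fetch_status: Callable[[], str],  # hypothetical: returns the current query status
    timeout_seconds: float = 60.0,
    initial_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> str:
    """Poll until the query reaches a terminal status or the time budget is spent.

    The delay doubles after every non-terminal poll and is capped at max_delay.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"query still {status!r} after {waited:.1f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)
```

In a real provider, `fetch_status` would presumably wrap a `get_query_results` call on a boto3 `logs` client and return its `status` field; that wiring is left out here to keep the sketch self-contained.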

CloudWatchSessionMapper (src/strands_evals/mappers/cloudwatch_session_mapper.py)

  • Converts raw CW Logs Insights JSON records into SessionTraceSpan hierarchy
  • Groups records by traceId, sorts by timeUnixNano
  • Handles the double-encoded content format used by AgentCore runtime logs (e.g., body.input.messages[].content.content contains JSON-encoded arrays)
  • Produces three span types from each record:
    • AgentInvocationSpan — user prompt, agent response, available tools
    • InferenceSpan — full message list (user + assistant messages)
    • ToolExecutionSpan — one per tool call, matched to tool results by toolCallId
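To make the grouping and double-decoding steps above concrete, here is a small sketch under stated assumptions. It is not the mapper from this PR: the helper names (`decode_content`, `group_by_trace`, `decoded_input_messages`) are hypothetical, and the record shape is inferred only from the field names mentioned in the description (`traceId`, `timeUnixNano`, `body.input.messages[].content.content`).

```python
import json
from collections import defaultdict
from typing import Any


def decode_content(value: Any) -> Any:
    """Decode a JSON-encoded list/dict carried as a string; otherwise return value unchanged."""
    if not isinstance(value, str):
        return value
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return value
    # Only unwrap structured payloads, so plain text such as "42" stays a string.
    return parsed if isinstance(parsed, (list, dict)) else value


def group_by_trace(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by traceId and sort each group by timeUnixNano (numeric order)."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        trace_id = record.get("traceId")
        if trace_id is None:
            continue  # assumption: records without a trace ID are skipped
        grouped[trace_id].append(record)
    return {
        trace_id: sorted(items, key=lambda r: int(r.get("timeUnixNano", 0)))
        for trace_id, items in grouped.items()
    }


def decoded_input_messages(record: dict) -> list[dict]:
    """Return the input messages of one record with the inner content decoded."""
    messages = record.get("body", {}).get("input", {}).get("messages", [])
    decoded = []
    for message in messages:
        content = message.get("content", {})
        inner = content.get("content") if isinstance(content, dict) else content
        decoded.append({"role": message.get("role"), "content": decode_content(inner)})
    return decoded
```

Sorting on `int(timeUnixNano)` rather than the raw value matters if the timestamps arrive as strings, since string comparison would place "20" before "3".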

Related Issues

#140

Documentation PR

Added src/strands_evals/providers/README.md with quick start, core API, error handling, and custom provider examples. Documentation on strandsagents.com will follow.

Type of Change

New feature

Testing

Unit tests (43 tests)

tests/strands_evals/providers/test_cloudwatch_provider.py — 25 tests covering:

  • Constructor: explicit log_group, agent_name discovery, region resolution (AWS_REGION vs AWS_DEFAULT_REGION vs explicit), error cases
  • CW Logs Insights polling: happy path, intermediate statuses, failed/timeout/empty, JSON parsing
  • get_evaluation_data: session lookup, multi-trace sessions, output extraction from last AgentInvocationSpan, error propagation

tests/strands_evals/mappers/test_cloudwatch_session_mapper.py — 11 tests covering:

  • Span conversion: inference spans, agent invocation spans, tool execution spans, tool call/result matching by ID
  • Session building: multi-record grouping by trace_id, empty/malformed record handling
  • Double-encoded content parsing

tests/strands_evals/providers/test_trace_provider.py — 7 tests (updated to remove TraceNotFoundError references)

Integration tests

tests_integ/test_cloudwatch_provider.py — 11 tests against real CloudWatch data (account 249746592913):

  • Session fetching, trace/span structure validation, span type verification
  • End-to-end: CloudWatch → OutputEvaluator, CoherenceEvaluator, HelpfulnessEvaluator pipeline

Manual verification

Ran against two live production sessions from the github_issue_handler agent:

  • github_issue_1764_20260225_162526_f306fd58 (10 spans, 1 trace) — Output: 1.0, Coherence: 1.0, Helpfulness: 0.833

  • github_issue_1760_20260224_201557_ce6edb42 (6 spans, 1 trace) — Output: 1.0, Coherence: 1.0, Helpfulness: 0.833

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Introduce an abstract TraceProvider base class for retrieving agent
trace data from observability backends for evaluation. This includes:

- TraceProvider ABC with get_session, list_sessions, and
  get_session_by_trace_id methods
- SessionFilter dataclass for filtering session discovery
- Custom error hierarchy (TraceProviderError, SessionNotFoundError,
  TraceNotFoundError, ProviderError)
- Session and Trace data types with span tree construction and
  convenience accessors (input/output messages, token usage, duration)
- New providers module exposed at package level
- Comprehensive unit tests for providers and trace types
Add abstract TraceProvider that retrieves agent trace data from
observability backends and returns Session/Trace types the evals
system already consumes.

- TraceProvider ABC with get_session() (required), list_sessions()
  and get_session_by_trace_id() (optional, raise NotImplementedError)
- SessionFilter dataclass for time-range and limit-based discovery
- Exception hierarchy: TraceProviderError base with
  SessionNotFoundError, TraceNotFoundError, ProviderError
- Export providers module from strands_evals package
Implement LangfuseProvider that fetches agent traces from Langfuse and
converts them to Session objects for the evals pipeline. Supports
session-level and trace-level retrieval with paginated API calls.

- get_evaluation_data(): fetch traces by session ID, convert Langfuse
  observations to typed spans (InferenceSpan, ToolExecutionSpan,
  AgentInvocationSpan), extract output from last agent invocation
- list_sessions(): paginated session discovery with time-range filtering
- get_evaluation_data_by_trace_id(): single trace retrieval
- Host resolution: explicit param > LANGFUSE_HOST env var > cloud default
- 30 unit tests (mocked SDK), 15 integration tests (real Langfuse + evaluators)
Add TraceProvider interface and implementations for fetching agent
execution data from observability backends (CloudWatch Logs and
Langfuse). This enables running evaluators against production/staging
traces without re-executing agents.

- Add CloudWatchProvider for Bedrock AgentCore runtime logs
- Add LangfuseProvider for Langfuse-hosted traces
- Add TraceProvider base interface with get_evaluation_data API
- Add comprehensive test suites for both providers
- Add providers README with usage documentation
- Fix minor whitespace issues in CoherenceEvaluator docstring
Add a new session mapper that converts CloudWatch Logs Insights OTEL
log records into typed Session objects for evaluation. The mapper parses
body.input/output messages, extracts tool calls/results, and builds
InferenceSpan, ToolExecutionSpan, and AgentInvocationSpan instances
grouped by traceId. Includes comprehensive unit tests.
- Export CloudWatchProvider via lazy-loading in providers __init__.py
- Add comprehensive docstring to CloudWatchProvider.__init__ with
  usage examples and parameter descriptions
- Extract shared CloudWatch test helpers into a reusable
  cloudwatch_helpers module to reduce duplication across test files
@poshinchen

Discussed offline, we'll test the latest semantic conventions as a follow-up after this change has been merged.

@afarntrog afarntrog deployed to auto-approve February 27, 2026 20:16 — with GitHub Actions Active
@afarntrog afarntrog merged commit 12a2b8d into strands-agents:main Feb 27, 2026
14 of 45 checks passed