feat: add CloudWatchProvider to pull remote cloudwatch traces and run evals against them. by afarntrog · Pull Request #147 · strands-agents/evals

afarntrog · 2026-02-26T15:19:17Z

Description

Add CloudWatchProvider — a TraceProvider implementation that fetches agent execution traces from AWS CloudWatch Logs (Bedrock AgentCore runtime logs) and converts them into the evaluation pipeline's Session format.

This enables running evaluators (OutputEvaluator, CoherenceEvaluator, HelpfulnessEvaluator, etc.) against production agent traces without re-executing the agent. For example:

from strands_evals import Case, Experiment
from strands_evals.evaluators import CoherenceEvaluator, OutputEvaluator
from strands_evals.providers import CloudWatchProvider

# Point the provider at the AgentCore runtime log group that holds the traces.
provider = CloudWatchProvider(
    log_group="/aws/bedrock-agentcore/runtimes/my-agent-abc123-DEFAULT",
    region="us-east-1",
)

# The task fetches recorded trace data instead of re-running the agent;
# case.input carries the session ID to look up.
def task(case: Case) -> dict:
    return provider.get_evaluation_data(case.input)

cases = [Case(name="session_1", input="my-session-id", expected_output="any")]
experiment = Experiment(cases=cases, evaluators=[OutputEvaluator(...), CoherenceEvaluator()])
reports = experiment.run_evaluations(task)

What's included

CloudWatchProvider (src/strands_evals/providers/cloudwatch_provider.py)

Implements the TraceProvider.get_evaluation_data(session_id) interface
Queries CloudWatch Logs Insights using attributes.session.id to fetch OTEL log records
Supports two initialization modes: explicit log_group or automatic discovery via agent_name
Configurable lookback_days (default 30) and query_timeout_seconds (default 60)
Polls query results with exponential backoff (0.5s → 8s max)
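
The polling behaviour listed above can be sketched as follows. This is a minimal illustration, not the provider's actual code: it assumes `client` is a boto3 CloudWatch Logs client (or any object with a compatible `get_query_results` method), and the function name, the `sleep`/`clock` parameters, and the exception types are hypothetical choices made so the loop can be exercised without AWS access.

```python
import time
from typing import Any, Callable

# Statuses after which a Logs Insights query will never produce results.
TERMINAL_FAILURES = {"Failed", "Cancelled", "Timeout"}


def poll_query_results(
    client: Any,
    query_id: str,
    timeout_seconds: float = 60.0,
    initial_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list:
    """Poll get_query_results until the query completes, doubling the wait each round."""
    deadline = clock() + timeout_seconds
    delay = initial_delay
    while True:
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return response.get("results", [])
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"query {query_id} ended with status {status}")
        # Any other status (e.g. Scheduled, Running) means the query is still in flight.
        if clock() + delay > deadline:
            raise TimeoutError(f"query {query_id} did not complete in {timeout_seconds}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # 0.5, 1, 2, 4, 8, 8, ...
```

Capping the delay keeps long-running queries from being checked too rarely, while the short initial delay keeps fast queries responsive.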

CloudWatchSessionMapper (src/strands_evals/mappers/cloudwatch_session_mapper.py)

Converts raw CW Logs Insights JSON records into Session → Trace → Span hierarchy
Groups records by traceId, sorts by timeUnixNano
Handles the double-encoded content format used by AgentCore runtime logs (e.g., body.input.messages[].content.content contains JSON-encoded arrays)
Produces three span types from each record:
- AgentInvocationSpan — user prompt, agent response, available tools
- InferenceSpan — full message list (user + assistant messages)
- ToolExecutionSpan — one per tool call, matched to tool results by toolCallId
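
The grouping and decoding steps above might look roughly like this. It is a sketch under stated assumptions: records are plain dicts carrying `traceId` and `timeUnixNano` keys, and the helper names `decode_content` and `group_by_trace` are hypothetical, not taken from the mapper's source.

```python
import json
from collections import defaultdict
from typing import Any


def decode_content(content: Any) -> Any:
    """Decode a JSON-encoded list or object; return anything else unchanged."""
    if isinstance(content, str):
        try:
            decoded = json.loads(content)
        except json.JSONDecodeError:
            return content  # plain text, not double-encoded
        # Only treat structured payloads as double-encoded, so "42" stays a string.
        return decoded if isinstance(decoded, (list, dict)) else content
    return content


def group_by_trace(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by traceId, with each group ordered by timeUnixNano."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        trace_id = record.get("traceId")
        if trace_id:  # records without a trace ID cannot be placed in a trace
            groups[trace_id].append(record)
    for group in groups.values():
        # Timestamps may arrive as strings, so compare them numerically.
        group.sort(key=lambda r: int(r.get("timeUnixNano", 0)))
    return dict(groups)
```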

Related Issues

#140

Documentation PR

Added src/strands_evals/providers/README.md with a quick start, core API, error handling, and custom provider examples. Documentation on strandsagents.com will follow.

Type of Change

New feature

Testing

Unit tests (43 tests)

tests/strands_evals/providers/test_cloudwatch_provider.py — 25 tests covering:

Constructor: explicit log_group, agent_name discovery, region resolution (AWS_REGION vs AWS_DEFAULT_REGION vs explicit), error cases
CW Logs Insights polling: happy path, intermediate statuses, failed/timeout/empty, JSON parsing
get_evaluation_data: session lookup, multi-trace sessions, output extraction from last AgentInvocationSpan, error propagation

tests/strands_evals/mappers/test_cloudwatch_session_mapper.py — 11 tests covering:

Span conversion: inference spans, agent invocation spans, tool execution spans, tool call/result matching by ID
Session building: multi-record grouping by trace_id, empty/malformed record handling
Double-encoded content parsing
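
The tool call/result matching exercised by these tests can be illustrated with a small sketch. The dict shapes and the `match_tool_results` helper are assumptions for illustration; only the `toolCallId` key comes from the description above.

```python
from typing import Optional


def match_tool_results(
    calls: list[dict], results: list[dict]
) -> list[tuple[dict, Optional[dict]]]:
    """Pair each tool call with the result sharing its toolCallId, or None if absent."""
    results_by_id = {r["toolCallId"]: r for r in results if "toolCallId" in r}
    return [(call, results_by_id.get(call.get("toolCallId"))) for call in calls]
```

Indexing results by ID first keeps the pairing independent of the order in which calls and results were logged.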

tests/strands_evals/providers/test_trace_provider.py — 7 tests (updated to remove TraceNotFoundError references)

Integration tests

tests_integ/test_cloudwatch_provider.py — 11 tests against real CloudWatch data (account 249746592913):

Session fetching, trace/span structure validation, span type verification
End-to-end: CloudWatch → OutputEvaluator, CoherenceEvaluator, HelpfulnessEvaluator pipeline

Manual verification

Ran against two live production sessions from the github_issue_handler agent:

github_issue_1764_20260225_162526_f306fd58 (10 spans, 1 trace) — Output: 1.0, Coherence: 1.0, Helpfulness: 0.833
github_issue_1760_20260224_201557_ce6edb42 (6 spans, 1 trace) — Output: 1.0, Coherence: 1.0, Helpfulness: 0.833
I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Introduce an abstract TraceProvider base class for retrieving agent trace data from observability backends for evaluation. This includes:
- TraceProvider ABC with get_session, list_sessions, and get_session_by_trace_id methods
- SessionFilter dataclass for filtering session discovery
- Custom error hierarchy (TraceProviderError, SessionNotFoundError, TraceNotFoundError, ProviderError)
- Session and Trace data types with span tree construction and convenience accessors (input/output messages, token usage, duration)
- New providers module exposed at package level
- Comprehensive unit tests for providers and trace types

Add abstract TraceProvider that retrieves agent trace data from observability backends and returns Session/Trace types the evals system already consumes.
- TraceProvider ABC with get_session() (required), list_sessions() and get_session_by_trace_id() (optional, raise NotImplementedError)
- SessionFilter dataclass for time-range and limit-based discovery
- Exception hierarchy: TraceProviderError base with SessionNotFoundError, TraceNotFoundError, ProviderError
- Export providers module from strands_evals package
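
The interface described in this commit could look roughly like the sketch below. Only the class, method, and exception names come from the commit message; the method signatures, the SessionFilter field names (`start_time`, `end_time`, `limit`), and the toy `InMemoryProvider` are assumptions added for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


class TraceProviderError(Exception):
    """Base class for all provider errors."""


class SessionNotFoundError(TraceProviderError):
    """No session exists for the requested ID."""


class TraceNotFoundError(TraceProviderError):
    """No trace exists for the requested ID."""


class ProviderError(TraceProviderError):
    """The backend failed or returned an unusable response."""


@dataclass
class SessionFilter:
    # Field names are hypothetical; the commit only mentions time-range and limit filtering.
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    limit: Optional[int] = None


class TraceProvider(ABC):
    @abstractmethod
    def get_session(self, session_id: str) -> Any:
        """Required: return the session for session_id or raise SessionNotFoundError."""

    def list_sessions(self, session_filter: Optional[SessionFilter] = None) -> list:
        raise NotImplementedError("this provider does not support session discovery")

    def get_session_by_trace_id(self, trace_id: str) -> Any:
        raise NotImplementedError("this provider does not support trace lookup")


class InMemoryProvider(TraceProvider):
    """Toy provider backed by a dict, used only to exercise the interface."""

    def __init__(self, sessions: dict):
        self._sessions = sessions

    def get_session(self, session_id: str) -> Any:
        try:
            return self._sessions[session_id]
        except KeyError:
            raise SessionNotFoundError(session_id) from None
```

A shared base exception lets callers catch every provider failure with one `except TraceProviderError` clause while still distinguishing a missing session from a backend error.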

Implement LangfuseProvider that fetches agent traces from Langfuse and converts them to Session objects for the evals pipeline. Supports session-level and trace-level retrieval with paginated API calls.
- get_evaluation_data(): fetch traces by session ID, convert Langfuse observations to typed spans (InferenceSpan, ToolExecutionSpan, AgentInvocationSpan), extract output from last agent invocation
- list_sessions(): paginated session discovery with time-range filtering
- get_evaluation_data_by_trace_id(): single trace retrieval
- Host resolution: explicit param > LANGFUSE_HOST env var > cloud default
- 30 unit tests (mocked SDK), 15 integration tests (real Langfuse + evaluators)
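
The host resolution order stated in this commit can be expressed in a few lines. This is a sketch, not the provider's code: the `resolve_host` name and the `env` parameter are hypothetical, and the default URL is an assumption about what "cloud default" refers to.

```python
import os
from typing import Mapping, Optional

# Assumed value of the "cloud default"; the commit message does not spell it out.
DEFAULT_LANGFUSE_HOST = "https://cloud.langfuse.com"


def resolve_host(host: Optional[str] = None, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the host: explicit param, then LANGFUSE_HOST, then the cloud default."""
    env = os.environ if env is None else env
    return host or env.get("LANGFUSE_HOST") or DEFAULT_LANGFUSE_HOST
```

Passing `env` explicitly makes the precedence easy to unit test without mutating the process environment.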

Add TraceProvider interface and implementations for fetching agent execution data from observability backends (CloudWatch Logs and Langfuse). This enables running evaluators against production/staging traces without re-executing agents.
- Add CloudWatchProvider for Bedrock AgentCore runtime logs
- Add LangfuseProvider for Langfuse-hosted traces
- Add TraceProvider base interface with get_evaluation_data API
- Add comprehensive test suites for both providers
- Add providers README with usage documentation
- Fix minor whitespace issues in CoherenceEvaluator docstring

Add a new session mapper that converts CloudWatch Logs Insights OTEL log records into typed Session objects for evaluation. The mapper parses body.input/output messages, extracts tool calls/results, and builds InferenceSpan, ToolExecutionSpan, and AgentInvocationSpan instances grouped by traceId. Includes comprehensive unit tests.

- Export CloudWatchProvider via lazy-loading in providers __init__.py
- Add comprehensive docstring to CloudWatchProvider.__init__ with usage examples and parameter descriptions
- Extract shared CloudWatch test helpers into a reusable cloudwatch_helpers module to reduce duplication across test files
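
Lazy-loading exports of this kind are commonly built on a module-level `__getattr__` (PEP 562). The sketch below shows one way it could be done; `make_lazy_getattr` is a hypothetical helper, and whether the package uses this exact pattern is an assumption.

```python
import importlib
from typing import Any, Callable


def make_lazy_getattr(module_name: str, exports: dict[str, str]) -> Callable[[str], Any]:
    """Build a module-level __getattr__ that imports an export's module only on access."""

    def __getattr__(name: str) -> Any:
        target = exports.get(name)
        if target is None:
            raise AttributeError(f"module {module_name!r} has no attribute {name!r}")
        # import_module returns the cached module on repeated calls.
        module = importlib.import_module(target)
        return getattr(module, name)

    return __getattr__


# In a package's __init__.py this might be wired up as:
# __getattr__ = make_lazy_getattr(__name__, {
#     "CloudWatchProvider": "strands_evals.providers.cloudwatch_provider",
# })
```

Deferring the import means users who never touch a given provider do not pay for, or need to install, that provider's dependencies.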

poshinchen · 2026-02-27T20:09:08Z

Discussed offline: we'll test the latest semantic conventions as a follow-up once this change is merged.

afarntrog added 9 commits February 9, 2026 14:18

Merge branch 'main' into trace_provider

47261a6

Merge branch 'main' into trace_provider_cloudwatch

d1e500a

afarntrog had a problem deploying to auto-approve February 26, 2026 15:19 — with GitHub Actions Failure

afarntrog had a problem deploying to auto-approve February 26, 2026 16:47 — with GitHub Actions Failure

afarntrog requested a review from poshinchen February 27, 2026 18:05

poshinchen reviewed Feb 27, 2026

src/strands_evals/mappers/cloudwatch_session_mapper.py (resolved)

poshinchen reviewed Feb 27, 2026

src/strands_evals/providers/cloudwatch_provider.py (resolved)

poshinchen approved these changes Feb 27, 2026

afarntrog deployed to auto-approve February 27, 2026 20:16 — with GitHub Actions Active

afarntrog merged commit 12a2b8d into strands-agents:main Feb 27, 2026
14 of 45 checks passed

afarntrog merged 10 commits into strands-agents:main from afarntrog:trace_provider_cloudwatch

2 participants
