feat: add CloudWatchProvider to pull remote CloudWatch traces and run evals against them. #147
Merged
afarntrog merged 10 commits into strands-agents:main on Feb 27, 2026
Introduce an abstract TraceProvider base class for retrieving agent trace data from observability backends for evaluation. This includes:
- TraceProvider ABC with get_session, list_sessions, and get_session_by_trace_id methods
- SessionFilter dataclass for filtering session discovery
- Custom error hierarchy (TraceProviderError, SessionNotFoundError, TraceNotFoundError, ProviderError)
- Session and Trace data types with span tree construction and convenience accessors (input/output messages, token usage, duration)
- New providers module exposed at package level
- Comprehensive unit tests for providers and trace types
Add abstract TraceProvider that retrieves agent trace data from observability backends and returns Session/Trace types the evals system already consumes.
- TraceProvider ABC with get_session() (required), list_sessions() and get_session_by_trace_id() (optional, raise NotImplementedError)
- SessionFilter dataclass for time-range and limit-based discovery
- Exception hierarchy: TraceProviderError base with SessionNotFoundError, TraceNotFoundError, ProviderError
- Export providers module from strands_evals package
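The interface described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the `SessionFilter` field names, the return types, and the toy `InMemoryProvider` are assumptions made for the example. Later commits in this PR refer to the required method as `get_evaluation_data`, so the merged method names may differ from this commit's.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


class TraceProviderError(Exception):
    """Base class for all trace provider errors."""


class SessionNotFoundError(TraceProviderError):
    """No data exists for the requested session ID."""


class TraceNotFoundError(TraceProviderError):
    """No data exists for the requested trace ID."""


class ProviderError(TraceProviderError):
    """The backend failed or could not be reached."""


@dataclass
class SessionFilter:
    # Field names are assumptions; the commit only says
    # "time-range and limit-based discovery".
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    limit: Optional[int] = None


class TraceProvider(ABC):
    @abstractmethod
    def get_session(self, session_id: str) -> Any:
        """Return the session for session_id (required for every provider)."""

    def list_sessions(self, session_filter: Optional[SessionFilter] = None) -> list:
        # Optional capability: providers override this if they support discovery.
        raise NotImplementedError("session discovery is not supported")

    def get_session_by_trace_id(self, trace_id: str) -> Any:
        # Optional capability: providers override this if they support trace lookup.
        raise NotImplementedError("trace lookup is not supported")


class InMemoryProvider(TraceProvider):
    """Hypothetical provider backed by a dict, used only to exercise the interface."""

    def __init__(self, sessions: dict) -> None:
        self._sessions = sessions

    def get_session(self, session_id: str) -> Any:
        try:
            return self._sessions[session_id]
        except KeyError:
            raise SessionNotFoundError(session_id) from None
```

Keeping the optional methods concrete and raising `NotImplementedError` means a backend only has to implement the single required lookup, while callers can still catch one base exception type for every provider failure.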
Implement LangfuseProvider that fetches agent traces from Langfuse and converts them to Session objects for the evals pipeline. Supports session-level and trace-level retrieval with paginated API calls.
- get_evaluation_data(): fetch traces by session ID, convert Langfuse observations to typed spans (InferenceSpan, ToolExecutionSpan, AgentInvocationSpan), extract output from last agent invocation
- list_sessions(): paginated session discovery with time-range filtering
- get_evaluation_data_by_trace_id(): single trace retrieval
- Host resolution: explicit param > LANGFUSE_HOST env var > cloud default
- 30 unit tests (mocked SDK), 15 integration tests (real Langfuse + evaluators)
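The host resolution order in this commit (explicit parameter, then `LANGFUSE_HOST`, then the cloud default) could look like the sketch below. The function name `resolve_host` and the constant are hypothetical, and the default URL is an assumption about what "cloud default" means here.

```python
import os
from typing import Optional

# Assumed cloud default; the value used by the PR is not shown in this page.
DEFAULT_LANGFUSE_HOST = "https://cloud.langfuse.com"


def resolve_host(host: Optional[str] = None) -> str:
    """Pick the Langfuse host: explicit argument > LANGFUSE_HOST > default."""
    if host:
        return host
    # An unset or empty environment variable falls through to the default.
    return os.environ.get("LANGFUSE_HOST") or DEFAULT_LANGFUSE_HOST
```

Resolving the host in one place keeps the precedence easy to test without touching the Langfuse SDK.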
Add TraceProvider interface and implementations for fetching agent execution data from observability backends (CloudWatch Logs and Langfuse). This enables running evaluators against production/staging traces without re-executing agents.
- Add CloudWatchProvider for Bedrock AgentCore runtime logs
- Add LangfuseProvider for Langfuse-hosted traces
- Add TraceProvider base interface with get_evaluation_data API
- Add comprehensive test suites for both providers
- Add providers README with usage documentation
- Fix minor whitespace issues in CoherenceEvaluator docstring
Add a new session mapper that converts CloudWatch Logs Insights OTEL log records into typed Session objects for evaluation. The mapper parses body.input/output messages, extracts tool calls/results, and builds InferenceSpan, ToolExecutionSpan, and AgentInvocationSpan instances grouped by traceId. Includes comprehensive unit tests.
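The grouping and parsing steps described for the mapper can be illustrated with a small sketch. It is not the mapper's actual code: the record shape (flat dicts with `traceId`, `timeUnixNano`, and `toolCallId` keys, names taken from the PR description) and all three helper functions are assumptions made for this example.

```python
import json
from collections import defaultdict
from typing import Any, Optional


def decode_content(content: Any) -> Any:
    """Decode a JSON-encoded list or object; return anything else unchanged."""
    if isinstance(content, str):
        try:
            decoded = json.loads(content)
        except json.JSONDecodeError:
            return content
        # Only accept structured payloads, so plain text such as "42" stays a string.
        if isinstance(decoded, (list, dict)):
            return decoded
    return content


def group_by_trace(records: list) -> dict:
    """Group log records by traceId, ordering each group by timeUnixNano."""
    groups = defaultdict(list)
    for record in records:
        groups[record["traceId"]].append(record)
    return {
        trace_id: sorted(items, key=lambda r: int(r["timeUnixNano"]))
        for trace_id, items in groups.items()
    }


def match_tool_results(calls: list, results: list) -> list:
    """Pair each tool call with the result sharing its toolCallId (None if missing)."""
    by_id = {result["toolCallId"]: result for result in results}
    paired: list = []
    for call in calls:
        result: Optional[dict] = by_id.get(call["toolCallId"])
        paired.append((call, result))
    return paired
```

Converting `timeUnixNano` with `int()` lets the sort work whether the backend returns the timestamp as a number or as a string.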
- Export CloudWatchProvider via lazy-loading in providers __init__.py
- Add comprehensive docstring to CloudWatchProvider.__init__ with usage examples and parameter descriptions
- Extract shared CloudWatch test helpers into a reusable cloudwatch_helpers module to reduce duplication across test files
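One common way to lazy-load an export from a package's `__init__.py` is a module-level `__getattr__` (PEP 562). The sketch below shows that technique in general, and the PR may implement it differently. The mapping, the `lazy_getattr` name, and the in-memory `demo_providers` module are hypothetical, and standard-library classes stand in for provider classes so the example runs without extra dependencies.

```python
import importlib
import sys
import types

# Hypothetical mapping: exported name -> module that defines it.
# Standard-library targets stand in for real provider modules.
_LAZY_EXPORTS = {"OrderedDict": "collections", "JSONDecoder": "json"}


def lazy_getattr(name: str):
    """Body of a module-level __getattr__: import the target on first access."""
    module_name = _LAZY_EXPORTS.get(name)
    if module_name is None:
        raise AttributeError(f"module has no attribute {name!r}")
    return getattr(importlib.import_module(module_name), name)


# Build a demo module in memory. In a real package this function would simply
# be defined as __getattr__ inside __init__.py.
demo = types.ModuleType("demo_providers")
demo.__getattr__ = lazy_getattr
sys.modules["demo_providers"] = demo
```

With this in place, `from demo_providers import JSONDecoder` triggers the import only when the name is first requested, which is presumably why a provider with optional dependencies would be exported this way.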
poshinchen reviewed Feb 27, 2026
poshinchen approved these changes Feb 27, 2026
Discussed offline; we'll test the latest semantic conventions as a follow-up after this change has been merged.
Description
Add `CloudWatchProvider`, a `TraceProvider` implementation that fetches agent execution traces from AWS CloudWatch Logs (Bedrock AgentCore runtime logs) and converts them into the evaluation pipeline's `Session` format.

This enables running evaluators (OutputEvaluator, CoherenceEvaluator, HelpfulnessEvaluator, etc.) against production agent traces without re-executing the agent.
What's included
`CloudWatchProvider` (`src/strands_evals/providers/cloudwatch_provider.py`)
- Implements the `TraceProvider.get_evaluation_data(session_id)` interface
- Uses `attributes.session.id` to fetch OTEL log records
- Accepts an explicit `log_group` or automatic discovery via `agent_name`
- Configurable `lookback_days` (default 30) and `query_timeout_seconds` (default 60)

`CloudWatchSessionMapper` (`src/strands_evals/mappers/cloudwatch_session_mapper.py`)
- Builds the `Session` → `Trace` → `Span` hierarchy
- Groups records by `traceId`, sorts by `timeUnixNano`
- Parses message content (`body.input.messages[].content.content` contains JSON-encoded arrays)
- Builds three span types:
  - `AgentInvocationSpan`: user prompt, agent response, available tools
  - `InferenceSpan`: full message list (user + assistant messages)
  - `ToolExecutionSpan`: one per tool call, matched to tool results by `toolCallId`

Related Issues
#140
Documentation PR
Added `src/strands_evals/providers/README.md` with quick start, core API, error handling, and custom provider examples. Documentation on strandsagents.com will come next.
Type of Change
New feature
Testing
Unit tests (43 tests)
- `tests/strands_evals/providers/test_cloudwatch_provider.py` (25 tests), covering:
  - `get_evaluation_data`: session lookup, multi-trace sessions, output extraction from last `AgentInvocationSpan`, error propagation
- `tests/strands_evals/mappers/test_cloudwatch_session_mapper.py` (11 tests)
- `tests/strands_evals/providers/test_trace_provider.py` (7 tests, updated to remove `TraceNotFoundError` references)

Integration tests
- `tests_integ/test_cloudwatch_provider.py` (11 tests against real CloudWatch data, account 249746592913)

Manual verification
Ran against two live production sessions from the `github_issue_handler` agent:
- `github_issue_1764_20260225_162526_f306fd58` (10 spans, 1 trace): Output 1.0, Coherence 1.0, Helpfulness 0.833
- `github_issue_1760_20260224_201557_ce6edb42` (6 spans, 1 trace): Output 1.0, Coherence 1.0, Helpfulness 0.833

I ran `hatch run prepare`
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.