Description
Problem Statement
The current TraceProvider exposes a single required method — get_evaluation_data(session_id) — which works well when a customer already knows which session to evaluate. However, there's no way to discover sessions or evaluate them in bulk through the framework. Customers who want to answer "how did my agent perform across all sessions this week?" have to write their own glue code to list sessions from their observability backend, iterate over them, build Case objects, wire up an Experiment, and run evaluations.
We want to extend TraceProvider with optional methods for session discovery and trace-level retrieval, and provide a batch evaluation function that composes these into a single call.
Proposed Changes to TraceProvider
Add the following optional (non-abstract) methods and supporting types:
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class SessionFilter:
    """Filter criteria for discovering sessions.

    Universal fields are defined here. Provider-specific parameters
    go in `additional_fields`.
    """

    start_time: datetime | None = None
    end_time: datetime | None = None
    limit: int | None = None
    additional_fields: dict[str, Any] = field(default_factory=dict)
```

```python
# New optional method on TraceProvider (Iterator is collections.abc.Iterator).
def list_sessions(
    self,
    session_filter: SessionFilter | None = None,
) -> Iterator[str]:
    """Discover session IDs matching filter criteria.

    Returns session IDs that can be fed to get_evaluation_data().
    Not abstract: providers override to enable session discovery.

    Args:
        session_filter: Optional filter. If None, provider-specific defaults apply.

    Yields:
        Session ID strings

    Raises:
        NotImplementedError: If the provider does not support session discovery
        ProviderError: If the provider is unreachable or returns an error
    """
    raise NotImplementedError(
        "This provider does not support session discovery. "
        "Use get_evaluation_data() with a known session_id instead."
    )
```

This method is not abstract: providers implement it only if their backend supports the capability. The default raises NotImplementedError with a helpful message.
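To make the override path concrete, here is a minimal sketch of a provider that opts in to session discovery. Everything in it is illustrative: `InMemoryProvider` is hypothetical, the `TraceProvider` and `SessionFilter` classes are stand-ins that mirror the proposal so the snippet runs on its own, and treating `end_time` as inclusive is an assumption the proposal does not specify.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class SessionFilter:
    """Stand-in mirroring the proposed SessionFilter."""

    start_time: datetime | None = None
    end_time: datetime | None = None
    limit: int | None = None
    additional_fields: dict[str, Any] = field(default_factory=dict)


class TraceProvider:
    """Stand-in base class; only the proposed default method is shown."""

    def list_sessions(self, session_filter: SessionFilter | None = None) -> Iterator[str]:
        raise NotImplementedError("This provider does not support session discovery.")


class InMemoryProvider(TraceProvider):
    """Hypothetical provider backed by a dict of session_id -> start time."""

    def __init__(self, sessions: dict[str, datetime]) -> None:
        self._sessions = sessions

    def list_sessions(self, session_filter: SessionFilter | None = None) -> Iterator[str]:
        flt = session_filter or SessionFilter()
        emitted = 0
        # Walk sessions oldest first so that `limit` is deterministic.
        for session_id, started in sorted(self._sessions.items(), key=lambda kv: kv[1]):
            if flt.start_time is not None and started < flt.start_time:
                continue
            # Assumption: end_time is treated as an inclusive upper bound.
            if flt.end_time is not None and started > flt.end_time:
                continue
            if flt.limit is not None and emitted >= flt.limit:
                return
            emitted += 1
            yield session_id


provider = InMemoryProvider(
    {
        "s1": datetime(2024, 1, 1),
        "s2": datetime(2024, 1, 2),
        "s3": datetime(2024, 1, 3),
    }
)
recent = list(provider.list_sessions(SessionFilter(start_time=datetime(2024, 1, 2))))
print(recent)
```

Because the override is a generator, backend errors would surface when the caller iterates rather than when `list_sessions()` is called, whereas the base-class default raises immediately.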
Batch Evaluation Function
A new evaluate_sessions function that composes list_sessions + get_evaluation_data + Experiment:
```python
# src/strands_evals/batch.py
def evaluate_sessions(
    provider: TraceProvider,
    evaluators: list[Evaluator],
    session_filter: SessionFilter | None = None,
) -> list[EvaluationReport]:
    """Discover sessions from a provider and evaluate them all."""
    cases = []
    for session_id in provider.list_sessions(session_filter):
        cases.append(
            Case(
                name=session_id,
                input=session_id,
                expected_output=None,
            )
        )

    experiment = Experiment(cases=cases, evaluators=evaluators)

    def task(case: Case) -> dict:
        return provider.get_evaluation_data(case.input)

    return experiment.run_evaluations(task)
```

Use Case
A customer using Langfuse to trace their Strands agent wants to run a nightly quality check:
```python
from datetime import datetime, timedelta

from strands_evals.batch import evaluate_sessions
from strands_evals.evaluators.coherence_evaluator import CoherenceEvaluator
from strands_evals.providers.langfuse_provider import LangfuseProvider
from strands_evals.providers.trace_provider import SessionFilter

provider = LangfuseProvider(host="https://langfuse.example.com")

reports = evaluate_sessions(
    provider=provider,
    evaluators=[CoherenceEvaluator()],
    session_filter=SessionFilter(
        start_time=datetime.now() - timedelta(hours=24),
    ),
)

print(f"Evaluated {len(reports[0].cases)} sessions")
print(f"Average coherence: {reports[0].overall_score:.2f}")
```

Without this, the customer has to manually call list_sessions(), iterate, build Cases, create an Experiment, define a task function, and call run_evaluations. That is roughly 15-20 lines of boilerplate that's identical for every provider.
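One practical wrinkle for a nightly job like this: a provider that keeps the default list_sessions() will raise NotImplementedError. The sketch below shows one way a caller might fall back to a known list of session IDs in that case. `resolve_session_ids` and both provider classes are hypothetical stand-ins written for illustration; they are not part of the proposal.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


class KnownIdsOnlyProvider:
    """Stand-in for a provider that keeps the proposed default list_sessions()."""

    def list_sessions(self, session_filter=None) -> Iterator[str]:
        raise NotImplementedError("This provider does not support session discovery.")


class DiscoveringProvider:
    """Stand-in for a provider that overrides list_sessions()."""

    def list_sessions(self, session_filter=None) -> Iterator[str]:
        yield from ("session-a", "session-b")


def resolve_session_ids(
    provider,
    session_filter=None,
    fallback_ids: Iterable[str] = (),
) -> list[str]:
    """Hypothetical helper: use discovery when available, else caller-supplied IDs."""
    try:
        # list() drains generator-based overrides inside the try block,
        # so lazily raised errors are caught here as well.
        return list(provider.list_sessions(session_filter))
    except NotImplementedError:
        return list(fallback_ids)


discovered = resolve_session_ids(DiscoveringProvider())
fallback = resolve_session_ids(KnownIdsOnlyProvider(), fallback_ids=["session-x"])
print(discovered, fallback)
```

Whether evaluate_sessions itself should offer such a fallback, or simply let the NotImplementedError propagate as the proposed code does, is left open here.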
Alternative Solutions
- Make `list_sessions` abstract: Forces every provider to implement it even if their backend doesn't support session listing (e.g., a provider that only accepts known session IDs from an external source). Rejected because it raises the implementation bar unnecessarily.
- Separate ABCs (e.g., `SessionDiscoverable`, `TraceLevelRetrievable` mixins): More type-safe but heavier. With a small number of providers, the complexity isn't justified. Could revisit if the provider ecosystem grows.
- Put batch logic on `Experiment` (e.g., `Experiment.from_provider(provider, filter)`): Couples `Experiment` to `TraceProvider`. A free function keeps them composable without coupling.
- Put batch logic on `TraceProvider`: Would require `TraceProvider` to know about `Experiment` and `Evaluator`, creating a circular dependency. Rejected.
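For comparison, the rejected "separate ABCs" alternative might look roughly like the sketch below. Only the name `SessionDiscoverable` comes from the list above; the method signatures, the stand-in `TraceProvider`, the two provider classes, and `can_discover` are assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import Iterator


class TraceProvider(ABC):
    """Stand-in base class exposing only the required method."""

    @abstractmethod
    def get_evaluation_data(self, session_id: str) -> dict: ...


class SessionDiscoverable(ABC):
    """Mixin: providers inherit it only if they can list sessions."""

    @abstractmethod
    def list_sessions(self, session_filter=None) -> Iterator[str]: ...


class KnownIdsOnlyProvider(TraceProvider):
    """Hypothetical provider without discovery support."""

    def get_evaluation_data(self, session_id: str) -> dict:
        return {"session_id": session_id}


class ListingProvider(TraceProvider, SessionDiscoverable):
    """Hypothetical provider that opts in to discovery via the mixin."""

    def get_evaluation_data(self, session_id: str) -> dict:
        return {"session_id": session_id}

    def list_sessions(self, session_filter=None) -> Iterator[str]:
        yield from ("session-a", "session-b")


def can_discover(provider: TraceProvider) -> bool:
    # The capability is visible to isinstance() and to static type checkers.
    return isinstance(provider, SessionDiscoverable)


print(can_discover(ListingProvider()), can_discover(KnownIdsOnlyProvider()))
```

The type-safety gain is that callers can check the capability up front instead of catching NotImplementedError. The cost is one extra class per optional capability, which is the weight the proposal chooses to avoid for now.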