
[FEATURE] Extend TraceProvider #143

@afarntrog

Description


Problem Statement

The current TraceProvider exposes a single required method — get_evaluation_data(session_id) — which works well when a customer already knows which session to evaluate. However, there's no way to discover sessions or evaluate them in bulk through the framework. Customers who want to answer "how did my agent perform across all sessions this week?" have to write their own glue code to list sessions from their observability backend, iterate over them, build Case objects, wire up an Experiment, and run evaluations.

We want to extend TraceProvider with optional methods for session discovery and trace-level retrieval, and provide a batch evaluation function that composes these into a single call.

Proposed Changes to TraceProvider

Add the following optional (non-abstract) methods and supporting types:

from collections.abc import Iterator
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class SessionFilter:
    """Filter criteria for discovering sessions.

    Universal fields are defined here. Provider-specific parameters
    go in `additional_fields`.
    """

    start_time: datetime | None = None
    end_time: datetime | None = None
    limit: int | None = None
    additional_fields: dict[str, Any] = field(default_factory=dict)

# New (non-abstract) method on TraceProvider:
def list_sessions(
    self,
    session_filter: SessionFilter | None = None,
) -> Iterator[str]:
    """Discover session IDs matching filter criteria.

    Returns session IDs that can be fed to get_evaluation_data().
    Not abstract — providers override to enable session discovery.

    Args:
        session_filter: Optional filter. If None, provider-specific defaults apply.

    Yields:
        Session ID strings

    Raises:
        NotImplementedError: If the provider does not support session discovery
        ProviderError: If the provider is unreachable or returns an error
    """
    raise NotImplementedError(
        "This provider does not support session discovery. "
        "Use get_evaluation_data() with a known session_id instead."
    )

This method is not abstract — providers only implement it if their backend supports the capability. The default raises NotImplementedError with a helpful message.
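To illustrate the override path, here is a toy provider backed by an in-memory dict. `InMemoryProvider` is hypothetical and elides the `TraceProvider` base class so the sketch stays self-contained; a real provider would subclass `TraceProvider` and query its observability backend.

```python
from collections.abc import Iterator


class InMemoryProvider:
    """Toy provider backed by a dict of session_id -> evaluation data."""

    def __init__(self, sessions: dict[str, dict]):
        self._sessions = sessions

    def list_sessions(self, session_filter=None) -> Iterator[str]:
        # Honor only the limit field for brevity; a real provider would
        # also filter on start_time/end_time and additional_fields.
        limit = session_filter.limit if session_filter is not None else None
        for i, session_id in enumerate(self._sessions):
            if limit is not None and i >= limit:
                return
            yield session_id

    def get_evaluation_data(self, session_id: str) -> dict:
        return self._sessions[session_id]
```

Because the default implementation raises NotImplementedError, callers can probe for the capability with a try/except rather than an isinstance check.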

Batch Evaluation Function

A new evaluate_sessions function that composes list_sessions + get_evaluation_data + Experiment:

# src/strands_evals/batch.py

def evaluate_sessions(
    provider: TraceProvider,
    evaluators: list[Evaluator],
    session_filter: SessionFilter | None = None,
) -> list[EvaluationReport]:
    """Discover sessions from a provider and evaluate them all."""
    cases = []
    for session_id in provider.list_sessions(session_filter):
        cases.append(
            Case(
                name=session_id,
                input=session_id,
                expected_output=None,
            )
        )

    experiment = Experiment(cases=cases, evaluators=evaluators)

    def task(case: Case) -> dict:
        return provider.get_evaluation_data(case.input)

    return experiment.run_evaluations(task)

Use Case

A customer using Langfuse to trace their Strands agent wants to run a nightly quality check:

from strands_evals.batch import evaluate_sessions
from strands_evals.evaluators.coherence_evaluator import CoherenceEvaluator
from strands_evals.providers.langfuse_provider import LangfuseProvider
from strands_evals.providers.trace_provider import SessionFilter
from datetime import datetime, timedelta

provider = LangfuseProvider(host="https://langfuse.example.com")

reports = evaluate_sessions(
    provider=provider,
    evaluators=[CoherenceEvaluator()],
    session_filter=SessionFilter(
        start_time=datetime.now() - timedelta(hours=24),
    ),
)

print(f"Evaluated {len(reports[0].cases)} sessions")
print(f"Average coherence: {reports[0].overall_score:.2f}")

Without this, the customer has to manually call list_sessions(), iterate, build Cases, create an Experiment, define a task function, and call run_evaluations — roughly 15-20 lines of boilerplate that's identical for every provider.

Alternative Solutions

  1. Make list_sessions abstract — Forces every provider to implement it even if their backend doesn't support session listing (e.g., a provider that only accepts known session IDs from an external source). Rejected because it raises the implementation bar unnecessarily.

  2. Separate ABCs (e.g., SessionDiscoverable, TraceLevelRetrievable mixins) — More type-safe but heavier. With a small number of providers, the complexity isn't justified. Could revisit if the provider ecosystem grows.

  3. Put batch logic on Experiment (e.g., Experiment.from_provider(provider, filter)) — Couples Experiment to TraceProvider. A free function keeps them composable without coupling.

  4. Put batch logic on TraceProvider — Would require TraceProvider to know about Experiment and Evaluator, creating a circular dependency. Rejected.
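For comparison, the capability-mixin approach rejected in alternative 2 would look roughly like the sketch below. The names `SessionDiscoverable` and `discover` are illustrative, not part of the proposal.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator


class SessionDiscoverable(ABC):
    """Sketch of the rejected capability-mixin approach (alternative 2)."""

    @abstractmethod
    def list_sessions(self, session_filter=None) -> Iterator[str]: ...


def discover(provider) -> list[str]:
    # Callers gate on the capability with an isinstance check instead of
    # catching NotImplementedError from a default implementation.
    if isinstance(provider, SessionDiscoverable):
        return list(provider.list_sessions())
    return []
```

The isinstance gate is more type-safe, but every capability combination needs its own mixin, which is the weight the proposal avoids.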


Labels: enhancement (New feature or request)