Releases: strands-agents/evals

v0.1.8

25 Feb 20:08
d335cb0

What's Changed

Full Changelog: v0.1.7...v0.1.8

v0.1.7

19 Feb 17:46
d307f40

What's Changed

  • fix: retrieve multiple text contentBlocks in messageContent by @poshinchen in #133
  • feat(workflows): add conventional commit workflow in PR by @mkmeral in #134
  • fix: add tool info to conciseness, harmfulness, helpfulness, and response relevance evaluators by @ybdarrenwang in #132
  • fix: update output variable name in workflow by @Unshure in #139
  • fix: update finalize condition for workflow execution by @Unshure in #142

New Contributors

Full Changelog: v0.1.6...v0.1.7

v0.1.6

11 Feb 20:20
fc30eb8

What's Changed

Full Changelog: v0.1.5...v0.1.6

v0.1.5

05 Feb 21:18
9be0604

Major Features

Response Relevance Evaluator - PR#112

The new ResponseRelevanceEvaluator measures how well an agent's response addresses the user's question. It uses a 5-level LLM-as-judge scoring system — Not At All (0.0), Not Generally (0.25), Neutral/Mixed (0.5), Generally Yes (0.75), and Completely Yes (1.0) — with a pass threshold at ≥0.5. Like other trace-level evaluators, it requires an actual_trajectory session and supports both sync and async evaluation.

from strands_evals.evaluators import ResponseRelevanceEvaluator

evaluator = ResponseRelevanceEvaluator()
results = evaluator.evaluate(evaluation_data)

# results[0].score     -> 1.0   (for COMPLETELY_YES)
# results[0].test_pass -> True  (score >= 0.5)
# results[0].reason    -> "The response directly answers the question."
# results[0].label     -> ResponseRelevanceScore.COMPLETELY_YES

Conciseness Evaluator - PR#115

The new ConcisenessEvaluator assesses whether an agent's response is appropriately concise. It uses a 3-level scoring system: Perfectly Concise (1.0) for responses that deliver exactly what was asked, Partially Concise (0.5) for minor extra wording, and Not Concise (0.0) for verbose or repetitive content. The pass threshold is ≥0.5. Both evaluators accept an optional custom model and system_prompt for the LLM judge.

from strands_evals.evaluators import ConcisenessEvaluator

evaluator = ConcisenessEvaluator()
results = evaluator.evaluate(evaluation_data)

# results[0].score     -> 0.0   (for NOT_CONCISE)
# results[0].test_pass -> False (score < 0.5)
# results[0].label     -> ConcisenessScore.NOT_CONCISE

Automatic Retry with Exponential Backoff for Throttled Evaluations - PR#107

Experiment.run_evaluations() and run_evaluations_async() now automatically retry on throttling errors using tenacity. Both task execution and evaluator execution are wrapped with exponential backoff (up to 6 attempts, 4s → 240s). Throttling is detected for ModelThrottledException, EventLoopException, and botocore ClientError with codes like ThrottlingException and TooManyRequestsException. Non-throttling errors are raised immediately without retrying. Additionally, one evaluator failing no longer prevents other evaluators from running on the same case.

from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator, ConcisenessEvaluator

experiment = Experiment(
    cases=[Case(name="test", input="What is 2+2?", expected_output="4")],
    evaluators=[
        OutputEvaluator(rubric="Is the answer correct?"),
        ConcisenessEvaluator(),
    ],
)

# Throttling errors are retried automatically with exponential backoff.
# If one evaluator fails, the other still runs.
reports = experiment.run_evaluations(my_agent)
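The backoff policy described above (up to 6 attempts, exponential waits starting at 4s and capped at 240s) can be sketched in plain Python. This helper is purely illustrative of the schedule; the library itself uses tenacity, and the exact wait parameters here are inferred from the description above:

```python
def backoff_schedule(max_attempts: int = 6, base: float = 4.0, cap: float = 240.0) -> list[float]:
    """Illustrative wait times between throttled attempts: 4s, 8s, 16s, ..., capped at `cap`.

    With max_attempts attempts there are max_attempts - 1 waits between them.
    """
    return [min(base * (2 ** i), cap) for i in range(max_attempts - 1)]

print(backoff_schedule())  # [4.0, 8.0, 16.0, 32.0, 64.0]
```

Note that with only 6 attempts the doubling sequence never reaches the 240s ceiling; the cap only kicks in if the attempt count is raised.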

Bug Fixes

  • Replace Deprecated Structured Output Methods - PR#67
    Updated OutputEvaluator to call the agent with the structured_output_model parameter instead of the removed structured_output() and structured_output_async() methods. Without this fix, OutputEvaluator would fail on recent Strands SDK versions.

Full Changelog: v0.1.4...v0.1.5

v0.1.4

29 Jan 01:53
5287ac1

What's Changed

  • fix: include tool executions in _extract_trace_level by @razkenari in #77

New Contributors

Full Changelog: v0.1.3...v0.1.4

v0.1.3

21 Jan 21:05
9c643b6

What's Changed

  • fix: Multiple Tool Usage Not Detected in tools_use_extractor.py by @bipro1992 in #80

New Contributors

Full Changelog: v0.1.2...v0.1.3

v0.1.2

13 Jan 23:18
47ca78d

What's Changed

  • fix: Isolate evaluator errors in run_evaluations by @afarntrog in #84
  • fix(extractors): Add null check for toolResult in message extraction by @afarntrog in #85

Full Changelog: v0.1.1...v0.1.2

v0.1.1

15 Dec 22:24
049298d

What's Changed

New Contributors

Full Changelog: v0.1.0...v0.1.1

v0.1.0

03 Dec 17:34
5112ed5

Strands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems.

Feature Overview

  • Multiple Evaluation Types: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation
  • LLM-as-a-Judge: Built-in evaluators using language models for sophisticated assessment with structured scoring
  • Trace-based Evaluation: Analyze agent behavior through OpenTelemetry execution traces
  • Automated Experiment Generation: Generate comprehensive test suites from context descriptions
  • Custom Evaluators: Extensible framework for domain-specific evaluation logic
  • Experiment Management: Save, load, and version your evaluation experiments with JSON serialization
  • Built-in Scoring Tools: Helper functions for exact, in-order, and any-order trajectory matching
  • Simulators: Enable multi-turn evaluation of conversational agents by generating realistic user interactions that adapt to the agent's responses
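The three trajectory-matching semantics listed above (exact, in-order, and any-order) can be illustrated with plain Python. These are simplified stand-ins for the idea, not the library's actual scoring helpers:

```python
def exact_match(expected: list[str], actual: list[str]) -> bool:
    """Every step matches, in the same position, with nothing extra."""
    return expected == actual

def in_order_match(expected: list[str], actual: list[str]) -> bool:
    """All expected steps appear in actual, in order; gaps are allowed."""
    it = iter(actual)  # each `in` consumes the iterator, enforcing order
    return all(step in it for step in expected)

def any_order_match(expected: list[str], actual: list[str]) -> bool:
    """All expected steps appear somewhere in actual, in any order."""
    return all(step in actual for step in expected)

trajectory = ["search", "fetch", "summarize"]
print(exact_match(["search", "fetch", "summarize"], trajectory))  # True
print(in_order_match(["search", "summarize"], trajectory))        # True
print(any_order_match(["summarize", "search"], trajectory))       # True
```

The in-order variant is a subsequence check: reusing the same iterator across membership tests means later expected steps can only match positions after earlier ones.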