
feat: add OTel test semantic convention attributes to Experiment spans #131

Draft
anirudha wants to merge 1 commit into strands-agents:main from anirudha:feat/otel-test-semantic-conventions

Conversation


@anirudha anirudha commented Feb 10, 2026

Description

Adds test.* span attributes from the OTel test semantic conventions proposal to existing Experiment evaluation spans. All changes are strictly additive: no wrapper spans, no OTel events, all existing gen_ai.evaluation.* attributes preserved unchanged.

Related: open-telemetry/semantic-conventions#3398

What changed

src/strands_evals/experiment.py (+34 lines)

Change details:

  • name parameter: Optional str on __init__, defaults to "unnamed_experiment". Stored as self._name, exposed via @property.
  • to_dict / from_dict: serializes and restores name. from_dict falls back to "unnamed_experiment" for legacy dicts without the key.
  • import uuid: added at module top.
  • run_evaluations (sync): generates run_id = str(uuid.uuid4()) at method start. Adds test.suite.name, test.suite.run.id, test.case.name, test.case.id to the eval_case span's initial attributes. Adds test.case.result.status to each evaluator span's set_attributes.
  • run_evaluations_async: same run_id generation. Passes run_id to _worker.
  • _worker (async): new run_id parameter. Adds the same test.* attributes to execute_case and evaluator spans.

No new classes, modules, or architectural changes. The diff is ~34 lines of production code.
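
The name handling described above can be sketched with a minimal stand-in (the real Experiment takes more parameters; this shows only the additive pieces: the default, the property, and the from_dict fallback):

```python
# Hypothetical stand-in for the described Experiment changes; not the
# actual strands_evals source.
class Experiment:
    def __init__(self, name=None):
        # name defaults to "unnamed_experiment" when not provided.
        self._name = name if name is not None else "unnamed_experiment"

    @property
    def name(self):
        return self._name

    def to_dict(self):
        return {"name": self._name}

    @classmethod
    def from_dict(cls, data):
        # Legacy dicts predating this change have no "name" key.
        return cls(name=data.get("name", "unnamed_experiment"))

assert Experiment().name == "unnamed_experiment"
assert Experiment.from_dict({}).name == "unnamed_experiment"
assert Experiment.from_dict(Experiment(name="exp-1").to_dict()).name == "exp-1"
```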

pyproject.toml (+3 lines)

Added hypothesis>=6.0.0,<7.0.0 to three dependency sections:

  • [project.optional-dependencies] test
  • [tool.hatch.envs.hatch-test] dependencies
  • [tool.hatch.envs.default] dependencies
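
The three sections would look roughly like this (surrounding entries elided; only the version range is taken from the description):

```toml
[project.optional-dependencies]
test = [
    "hypothesis>=6.0.0,<7.0.0",
]

[tool.hatch.envs.hatch-test]
dependencies = [
    "hypothesis>=6.0.0,<7.0.0",
]

[tool.hatch.envs.default]
dependencies = [
    "hypothesis>=6.0.0,<7.0.0",
]
```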

tests/strands_evals/test_experiment.py (+11 lines)

Updated 11 existing to_dict test assertions to include the new "name": "unnamed_experiment" key in expected dictionaries. No test logic changed.

tests/strands_evals/test_experiment_otel_conventions.py (new, 741 lines)

6 property-based tests (hypothesis, 100 iterations each) + 10 unit tests:

  • Property 1: Experiment(name=s).name == s for any string
  • Property 2: to_dict / from_dict round-trip preserves name
  • Property 3 (sync): eval_case spans have all 4 test.* attributes with correct values
  • Property 3 (async): execute_case spans have all 4 test.* attributes with correct values
  • Property 4: evaluator span test.case.result.status matches the aggregate_pass boolean
  • Property 5: existing gen_ai.evaluation.* attributes preserved on async spans
  • Property 6: experiments with/without name produce identical evaluation reports
  • Unit (default name): "unnamed_experiment" when name is not provided
  • Unit (legacy from_dict): dict without a name key defaults correctly
  • Unit (no wrapper span): no test_suite_run span created (sync + async)
  • Unit (no add_event): no OTel events used for test.* data (sync + async)
  • Unit (run_id format): test.suite.run.id is a valid UUID4 (sync + async)
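
Two of these checks can be sketched standalone (stand-ins for the real tests, which inspect captured span attributes rather than these helpers):

```python
import uuid

def result_status(aggregate_pass: bool) -> str:
    # Property 4: test.case.result.status mirrors the aggregate_pass boolean.
    return "pass" if aggregate_pass else "fail"

def new_run_id() -> str:
    # Unit (run_id format): the PR generates run_id = str(uuid.uuid4()).
    return str(uuid.uuid4())

assert result_status(True) == "pass"
assert result_status(False) == "fail"
rid = new_run_id()
assert uuid.UUID(rid).version == 4            # parses as a UUID, version 4
assert len(rid) == 36 and rid.count("-") == 4  # canonical hyphenated form
```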

src/strands_evals/evaluators/coherence_evaluator.py (whitespace only)

Trailing-whitespace cleanup on two docstring lines, likely left by a formatter.

Design decisions worth reviewing

  1. No wrapper span: run_id is a flat attribute on each case span rather than derived from a parent test_suite_run span. This preserves the flat trace structure that backends like Langfuse/Jaeger expect for session_id-based grouping (important for ActorSimulator multi-turn conversations).

  2. Span attributes, not events: All test.* metadata uses span.set_attributes(). Maximizes backend compatibility since not all backends support event attributes.

  3. run_id per invocation, not per instance: Each call to run_evaluations/run_evaluations_async gets a fresh UUID4. Concurrent calls on the same Experiment instance get distinct IDs.

  4. Backward compatibility: name defaults to "unnamed_experiment", from_dict handles missing key gracefully. Constructor accepts all previously valid argument combinations.
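
Decision 3 in a nutshell (a hypothetical stub; the real method also drives cases and evaluators):

```python
import uuid

def run_evaluations_stub() -> str:
    # A fresh UUID4 is generated per invocation, so concurrent calls on the
    # same Experiment instance never share a run_id.
    return str(uuid.uuid4())

assert run_evaluations_stub() != run_evaluations_stub()
```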

Span attribute schema

eval_case <name>  /  execute_case <name>
├── test.suite.name      = experiment.name
├── test.suite.run.id    = UUID4 (unique per run_evaluations call)
├── test.case.name       = case.name
├── test.case.id         = case.session_id
├── gen_ai.evaluation.*  = (unchanged)
│
└── evaluator <Name>
    ├── test.case.result.status = "pass" | "fail"
    └── gen_ai.evaluation.*     = (unchanged)
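
The schema above maps onto set_attributes calls like the following (a stand-in span, not a real OTel span; the case name and session id values are assumed for illustration):

```python
import uuid

class FakeSpan:
    """Stand-in for an OTel span; only records set_attributes calls."""
    def __init__(self):
        self.attributes = {}

    def set_attributes(self, attrs):
        self.attributes.update(attrs)

run_id = str(uuid.uuid4())
case_span = FakeSpan()
case_span.set_attributes({
    "test.suite.name": "unnamed_experiment",  # experiment.name
    "test.suite.run.id": run_id,              # per-call UUID4
    "test.case.name": "refund-flow",          # case.name (assumed value)
    "test.case.id": "session-123",            # case.session_id (assumed value)
})

evaluator_span = FakeSpan()
evaluator_span.set_attributes({"test.case.result.status": "pass"})

assert set(case_span.attributes) == {
    "test.suite.name", "test.suite.run.id", "test.case.name", "test.case.id",
}
assert evaluator_span.attributes["test.case.result.status"] in ("pass", "fail")
```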

How to verify

hatch test tests/ -vv

All 400 tests pass (69 existing + 17 new OTel convention tests + 314 other project tests).

@anirudha anirudha marked this pull request as draft February 10, 2026 15:21

@poshinchen poshinchen left a comment


Now I remember why I reverted this grouping logic...

In strands-evals we have the ActorSimulator, which lets users execute multi-turn conversations for evaluations.

With these changes, the multi-turn conversation will be wrapped together into a single case span. But in general those are different requests, and that contradicts what other OTel-supported backends (Langfuse, Jaeger, ...) do. Users should group the traces based on the session_id instead of wrapping all the executions into one span.

I had a long discussion with @jjbuck and this was the final decision that we made.

We can debate again whether to group the whole multi-turn conversation into a single span. But to start simple, I'm fine with having those test.* and similar attributes. I think they are from the OTel Test Attributes?


@poshinchen poshinchen left a comment


And I also intentionally made them span_attributes instead of event_attributes, because some backends do not support events.

And from the documentation, it seems like they can be both span_attributes and event_attributes.

Add test.* span attributes from the OTel semantic conventions proposal to
existing evaluation spans, improving observability and interoperability
with OTel-compatible backends.

Changes:
- Add optional 'name' parameter to Experiment (default: 'unnamed_experiment')
  with serialization round-trip support in to_dict/from_dict/from_file
- Add test.suite.name, test.suite.run.id, test.case.name, test.case.id
  attributes to eval_case (sync) and execute_case (async) spans
- Add test.case.result.status ('pass'/'fail') to evaluator spans
- Generate unique UUID4 run_id per run_evaluations/run_evaluations_async call
- Add hypothesis to test dependencies in pyproject.toml
- Add property-based tests (Properties 1-6) and unit tests for all new
  functionality, backward compatibility, and edge cases

All changes are additive - no wrapper spans introduced, no OTel events used,
all existing gen_ai.evaluation.* attributes preserved unchanged.
@anirudha anirudha force-pushed the feat/otel-test-semantic-conventions branch from 82d2ff7 to 5787c93 on February 11, 2026 15:18
@anirudha anirudha changed the title feat: align Experiment telemetry with OTel test semantic conventions feat: add OTel test semantic convention attributes to Experiment spans Feb 11, 2026
queue: Queue containing cases to process
task: Task function to run on each case
results: List to store results
run_id: Unique identifier for this evaluation run
Contributor


Isn't this the same as session.id?

"""
queue: asyncio.Queue[Case[InputT, OutputT]] = asyncio.Queue()
results: list[Any] = []
run_id = str(uuid.uuid4())
Contributor


same here


@poshinchen poshinchen left a comment


Do you really need test_experiment_otel_conventions.py?

It basically mocks and then verifies the mock spans. Dropping it would also reduce the need for the hypothesis dependency.


anirudha commented Feb 18, 2026

Out on vacation; I'll address the comments next week.

