
feat: add OTel test semantic convention attributes to Experiment spans #131

Draft
anirudha wants to merge 1 commit into strands-agents:main from anirudha:feat/otel-test-semantic-conventions

Conversation


@anirudha anirudha commented Feb 10, 2026

Description

Adds test.* span attributes from the OTel test semantic conventions proposal to existing Experiment evaluation spans. All changes are strictly additive: no wrapper spans, no OTel events, all existing gen_ai.evaluation.* attributes preserved unchanged.

Related: open-telemetry/semantic-conventions#3398

What changed

src/strands_evals/experiment.py (+34 lines)

Change details:

  • name parameter: Optional str on __init__, defaults to "unnamed_experiment". Stored as self._name, exposed via @property.
  • to_dict / from_dict: serializes and restores name. from_dict falls back to "unnamed_experiment" for legacy dicts without the key.
  • import uuid: added at module top.
  • run_evaluations (sync): generates run_id = str(uuid.uuid4()) at method start. Adds test.suite.name, test.suite.run.id, test.case.name, test.case.id to the eval_case span's initial attributes. Adds test.case.result.status to each evaluator span's set_attributes.
  • run_evaluations_async: same run_id generation. Passes run_id to _worker.
  • _worker (async): new run_id parameter. Adds the same test.* attributes to execute_case and evaluator spans.

No new classes, modules, or architectural changes. The diff is ~34 lines of production code.
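
The name handling described above can be sketched with a minimal stand-in (the real Experiment takes more parameters; this shows only the additive pieces: the default, the property, and the from_dict fallback):

```python
# Hypothetical stand-in for the described Experiment changes; not the
# actual strands_evals source.
class Experiment:
    def __init__(self, name=None):
        # name defaults to "unnamed_experiment" when not provided.
        self._name = name if name is not None else "unnamed_experiment"

    @property
    def name(self):
        return self._name

    def to_dict(self):
        return {"name": self._name}

    @classmethod
    def from_dict(cls, data):
        # Legacy dicts predating this change have no "name" key.
        return cls(name=data.get("name", "unnamed_experiment"))

assert Experiment().name == "unnamed_experiment"
assert Experiment.from_dict({}).name == "unnamed_experiment"
assert Experiment.from_dict(Experiment(name="exp-1").to_dict()).name == "exp-1"
```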

pyproject.toml (+3 lines)

Added hypothesis>=6.0.0,<7.0.0 to three dependency sections:

  • [project.optional-dependencies] test
  • [tool.hatch.envs.hatch-test] dependencies
  • [tool.hatch.envs.default] dependencies
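
The three sections would look roughly like this (surrounding entries elided; only the version range is taken from the description):

```toml
[project.optional-dependencies]
test = [
    "hypothesis>=6.0.0,<7.0.0",
]

[tool.hatch.envs.hatch-test]
dependencies = [
    "hypothesis>=6.0.0,<7.0.0",
]

[tool.hatch.envs.default]
dependencies = [
    "hypothesis>=6.0.0,<7.0.0",
]
```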

tests/strands_evals/test_experiment.py (+11 lines)

Updated 11 existing to_dict test assertions to include the new "name": "unnamed_experiment" key in expected dictionaries. No test logic changed.

tests/strands_evals/test_experiment_otel_conventions.py (new, 741 lines)

6 property-based tests (hypothesis, 100 iterations each) + 10 unit tests:

  • Property 1: Experiment(name=s).name == s for any string
  • Property 2: to_dict / from_dict round-trip preserves name
  • Property 3 (sync): eval_case spans have all 4 test.* attributes with correct values
  • Property 3 (async): execute_case spans have all 4 test.* attributes with correct values
  • Property 4: evaluator span test.case.result.status matches the aggregate_pass boolean
  • Property 5: existing gen_ai.evaluation.* attributes preserved on async spans
  • Property 6: experiments with/without name produce identical evaluation reports
  • Unit (default name): "unnamed_experiment" when name is not provided
  • Unit (legacy from_dict): dict without a name key defaults correctly
  • Unit (no wrapper span): no test_suite_run span created (sync + async)
  • Unit (no add_event): no OTel events used for test.* data (sync + async)
  • Unit (run_id format): test.suite.run.id is a valid UUID4 (sync + async)
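
Two of these checks can be sketched standalone (stand-ins for the real tests, which inspect captured span attributes rather than these helpers):

```python
import uuid

def result_status(aggregate_pass: bool) -> str:
    # Property 4: test.case.result.status mirrors the aggregate_pass boolean.
    return "pass" if aggregate_pass else "fail"

def new_run_id() -> str:
    # Unit (run_id format): the PR generates run_id = str(uuid.uuid4()).
    return str(uuid.uuid4())

assert result_status(True) == "pass"
assert result_status(False) == "fail"
rid = new_run_id()
assert uuid.UUID(rid).version == 4            # parses as a UUID, version 4
assert len(rid) == 36 and rid.count("-") == 4  # canonical hyphenated form
```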

src/strands_evals/evaluators/coherence_evaluator.py (whitespace only)

Trailing-whitespace cleanup on two docstring lines, likely left by a formatter.

Design decisions worth reviewing

  1. No wrapper span: run_id is a flat attribute on each case span rather than derived from a parent test_suite_run span. This preserves the flat trace structure that backends like Langfuse/Jaeger expect for session_id-based grouping (important for ActorSimulator multi-turn conversations).

  2. Span attributes, not events: All test.* metadata uses span.set_attributes(). Maximizes backend compatibility since not all backends support event attributes.

  3. run_id per invocation, not per instance: Each call to run_evaluations/run_evaluations_async gets a fresh UUID4. Concurrent calls on the same Experiment instance get distinct IDs.

  4. Backward compatibility: name defaults to "unnamed_experiment", from_dict handles missing key gracefully. Constructor accepts all previously valid argument combinations.
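
Decision 3 in a nutshell (a hypothetical stub; the real method also drives cases and evaluators):

```python
import uuid

def run_evaluations_stub() -> str:
    # A fresh UUID4 is generated per invocation, so concurrent calls on the
    # same Experiment instance never share a run_id.
    return str(uuid.uuid4())

assert run_evaluations_stub() != run_evaluations_stub()
```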

Span attribute schema

eval_case <name>  /  execute_case <name>
├── test.suite.name      = experiment.name
├── test.suite.run.id    = UUID4 (unique per run_evaluations call)
├── test.case.name       = case.name
├── test.case.id         = case.session_id
├── gen_ai.evaluation.*  = (unchanged)
│
└── evaluator <Name>
    ├── test.case.result.status = "pass" | "fail"
    └── gen_ai.evaluation.*     = (unchanged)
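
The schema above maps onto set_attributes calls like the following (a stand-in span, not a real OTel span; the case name and session id values are assumed for illustration):

```python
import uuid

class FakeSpan:
    """Stand-in for an OTel span; only records set_attributes calls."""
    def __init__(self):
        self.attributes = {}

    def set_attributes(self, attrs):
        self.attributes.update(attrs)

run_id = str(uuid.uuid4())
case_span = FakeSpan()
case_span.set_attributes({
    "test.suite.name": "unnamed_experiment",  # experiment.name
    "test.suite.run.id": run_id,              # per-call UUID4
    "test.case.name": "refund-flow",          # case.name (assumed value)
    "test.case.id": "session-123",            # case.session_id (assumed value)
})

evaluator_span = FakeSpan()
evaluator_span.set_attributes({"test.case.result.status": "pass"})

assert set(case_span.attributes) == {
    "test.suite.name", "test.suite.run.id", "test.case.name", "test.case.id",
}
assert evaluator_span.attributes["test.case.result.status"] in ("pass", "fail")
```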

How to verify

hatch test tests/ -vv

All 400 tests pass (69 existing + 17 new OTel convention tests + 314 other project tests).

@anirudha anirudha marked this pull request as draft February 10, 2026 15:21

@poshinchen poshinchen left a comment


Now I remember why I reverted this grouping logic...

In strands-evals we have the ActorSimulator, which lets users execute multi-turn conversations for evaluations.

With these changes, the multi-turn conversation will be wrapped together into a single case span. But in general those are different requests, and that contradicts what other OTel-supported backends (Langfuse, Jaeger, ...) do. Users should group the traces based on the session_id instead of wrapping all the executions into one span.

I had a long discussion with @jjbuck and this was the final decision that we made.

We can debate again whether to group the whole multi-turn conversation into a single span. But to start simple, I'm fine with having those test.* and similar attributes. I think they are from the OTel Test Attributes?


@poshinchen poshinchen left a comment


And I also intentionally made them span_attributes instead of event_attributes, because some backends do not support events.

And from the documentation, it seems like they can be both span_attributes and event_attributes.

Add test.* span attributes from the OTel semantic conventions proposal to
existing evaluation spans, improving observability and interoperability
with OTel-compatible backends.

Changes:
- Add optional 'name' parameter to Experiment (default: 'unnamed_experiment')
  with serialization round-trip support in to_dict/from_dict/from_file
- Add test.suite.name, test.suite.run.id, test.case.name, test.case.id
  attributes to eval_case (sync) and execute_case (async) spans
- Add test.case.result.status ('pass'/'fail') to evaluator spans
- Generate unique UUID4 run_id per run_evaluations/run_evaluations_async call
- Add hypothesis to test dependencies in pyproject.toml
- Add property-based tests (Properties 1-6) and unit tests for all new
  functionality, backward compatibility, and edge cases

All changes are additive - no wrapper spans introduced, no OTel events used,
all existing gen_ai.evaluation.* attributes preserved unchanged.
@anirudha anirudha force-pushed the feat/otel-test-semantic-conventions branch from 82d2ff7 to 5787c93 on February 11, 2026 15:18
@anirudha anirudha changed the title feat: align Experiment telemetry with OTel test semantic conventions feat: add OTel test semantic convention attributes to Experiment spans Feb 11, 2026
queue: Queue containing cases to process
task: Task function to run on each case
results: List to store results
run_id: Unique identifier for this evaluation run
Contributor


Isn't this the same as session.id?

"""
queue: asyncio.Queue[Case[InputT, OutputT]] = asyncio.Queue()
results: list[Any] = []
run_id = str(uuid.uuid4())
Contributor


same here


@poshinchen poshinchen left a comment


Do you really need test_experiment_otel_conventions.py?

It basically mocks and then verifies the mock spans. Dropping it would also reduce the need for the hypothesis dependency.


anirudha commented Feb 18, 2026

Out on vacation; I'll address the comments next week.

