feat: add OTel test semantic convention attributes to Experiment spans #131

anirudha wants to merge 1 commit into strands-agents:main
Conversation
poshinchen left a comment

Now I remember why I reverted this grouping logic...
In strands-evals we have the `ActorSimulator`, which allows users to execute multi-turn conversations for evaluations.
With the current changes, the multi-turn conversation would be wrapped into a single case span. But in general those are different requests, and that contradicts what other OTel-supported backends (Langfuse / Jaeger...) do. Users should group the traces based on the `session_id` instead of wrapping all the executions into one span.
I had a long discussion with @jjbuck and this was the final decision that we made.
We can debate again whether to group all of the multi-turn conversation into a single span. But to start simple, I'm fine with having those `test.*` and similar as attributes. I think they are from the OTel Test Attributes?
poshinchen left a comment
And I also intentionally made them span_attributes instead of event_attributes, because some backends do not support events.
And from the documentation, it seems like they can be both span_attributes and event_attributes.
Add test.* span attributes from the OTel semantic conventions proposal to
existing evaluation spans, improving observability and interoperability
with OTel-compatible backends.
Changes:
- Add optional 'name' parameter to Experiment (default: 'unnamed_experiment')
with serialization round-trip support in to_dict/from_dict/from_file
- Add test.suite.name, test.suite.run.id, test.case.name, test.case.id
attributes to eval_case (sync) and execute_case (async) spans
- Add test.case.result.status ('pass'/'fail') to evaluator spans
- Generate unique UUID4 run_id per run_evaluations/run_evaluations_async call
- Add hypothesis to test dependencies in pyproject.toml
- Add property-based tests (Properties 1-6) and unit tests for all new
functionality, backward compatibility, and edge cases
All changes are additive - no wrapper spans introduced, no OTel events used,
all existing gen_ai.evaluation.* attributes preserved unchanged.
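The two behavioral additions above — a fresh UUID4 run id per `run_evaluations` call, and a pass/fail status string on evaluator spans — can be sketched as follows. These helpers are illustrative stand-ins, not the PR's actual code:

```python
import uuid

# Hypothetical helper: one fresh UUID4 per run_evaluations /
# run_evaluations_async invocation, shared by every case span in that run.
def new_run_id() -> str:
    return str(uuid.uuid4())

# Hypothetical helper: test.case.result.status is "pass" when the
# aggregate_pass boolean is True, "fail" otherwise.
def result_status(aggregate_pass: bool) -> str:
    return "pass" if aggregate_pass else "fail"

# Two invocations yield distinct run ids.
rid_a, rid_b = new_run_id(), new_run_id()
```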
82d2ff7 to 5787c93
```python
        queue: Queue containing cases to process
        task: Task function to run on each case
        results: List to store results
        run_id: Unique identifier for this evaluation run
```

Isn't this the same as `session.id`?

```python
        """
        queue: asyncio.Queue[Case[InputT, OutputT]] = asyncio.Queue()
        results: list[Any] = []
        run_id = str(uuid.uuid4())
```
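The queue/worker wiring in the diff above can be sketched as a minimal runnable pattern. The shapes here (plain strings for cases, dict results) are assumptions for illustration, not the real `strands_evals` signatures; the point is that one `run_id` is generated per call and handed to every worker:

```python
import asyncio
import uuid

async def _worker(queue: asyncio.Queue, results: list, run_id: str) -> None:
    """Drain the queue; every processed case shares the caller's run_id."""
    while True:
        try:
            case = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        # The real code would open an execute_case span here and set the
        # test.* attributes, including test.suite.run.id = run_id.
        results.append({"case": case, "test.suite.run.id": run_id})
        queue.task_done()

async def run_evaluations_async(cases: list) -> list:
    run_id = str(uuid.uuid4())  # unique per invocation, not per instance
    queue: asyncio.Queue = asyncio.Queue()
    for case in cases:
        queue.put_nowait(case)
    results: list = []
    await asyncio.gather(*(_worker(queue, results, run_id) for _ in range(2)))
    return results

results = asyncio.run(run_evaluations_async(["case_a", "case_b", "case_c"]))
```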
poshinchen left a comment

Do you really need test_experiment_otel_conventions.py? It's basically mocks verifying mock spans. Removing it would also reduce the need for the hypothesis dependency.
Out on vacation, I'll address the comments next week.
Description
Adds `test.*` span attributes from the OTel test semantic conventions proposal to existing `Experiment` evaluation spans. All changes are strictly additive: no wrapper spans, no OTel events, all existing `gen_ai.evaluation.*` attributes preserved unchanged.

Related: open-telemetry/semantic-conventions#3398
What changed
`src/strands_evals/experiment.py` (+34 lines)

- `name` parameter: `str` on `__init__`, defaults to `"unnamed_experiment"`. Stored as `self._name`, exposed via `@property`.
- `to_dict`/`from_dict` serialize `name`. `from_dict` falls back to `"unnamed_experiment"` for legacy dicts without the key.
- `import uuid`.
- `run_evaluations` (sync): `run_id = str(uuid.uuid4())` at method start. Adds `test.suite.name`, `test.suite.run.id`, `test.case.name`, `test.case.id` to `eval_case` span initial attributes. Adds `test.case.result.status` to each evaluator span's `set_attributes`.
- `run_evaluations_async`: same `run_id` generation. Passes `run_id` to `_worker`.
- `_worker` (async): new `run_id` parameter. Adds the same `test.*` attributes to `execute_case` and evaluator spans.

No new classes, modules, or architectural changes. The diff is ~34 lines of production code.
`pyproject.toml` (+3 lines)

Added `hypothesis>=6.0.0,<7.0.0` to three dependency sections:

- `[project.optional-dependencies] test`
- `[tool.hatch.envs.hatch-test] dependencies`
- `[tool.hatch.envs.default] dependencies`

`tests/strands_evals/test_experiment.py` (+11 lines)

Updated 11 existing `to_dict` test assertions to include the new `"name": "unnamed_experiment"` key in expected dictionaries. No test logic changed.

`tests/strands_evals/test_experiment_otel_conventions.py` (new, 741 lines)

6 property-based tests (hypothesis, 100 iterations each) + 10 unit tests:

- `Experiment(name=s).name == s` for any string
- `to_dict` → `from_dict` round-trip preserves name
- `eval_case` spans have all 4 `test.*` attributes with correct values
- `execute_case` spans have all 4 `test.*` attributes with correct values
- `test.case.result.status` matches the `aggregate_pass` boolean
- `gen_ai.evaluation.*` attributes preserved on async spans
- Experiments differing only in `name` produce identical evaluation reports
- `name` defaults to `"unnamed_experiment"` when not provided
- `from_dict` with a missing `name` key defaults correctly
- No `test_suite_run` span created (sync + async)
- Spans include the `test.*` data (sync + async)
- `test.suite.run.id` is a valid UUID4 (sync + async)

`src/strands_evals/evaluators/coherence_evaluator.py` (whitespace only)

Trailing whitespace cleanup on 2 docstring lines, likely from a formatter.
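The round-trip and default-name behaviors under test can be illustrated with a minimal stand-in. This `Experiment` class is a hypothetical reduction for the example, not the real `strands_evals` class:

```python
class Experiment:
    """Minimal stand-in showing the name serialization behavior described above."""

    def __init__(self, name: str = "unnamed_experiment"):
        self._name = name

    @property
    def name(self) -> str:
        return self._name

    def to_dict(self) -> dict:
        # Serialization now includes the "name" key.
        return {"name": self._name}

    @classmethod
    def from_dict(cls, data: dict) -> "Experiment":
        # Legacy dicts without "name" fall back to the default.
        return cls(name=data.get("name", "unnamed_experiment"))

exp = Experiment(name="my_suite")
```

A round trip through `to_dict`/`from_dict` preserves the name, while a legacy dict with no `name` key yields `"unnamed_experiment"`.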
Design decisions worth reviewing

- No wrapper span: `run_id` is a flat attribute on each case span rather than derived from a parent `test_suite_run` span. This preserves the flat trace structure that backends like Langfuse/Jaeger expect for `session_id`-based grouping (important for `ActorSimulator` multi-turn conversations).
- Span attributes, not events: all `test.*` metadata uses `span.set_attributes()`. This maximizes backend compatibility, since not all backends support event attributes.
- `run_id` per invocation, not per instance: each call to `run_evaluations`/`run_evaluations_async` gets a fresh UUID4. Concurrent calls on the same `Experiment` instance get distinct IDs.
- Backward compatibility: `name` defaults to `"unnamed_experiment"`, and `from_dict` handles a missing key gracefully. The constructor accepts all previously valid argument combinations.

Span attribute schema
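The attribute names below are the ones this PR adds; the values are illustrative placeholders only, shown as the flat dicts a backend would receive:

```python
import uuid

# Attributes set on each case span (eval_case / execute_case).
# Names come from the OTel test semantic conventions proposal;
# all example values here are made up for illustration.
case_span_attributes = {
    "test.suite.name": "my_experiment",      # Experiment.name
    "test.suite.run.id": str(uuid.uuid4()),  # fresh UUID4 per run_evaluations call
    "test.case.name": "case_0",
    "test.case.id": "case-0",
}

# Attribute set on each evaluator span.
evaluator_span_attributes = {
    "test.case.result.status": "pass",       # "pass" or "fail", mirrors aggregate_pass
}
```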
How to verify

```
hatch test tests/ -vv
```

All 400 tests pass (69 existing + 17 new OTel convention tests + 314 other project tests).