Description
Checks
- I have updated to the latest minor and patch version of Strands and evals
- I have checked the documentation and this is not expected behavior
- I have searched ./issues and there are no duplicates of my issue
Strands Version
0.1.5
Strands Evals Version
1.24.0
Python Version
3.12.3
Operating System
Ubuntu 24.04.3 LTS
Installation Method
None
Steps to Reproduce
Minimal Repro Script
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

def task_fn(case):
    memory_exporter.clear()
    agent = Agent(
        trace_attributes={"session.id": case.session_id, "gen_ai.conversation.id": case.session_id},
        callback_handler=None,
    )
    result = agent([
        {"document": {"name": "invoice", "format": "pdf", "source": {"bytes": open("invoice.pdf", "rb").read()}}},
        {"text": "Analyze this document"},
    ])
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(memory_exporter.get_finished_spans(), session_id=case.session_id)
    # Bug: session.traces[0].spans[0].user_prompt will be "" because
    # the mapper does content_list[0].get("text", "") and index 0 is the document block
    print(f"user_prompt: '{session.traces[0].spans[0].user_prompt}'")
    return {"output": str(result), "trajectory": session}

experiment = Experiment(
    cases=[Case(input="Analyze this document")],
    evaluators=[GoalSuccessRateEvaluator()],
)
reports = experiment.run_evaluations(task=task_fn)
reports[0].run_display()
Expected Behavior
A multimodal prompt (or any input with multiple content blocks) should be visible to the evaluator. Without that, common enterprise use cases like agentic document processing cannot be tested effectively: the evaluators don't see what the model saw, so they can't meaningfully evaluate Faithfulness, Goal Success, etc.
Actual Behavior
The helper _convert_agent_invocation_span in strands_evals/mappers/strands_in_memory_session_mapper.py, which is invoked for each test case to convert the OTEL spans before the evaluators run, uses the following logic:
for event in span.events:
    try:
        event_attributes = event.attributes
        if not event_attributes:
            continue
        if event.name == "gen_ai.user.message":
            content_list = self._parse_json_attr(event_attributes, "content")
            user_prompt = content_list[0].get("text", "") if content_list else ""
        elif event.name == "gen_ai.choice":
            msg = event_attributes.get("message", "") if event_attributes else ""
            agent_response = str(msg)
    except Exception as e:
        logger.warning(f"Failed to process agent event {event.name}: {e}")
The user_prompt extraction here explicitly targets the first content block in the message, and only if that block is text. So a message that starts with a document or an image yields an empty prompt, and a message with multiple text content blocks is truncated to the first one.
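To make the failure concrete, here is a minimal standalone sketch (plain dicts mirroring the Bedrock content-block shapes from the repro above; no Strands imports) of what the first-block lookup actually returns:

```python
# Content list for a multimodal user message: the document block comes first.
content_list = [
    {"document": {"name": "invoice", "format": "pdf", "source": {"bytes": b"%PDF..."}}},
    {"text": "Analyze this document"},
]

# The mapper's current extraction: first block only, and only its "text" key.
user_prompt = content_list[0].get("text", "") if content_list else ""
print(repr(user_prompt))  # '' -- the document block has no "text" key

# Even an all-text message loses every block after the first.
multi_text = [{"text": "Part one."}, {"text": "Part two."}]
print(repr(multi_text[0].get("text", "")))  # 'Part one.' -- "Part two." is dropped
```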
This is either a bug or an undocumented limitation. It is also inconsistent with other logic in the same file. Earlier, on line 171, _process_user_message takes the somewhat better approach:
def _process_user_message(self, content_list: list[dict[str, Any]]) -> list[TextContent | ToolResultContent]:
    return [TextContent(text=item["text"]) for item in content_list if "text" in item]
which effectively reduces all of the text content blocks in the message into a list (while, oddly, still declaring ToolResultContent in its return type), but still silently drops non-text blocks.
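For contrast, a runnable sketch of that comprehension's behavior (TextContent is replaced by a minimal dataclass stand-in here, since the real strands_evals type isn't imported):

```python
from dataclasses import dataclass
from typing import Any

# Minimal stand-in for strands_evals' TextContent (hypothetical shape).
@dataclass
class TextContent:
    text: str

def process_user_message(content_list: list[dict[str, Any]]) -> list[TextContent]:
    # Same comprehension as the mapper's _process_user_message helper.
    return [TextContent(text=item["text"]) for item in content_list if "text" in item]

mixed = [
    {"document": {"name": "invoice", "format": "pdf"}},
    {"text": "Analyze this document"},
    {"text": "Summarize the totals"},
]
# Keeps both text blocks, but silently drops the document block.
print(process_user_message(mixed))
```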
Additional Context
No response
Possible Solution
There should be common logic for parsing the user prompt from spans, and it should support more than the simple case of a single text block.
def _reduce_content_blocks(self, content_list: list[dict[str, Any]]) -> list[TextContent | ToolResultContent]:
    """Reduce a list of Bedrock content blocks into TextContent entries.

    Handles text, document, image, and tool result blocks. Non-text content
    is represented as a descriptive placeholder so evaluators know it was present.
    """
    result: list[TextContent | ToolResultContent] = []
    for item in content_list:
        if "text" in item:
            result.append(TextContent(text=item["text"]))
        elif "document" in item:
            doc = item["document"]
            name = doc.get("name", "unknown")
            fmt = doc.get("format", "unknown")
            result.append(TextContent(text=f"[Document: {name}.{fmt}]"))
        elif "image" in item:
            img = item["image"]
            fmt = img.get("format", "unknown")
            result.append(TextContent(text=f"[Image: {fmt}]"))
        elif "toolResult" in item:
            tool_result = item["toolResult"]
            content = tool_result.get("content", [])
            text = content[0].get("text", "") if isinstance(content, list) and content else str(content)
            result.append(ToolResultContent(
                content=text,
                error=tool_result.get("error"),
                tool_call_id=tool_result.get("toolUseId"),
            ))
    return result
Related Issues
No response