
[BUG] eval functions _convert_agent_invocation_span and _process_user_message do not support multimodal input #130

@goDylanMorgan

Description

Checks

  • I have updated to the latest minor and patch version of Strands and strands-evals
  • I have checked the documentation and this is not expected behavior
  • I have searched the issues and there are no duplicates of my issue

Strands Version

0.1.5

Strands Evals Version

1.24.0

Python Version

3.12.3

Operating System

Ubuntu 24.04.3 LTS

Installation Method

None

Steps to Reproduce

Minimal Repro Script

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

def task_fn(case):
    memory_exporter.clear()
    agent = Agent(
        trace_attributes={"session.id": case.session_id, "gen_ai.conversation.id": case.session_id},
        callback_handler=None,
    )
    result = agent([
        {"document": {"name": "invoice", "format": "pdf", "source": {"bytes": open("invoice.pdf", "rb").read()}}},
        {"text": "Analyze this document"},
    ])
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(memory_exporter.get_finished_spans(), session_id=case.session_id)

    # Bug: session.traces[0].spans[0].user_prompt will be "" because
    # the mapper does content_list[0].get("text", "") and index 0 is the document block
    print(f"user_prompt: '{session.traces[0].spans[0].user_prompt}'")

    return {"output": str(result), "trajectory": session}

experiment = Experiment(
    cases=[Case(input="Analyze this document")],
    evaluators=[GoalSuccessRateEvaluator()],
)
reports = experiment.run_evaluations(task=task_fn)
reports[0].run_display()

Expected Behavior

A multimodal prompt (or any input with multiple content blocks) should be visible to the evaluator. Without that, common enterprise use cases like agentic document processing cannot be tested effectively: the evaluators don't see what the model saw, so they can't meaningfully evaluate Faithfulness, Goal Success, etc.

Actual Behavior

In strands_evals/mappers/strands_in_memory_session_mapper.py, the helper _convert_agent_invocation_span (invoked for each test case to convert the OTEL spans before the evaluators run) uses the following logic:

for event in span.events:
    try:
        event_attributes = event.attributes
        if not event_attributes:
            continue
        if event.name == "gen_ai.user.message":
            content_list = self._parse_json_attr(event_attributes, "content")
            user_prompt = content_list[0].get("text", "") if content_list else ""
        elif event.name == "gen_ai.choice":
            msg = event_attributes.get("message", "") if event_attributes else ""
            agent_response = str(msg)
    except Exception as e:
        logger.warning(f"Failed to process agent event {event.name}: {e}")

The user_prompt logic here reads only the first content block in the message, and only if that block is text. So a message that starts with a document or an image yields an empty user_prompt, and a message with multiple text blocks is silently truncated to the first one.
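Both failure modes can be reproduced with plain dicts, independent of Strands, by applying the same expression the mapper uses:

```python
# Stand-alone demonstration of the mapper's parsing failure modes,
# using plain dicts in place of real OTEL event attributes.

# Case 1: multimodal message -- document block first, text second.
multimodal = [
    {"document": {"name": "invoice", "format": "pdf"}},
    {"text": "Analyze this document"},
]
# Current mapper logic: only index 0, and only its "text" key.
user_prompt = multimodal[0].get("text", "") if multimodal else ""
print(repr(user_prompt))  # '' -- the text block at index 1 is never read

# Case 2: multiple text blocks -- everything after the first is dropped.
multi_text = [{"text": "Part one."}, {"text": "Part two."}]
user_prompt = multi_text[0].get("text", "") if multi_text else ""
print(repr(user_prompt))  # 'Part one.' -- 'Part two.' is silently lost
```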

This is either a bug or an undocumented limitation. It is also inconsistent with other logic in the same file: earlier, on line 171, _process_user_message takes the somewhat better approach:

    def _process_user_message(self, content_list: list[dict[str, Any]]) -> list[TextContent | ToolResultContent]:
        return [TextContent(text=item["text"]) for item in content_list if "text" in item]

which at least collects all text content blocks in the message into a list (while, oddly, declaring ToolResultContent in its return type without ever producing one).
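Even this better variant still drops non-text blocks silently. A minimal sketch, using a hypothetical dataclass stand-in for the strands_evals TextContent type:

```python
# Sketch of _process_user_message's filtering behavior. TextContent here is
# a hypothetical stand-in for the strands_evals type, not the real class.
from dataclasses import dataclass

@dataclass
class TextContent:
    text: str

def process_user_message(content_list):
    # Same comprehension as the mapper: keep only blocks with a "text" key.
    return [TextContent(text=item["text"]) for item in content_list if "text" in item]

content = [
    {"document": {"name": "invoice", "format": "pdf"}},
    {"text": "Analyze this document"},
]
processed = process_user_message(content)
print(processed)  # [TextContent(text='Analyze this document')] -- document gone
```

The evaluator sees the text but has no indication a document was ever attached.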

Additional Context

No response

Possible Solution

There should be common logic for user prompt span parsing, and it should support more than the simple case of single text input.

def _reduce_content_blocks(self, content_list: list[dict[str, Any]]) -> list[TextContent | ToolResultContent]:
    """Reduce a list of Bedrock content blocks into TextContent entries.

    Handles text, document, image, and tool result blocks. Non-text content
    is represented as a descriptive placeholder so evaluators know it was present.
    """
    result: list[TextContent | ToolResultContent] = []
    for item in content_list:
        if "text" in item:
            result.append(TextContent(text=item["text"]))
        elif "document" in item:
            doc = item["document"]
            name = doc.get("name", "unknown")
            fmt = doc.get("format", "unknown")
            result.append(TextContent(text=f"[Document: {name}.{fmt}]"))
        elif "image" in item:
            img = item["image"]
            fmt = img.get("format", "unknown")
            result.append(TextContent(text=f"[Image: {fmt}]"))
        elif "toolResult" in item:
            tool_result = item["toolResult"]
            content = tool_result.get("content", [])
            text = content[0].get("text", "") if isinstance(content, list) and content else str(content)
            result.append(ToolResultContent(
                content=text,
                error=tool_result.get("error"),
                tool_call_id=tool_result.get("toolUseId"),
            ))
    return result

Related Issues

No response

Labels

bug (Something isn't working)