
[Feature Request] Multimodal input data support #277

@moskomule

Description


Motivation

I propose extending flexeval to support Multimodal Language Models (MLMs), which accept multimodal data as input and generate text as output. A typical example is Vision Language Models (VLMs), which accept text+image inputs.
Since the primary difference between MLMs and standard LLMs lies in the input format, the necessary modifications are exclusively focused on input data handling.

Expected changes

Assuming MLMs are run separately and accessed via the OpenAI API schema, the main changes are required in the template-processing logic:

messages = [{"role": "user", "content": input_utterance}]

which expects input_utterance to be a str.
MLMs instead require a structured list of dicts, like:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "..."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
    ],
},
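As a side note, the base64 data URL shown above can be produced from a local image file with the standard library alone. A minimal sketch (the helper name `to_data_url` is hypothetical, not part of flexeval):

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    # Hypothetical helper: encode a local image file as a data URL,
    # as expected by OpenAI-style "image_url" content parts.
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"
```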

Thus, the proposed implementation details are:

  1. Allow input_utterance to be parsed into list[dict[str, Any]] using ast.literal_eval or json.loads
  2. (Optional, but recommended) Add support for passing preprocessing functions (e.g., image resizing) as arguments.
