
[Feature Request] Multimodal input data support #277

@moskomule

Description


Motivation

I propose extending flexeval to support Multimodal Language Models (MLMs), which accept multimodal data as input and generate text as output. A typical example is Vision Language Models (VLMs), which accept text+image inputs.
Since the primary difference between MLMs and standard LLMs lies in the input format, the necessary modifications are exclusively focused on input data handling.

Expected changes

Assuming MLMs are run separately and accessed via the OpenAI API schema, the main changes are required in the template-processing logic:

messages = [{"role": "user", "content": input_utterance}]

which expects input_utterance to be a str.
MLMs instead require a structured list of dicts, like:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "..."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
    ],
},
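As a side note, the base64 data URL shown above can be produced from a local image file with the standard library alone. A minimal sketch (the helper name `to_data_url` is hypothetical, not part of flexeval):

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    # Hypothetical helper: encode a local image file as a data URL,
    # as expected by OpenAI-style "image_url" content parts.
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"
```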

Thus, the proposed implementation details are:

  1. Allow input_utterance to be parsed into list[dict[str, Any]] using ast.literal_eval or json.loads
  2. (Optional, but recommended) Add support for passing preprocessing functions (e.g., image resizing) as arguments.
