Motivation
I propose to extend flexeval to support Multimodal Language Models (MLMs), which take multimodal data as input and generate text as output. A typical example is Vision-Language Models (VLMs), which accept text+image inputs.
Since the primary difference between MLMs and standard LLMs lies in the input format, the necessary modifications are exclusively focused on input data handling.
Expected changes
Assuming MLMs are run separately and accessed via the OpenAI API schema, the main changes are required in the template processing logic. Currently, the user message is built as:

```python
messages = [{"role": "user", "content": input_utterance}]
```

where `input_utterance` is expected to be a `str`.
MLMs instead require `content` to be a structured list of dicts, like:

```python
{
    "role": "user",
    "content": [
        {"type": "text", "text": "..."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}},
    ],
}
```

Thus, the proposed implementation details are:
- Allow `input_utterance` to be parsed into `list[dict[str, Any]]` using `ast.literal_eval` or `json.loads`.
- (Optional, but recommended) Add support for preprocessing functions, e.g., image resizing, passed as arguments.
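The parsing step in the first bullet could be sketched as follows. This is a minimal illustration, not flexeval's actual code; the helper name `parse_input_utterance` is hypothetical, and the fallback order (JSON first, then Python literal syntax) is one possible design choice.

```python
import ast
import json
from typing import Any, Union


def parse_input_utterance(input_utterance: str) -> Union[str, list[dict[str, Any]]]:
    """Parse input_utterance into a structured content list when possible.

    Hypothetical helper: tries json.loads first, then ast.literal_eval,
    and falls back to returning the raw string for plain-text prompts.
    """
    try:
        parsed = json.loads(input_utterance)
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(input_utterance)
        except (ValueError, SyntaxError):
            return input_utterance
    # Only accept the OpenAI-style content format: a list of dicts.
    if isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed):
        return parsed
    return input_utterance


# Plain text stays a string, so existing LLM behavior is unchanged.
messages = [{"role": "user", "content": parse_input_utterance("Hello!")}]

# A serialized content list becomes list[dict], ready for a VLM request.
multimodal = (
    '[{"type": "text", "text": "What is in this image?"},'
    ' {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}]'
)
messages = [{"role": "user", "content": parse_input_utterance(multimodal)}]
```

Keeping the plain-string fallback means existing text-only templates continue to work without any changes.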